Imagine you are trying to teach a robot to understand the world by showing it pictures and their corresponding descriptions. The goal is for the robot to learn that a picture of a "cat" and the word "cat" belong together, while a picture of a "dog" and the word "cat" do not.
In the world of AI, this is called Contrastive Learning. The robot uses a mathematical "scorecard" (a loss function) to check how well it's doing. If the score is high, it means the robot is confused. If the score is zero, the robot has perfectly learned the connections.
This paper, titled "Global Minimizers of Sigmoid Contrastive Loss," dives deep into a specific, modern way of scoring these connections (used by Google's SigLIP models) and explains why it works so well, even when the robot is dealing with billions of examples.
Here is the breakdown using simple analogies:
1. The Problem: The "Perfect Match" Myth
For a long time, researchers thought the best way to teach a robot was to make the picture of a cat and the word "cat" land in the exact same spot in the robot's memory. Imagine trying to glue two different objects (a photo and a word) onto the same pin.
However, in reality, pictures and words are fundamentally different. A photo is a grid of pixels; a word is a sequence of sounds or letters. Trying to glue them into the exact same spot is like trying to fit a square peg into a round hole. It creates a "Modality Gap"—a natural separation between the two types of data.
2. The Solution: The "Flexible Thermostat"
The paper focuses on a specific scoring method called Sigmoid Loss. Previous theories assumed the robot had fixed settings for how strict it should be.
The authors discovered that the secret sauce in modern models (like SigLIP) is that the robot is allowed to adjust its own "thermostat" and "bias" while it learns.
- Temperature (Thermostat): Controls how "strict" the robot is. A low temperature means it's very picky; a high temperature means it's more relaxed.
- Bias: A nudge that shifts the goalposts slightly.
By letting the robot tune these two knobs itself, it can find a "zero-error" state much more easily than if the knobs were fixed.
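The scorecard with its two knobs can be sketched in a few lines. Below is a minimal NumPy illustration of a sigmoid contrastive loss with a temperature t and bias b, following the publicly described SigLIP formulation; in real training these two values are learned by gradient descent rather than passed in by hand:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_contrastive_loss(img_emb, txt_emb, t, b):
    """Sigmoid contrastive (SigLIP-style) loss for a batch of n pairs.

    img_emb, txt_emb: (n, d) arrays of L2-normalized embeddings.
    t: temperature (scales similarities), b: bias (shifts the logits).
    Matching pairs (i == j) are labeled +1, mismatched pairs -1.
    """
    sims = img_emb @ txt_emb.T          # (n, n) cosine similarities
    n = sims.shape[0]
    labels = 2.0 * np.eye(n) - 1.0      # +1 on the diagonal, -1 elsewhere
    logits = labels * (t * sims + b)
    # The loss approaches zero only as every logit grows large and positive,
    # i.e. matched pairs score high and mismatched pairs score low.
    return -np.mean(np.log(sigmoid(logits)))
```

With well-chosen knobs the same embeddings score far better: for perfectly separated pairs, a sharp temperature plus a negative bias pushes every logit up, while fixed default knobs (t = 1, b = 0) leave the loss stuck much higher.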
3. The Discovery: "Constellations"
The authors introduced a new concept called Constellations.
Imagine you are an astronomer looking at the night sky.
- The Stars: Each star is a pair of data (a picture and its matching word).
- The Constellation: The pattern they form.
The paper proves that for the robot to achieve a perfect score (zero loss), the stars don't need to be glued together. Instead, they just need to form a specific, stable pattern:
- Matching pairs (Cat image + "Cat" word) must be close enough to each other.
- Mismatched pairs (Cat image + "Dog" word) must be far enough away.
- There is a "safety margin" between the close pairs and the far pairs.
The authors call this a Constellation (the paper pins the idea down with a precise, margin-based definition). It's a geometric arrangement where the matching pairs are separated from the non-matching pairs by a clear gap, like stars in a constellation that are distinct from the background noise.
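The three bullet conditions above can be checked mechanically. This is an illustrative sketch of the "safety margin" idea, not the paper's formal definition; `threshold` and `margin` are made-up parameters standing in for the paper's precise quantities:

```python
import numpy as np

def forms_constellation(img_emb, txt_emb, threshold, margin):
    """Illustrative check of the gap condition: every matched pair's
    similarity must sit at least `margin` above `threshold`, and every
    mismatched pair's at least `margin` below it."""
    sims = img_emb @ txt_emb.T
    matched = np.diag(sims)                               # cat image + "cat"
    mismatched = sims[~np.eye(sims.shape[0], dtype=bool)] # cat image + "dog"
    return bool(matched.min() >= threshold + margin
                and mismatched.max() <= threshold - margin)
```

For example, with orthonormal embeddings every matched similarity is 1 and every mismatched similarity is 0, so a threshold of 0.5 with a margin of 0.3 is comfortably satisfied, while demanding a margin of 0.6 is not.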
4. The "Modality Gap" is a Feature, Not a Bug
One of the most surprising findings is about the Modality Gap.
- Old View: "Oh no! The robot isn't aligning the pictures and words perfectly. There's a gap between them. This is a failure."
- New View (This Paper): "Actually, that gap is good!"
The paper proves that because pictures and words are so different, they should live in slightly different regions of the robot's memory. Trying to force them to overlap perfectly actually hurts performance. The "gap" allows the robot to keep the two types of data distinct while still knowing they belong together. It's like having two different drawers in a filing cabinet: one for photos, one for text. They are separate, but you always know which drawer holds the match for an item in the other.
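A tiny numeric example (made-up vectors, not taken from the paper) shows the two ideas coexisting: all image embeddings share one common direction and all text embeddings share another, so the two centroids sit visibly apart, yet every matched pair still out-scores every mismatched one:

```python
import numpy as np

# Hypothetical setup: images share direction e3, texts share e4,
# while each pair also shares its own "content" direction (e1 or e2).
e1, e2, e3, e4 = np.eye(4)
img = np.stack([(e1 + e3), (e2 + e3)]) / np.sqrt(2)  # unit image embeddings
txt = np.stack([(e1 + e4), (e2 + e4)]) / np.sqrt(2)  # unit text embeddings

sims = img @ txt.T                                   # matched: 0.5, mismatched: 0.0
gap = np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))  # centroid distance: 1.0
```

The "drawers" are genuinely separate (the centroid gap is 1.0), but matched similarities (0.5) still cleanly beat mismatched ones (0.0), so the gap costs nothing for telling pairs apart.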
5. Why This Matters for the Real World
The paper solves a major mystery: How can we store billions of items in a relatively small memory space?
- Imagine you have a library with 10 billion books but only a few thousand shelves.
- Old theories said this was impossible unless the shelves were huge.
- This paper shows that by using the "Constellation" pattern and the "flexible thermostat," you can pack billions of items into a small space without them crashing into each other.
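The geometric intuition behind the packing claim can be seen in a quick experiment (illustrative numbers of my choosing, not the paper's): in a d-dimensional space you can fit vastly more than d random unit vectors, and a typical pair of distinct vectors is still nearly orthogonal, so each "book" matches itself perfectly while barely colliding with the others.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64                       # many more "books" (n) than "shelves" (d)
v = rng.standard_normal((n, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)   # project onto the unit sphere

dots = v @ v.T
off = dots[~np.eye(n, dtype=bool)]    # similarities between distinct vectors
# Each vector has similarity 1 with itself, while a typical distinct pair
# has similarity around 1/sqrt(d) ≈ 0.125 — near-orthogonal, not "crashing".
```

This concentration effect is what leaves room for a constellation's safety margin even when n is enormous relative to d.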
6. The Practical Upgrade: "Relative Bias"
Finally, the authors propose a small tweak to how we train these robots. Instead of just adjusting the "bias" (the nudge), they suggest adjusting the Relative Bias (the nudge relative to the temperature).
The Analogy:
Imagine you are teaching a child to catch a ball.
- Old Way: You throw the ball at a fixed speed and tell the child, "Catch it if it's within 1 foot of your hand."
- New Way (Relative Bias): You tell the child, "Catch it if it's within 10% of your arm's reach."
The new way adapts to the situation. The authors show that this small change makes the robot learn faster and creates a wider safety margin, making it much better at finding the right answer when it's searching through millions of options (like finding the right image for a search query).
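One plausible reading of the tweak, in formulas (the variable names here are mine, not the paper's): write the logit as t * (s + b_rel) instead of t * s + b, where s is the similarity. Then the similarity at which the sigmoid crosses 0.5, the decision boundary, stops depending on the temperature:

```python
def threshold_absolute(t, b):
    """Similarity where the logit t*s + b crosses zero (sigmoid = 0.5)."""
    return -b / t

def threshold_relative(t, b_rel):
    """With a relative bias, the logit is t * (s + b_rel), so the
    crossing point -b_rel no longer moves when t changes."""
    return -b_rel
```

As the model sharpens its temperature during training, the absolute-bias boundary drifts (from 0.5 down to 0.25 when t doubles in the example below), while the relative-bias boundary stays put, which matches the "10% of your arm's reach" analogy above.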
Summary
This paper explains that the secret to modern AI's ability to understand images and text isn't forcing them to be identical. It's about letting the AI find a stable, separated pattern (a Constellation) where matching items are close, non-matching items are far, and the two types of data (images vs. text) are allowed to stay in their own distinct "neighborhoods." By letting the AI tune its own strictness and bias, it achieves this perfect balance naturally.