The Big Problem: CLIP is "Distracted"
Imagine you have a super-smart librarian named CLIP. This librarian has read millions of books and seen millions of pictures. Their job is to match a picture to the correct sentence description.
Usually, CLIP is amazing. But recently, researchers noticed a weird glitch. If you show CLIP a picture of a red square and a blue circle, and ask it to choose between two descriptions:
- "A red square and a blue circle" (Correct)
- "A blue square and a red circle" (Wrong)
CLIP often gets it wrong. It seems to just count the words it sees: "Red? Check. Square? Check. Blue? Check. Circle? Check." It doesn't care which color belongs to which shape.
In computer science, we call this a "Bag-of-Words" problem. It's like putting all the ingredients for a cake into a bag, shaking them up, and saying, "I have flour, eggs, and sugar, so I must have a cake!" But you forgot to mix them in the right order. CLIP was treating the image and the text as just a messy bag of concepts, ignoring how they fit together.
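To see why pure word-counting fails here, consider a toy sketch (plain Python, not CLIP's actual representation): two captions that arrange the same words differently produce identical "bags," so any matcher built on top of them cannot tell the captions apart.

```python
from collections import Counter

def bag_of_words(caption: str) -> Counter:
    # A bag-of-words representation keeps word counts but discards order,
    # so it cannot tell which attribute binds to which object.
    return Counter(caption.lower().split())

correct = "a red square and a blue circle"
swapped = "a blue square and a red circle"

# Both captions contain exactly the same words, so their bags are identical.
print(bag_of_words(correct) == bag_of_words(swapped))  # True
```

This is the cake-in-a-bag problem in code: the ingredients match, so the representation declares the captions equal, even though the bindings differ.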
The Investigation: Is the Librarian Blind or Just Clumsy?
The researchers asked a crucial question: Is CLIP actually "blind" to the connection between the red color and the square shape? Or is it just that the way it compares pictures to words is clumsy?
To find out, they ran a series of clever tests:
- The Text-Only Test (Uni-modal): They asked CLIP to look only at the text (ignoring the picture) and tell them, "In the sentence 'red square and blue circle,' which color goes with the square?"
- Result: CLIP got it right almost 100% of the time! It knew the connection perfectly when looking at just the words.
- The Image-Only Test: They did the same with just the picture. They asked, "In this image, is the square red or blue?"
- Result: CLIP also got this right! It knew the connection perfectly when looking at just the image.
The Discovery: The information was already there. CLIP wasn't blind. It knew that "red" belonged to "square" and "blue" belonged to "circle" inside its own brain for both pictures and words.
- The "Crowded Room" Test: They added more objects (5, 10, even 20 shapes).
- Result: Even in a messy, crowded scene, CLIP's text brain could still separate the colors from the shapes. Its picture brain got a little confused by the clutter, but it still knew the basics.
- The "Spot the Imposter" Test: They showed CLIP a picture with many red cubes and green spheres, but hid one red sphere (which doesn't belong).
- Result: CLIP could spot the "imposter" red sphere because it recognized the unique combination of red + sphere, even though it had seen red cubes and green spheres before. This proved CLIP wasn't just a "bag of words"; it understood the specific binding of features.
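The idea behind these tests can be sketched in miniature: freeze the embeddings, fit a tiny linear classifier (a "probe") on top, and check whether the binding information is linearly readable. The data below is synthetic stand-in data, not real CLIP embeddings; it only illustrates the logic of probing for information that is "already there."

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen embeddings: each vector secretly encodes whether the
# square is red (label 1) or blue (label 0) along one hidden direction,
# plus nuisance noise -- a toy version of "the information is already there".
dim, n = 64, 400
hidden_direction = rng.normal(size=dim)
labels = rng.integers(0, 2, size=n)
embeddings = (
    np.outer(labels * 2 - 1, hidden_direction)  # +/- the hidden direction
    + 0.5 * rng.normal(size=(n, dim))           # nuisance variation
)

# A linear probe: a least-squares fit from embeddings to +/-1 targets,
# trained on the first 300 examples and evaluated on the rest.
train, test = slice(0, 300), slice(300, None)
w, *_ = np.linalg.lstsq(embeddings[train], labels[train] * 2.0 - 1.0, rcond=None)
pred = (embeddings[test] @ w > 0).astype(int)
accuracy = (pred == labels[test]).mean()
print(f"probe accuracy: {accuracy:.2f}")  # near 1.0 on this toy data
```

If a simple probe like this reads the binding out almost perfectly, the knowledge is present in the embedding; the failure must lie somewhere else.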
The Real Culprit: The "Translator" is Broken
So, if CLIP knows the answer in its head, why does it fail when matching a picture to a sentence?
The researchers realized the problem isn't the knowledge; it's the translation.
Imagine CLIP has two separate brains:
- Brain A (The Picture Brain): Speaks "Image Language."
- Brain B (The Text Brain): Speaks "Word Language."
Both brains know the truth: "Red goes with Square." But when they try to talk to each other, they are speaking different dialects. The "Image Language" version of "Red Square" doesn't quite line up with the "Word Language" version of "Red Square." They are slightly out of sync, like two people trying to dance to the same song but starting on different beats.
Because of this misalignment, when CLIP tries to match them, it gets confused and just grabs the closest words it can find (the "Bag of Words" approach).
The Solution: A Simple "Translator" Layer
The researchers didn't need to retrain the whole librarian (which would be expensive and slow). They just needed to fix the translator.
They added a tiny, simple linear layer (think of it as a small, adjustable filter or a translator) to the text side. This layer learned how to rotate and shift the "Word Language" so it perfectly matched the "Image Language."
- Before: The words and pictures were dancing out of sync.
- After: The translator fixed the rhythm. Now, "Red Square" in the text perfectly aligned with "Red Square" in the picture.
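The translator fix can be sketched with toy numbers: build two embedding "dialects" that agree only up to an unknown rotation, watch nearest-neighbour matching fail, then fit a single linear map on a handful of paired examples and watch matching recover. This is an illustrative reconstruction with made-up data, not the paper's actual adapter (which would be trained by gradient descent on real CLIP embeddings).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: "image" and "text" embeddings of the same 50 concepts agree up
# to an unknown rotation -- the same knowledge in two out-of-sync dialects.
dim, n = 32, 50
image_emb = rng.normal(size=(n, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthogonal map
text_emb = image_emb @ rotation

def match_accuracy(text, image):
    # Nearest-neighbour retrieval by cosine similarity:
    # does each caption find its own image?
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    i = image / np.linalg.norm(image, axis=1, keepdims=True)
    return float((np.argmax(t @ i.T, axis=1) == np.arange(len(text))).mean())

before = match_accuracy(text_emb, image_emb)  # near chance: dialects misaligned

# The "translator": one linear map fitted by least squares on 40 paired
# examples, sending text-space into image-space.
translator, *_ = np.linalg.lstsq(text_emb[:40], image_emb[:40], rcond=None)
after = match_accuracy(text_emb @ translator, image_emb)

print(f"matching accuracy before: {before:.2f}, after: {after:.2f}")
```

Here least squares stands in for training the linear layer: because the mismatch in this toy is exactly linear, one fitted map restores perfect matching, which is the spirit of the "tiny translator" result.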
The Result: With this tiny fix, CLIP's ability to match complex descriptions skyrocketed. It went from guessing randomly to getting it right 95% of the time.
Why This Matters (The Takeaway)
This is a huge win for efficiency.
- Old Way: To fix CLIP, you might have to retrain the whole massive model from scratch (like rebuilding the library).
- New Way: You just add a tiny, cheap "adapter" (like putting a new translator in the room). You don't need to change the library or re-read the books.
In short: CLIP wasn't stupid; it was just out of sync. The information was there all along, waiting to be unlocked with a simple, lightweight adjustment. This means we can make existing AI systems much smarter without the massive cost of retraining them.