The Problem: The "Yes, And..." Trap
Imagine you are playing a game of "Guess the Picture" with a very smart but slightly gullible robot.
- Round 1: You show the robot a picture of a dog. You say, "This is a dog." The robot looks at the picture, looks at your words, and says, "Yes! That's a perfect match!" (Score: 10/10).
- Round 2: You show the same picture of the dog. But this time, you say, "This is a dog riding a skateboard."
- Reality: The dog is just sitting there. It is not on a skateboard.
- The Robot's Reaction: Surprisingly, the robot gets more excited. It says, "Wow! A dog! And a skateboard! That's even more detailed! Score: 12/10!"
This is the core problem the paper identifies. Current AI models (like CLIP) are so eager to find any matching words that they get tricked by Half-Truths.
A Half-Truth is a sentence that is mostly correct but has one tiny, plausible lie added to it.
- The Lie: "The dog is on a skateboard."
- The Trap: Because the robot recognizes the word "dog" and the word "skateboard," it thinks the sentence is a better description than the simple, truthful one. It fails to check if the dog is actually on the skateboard.
The authors call this the "Conjunction Fallacy." It's like a human thinking, "Linda is a bank teller" is less likely than "Linda is a bank teller and is active in the feminist movement," even though adding details makes a scenario less likely, not more. The AI thinks adding details makes the match better, even when the details are wrong.
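The fallacy has a precise mathematical form: the probability of two things both being true can never exceed the probability of either one alone. A tiny Python sketch makes the rule concrete (the numbers here are made up for illustration, not from the paper):

```python
# Conjunction rule: P(A and B) <= P(A), always.
# Hypothetical numbers for the classic "Linda" example.
p_bank_teller = 0.05            # P(Linda is a bank teller)
p_feminist_given_teller = 0.6   # P(feminist | bank teller), assumed
p_both = p_bank_teller * p_feminist_given_teller  # P(A and B)

assert p_both <= p_bank_teller  # holds for ANY choice of probabilities
print(p_bank_teller, p_both)    # the conjunction is rarer, not likelier
```

Adding a detail can only shrink (or at best preserve) the probability; the fallacy, in humans and in CLIP-style models, is scoring it as if it grew.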
Why Does This Happen?
Think of the AI's brain as a Bag of Words trying to solve a matching puzzle.
- When it sees the picture, it pulls out a bag of "visual tokens" (dog, park, grass).
- When it reads the sentence "Dog on skateboard," it pulls out "text tokens" (dog, skateboard).
- It sees "Dog" matches "Dog." It sees "Skateboard" matches... well, it doesn't see a skateboard in the picture, but it's so focused on the "Dog" match that it ignores the missing skateboard. It treats the sentence like a grocery list: "Do we have a dog? Yes! Do we have a skateboard? Maybe! Close enough!"
The AI isn't checking the relationships (is the dog on the board?). It's just counting how many words overlap.
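The grocery-list behavior above can be sketched in a few lines. This toy scorer is not the paper's model; it just counts word overlap, which is the failure mode being described:

```python
def overlap_score(image_tokens, caption):
    """Toy bag-of-words matcher: count shared words, ignore relations."""
    caption_tokens = set(caption.lower().split())
    return len(set(image_tokens) & caption_tokens)

image = ["dog", "park", "grass"]  # what the model "sees" in the photo

print(overlap_score(image, "a dog"))                              # -> 1
print(overlap_score(image, "a dog on a skateboard in the park"))  # -> 2
# The half-truth scores HIGHER: "dog" and "park" both overlap, and the
# scorer never checks whether the skateboard exists or the dog is on it.
```

Any scorer of this shape rewards longer captions that sneak in one extra matching word, exactly the "12/10" behavior from the robot game.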
The Solution: CS-CLIP (The "Detail Detective")
The authors created a new training method called CS-CLIP (Component-Supervised CLIP). Instead of just teaching the AI to match the whole sentence to the whole picture, they taught it to act like a forensic accountant checking a receipt.
The Training Analogy:
Imagine you are training a new employee to check receipts.
- Old Way: You show them a receipt and say, "Does this match the order?" They just glance at the total and say "Looks good."
- New Way (CS-CLIP): You break the receipt down line by line.
- "Here is the item: Brown Horse."
- "Here is a fake receipt line: White Horse."
- "Tell me which one matches the photo."
- "Here is the relationship: Horse near Barn."
- "Here is a fake relationship: Horse inside Barn."
- "Tell me which one is true."
By forcing the AI to practice spotting the difference between a "Brown Horse" and a "White Horse," or a "Horse near a barn" and a "Horse inside a barn," it learns to pay attention to the specific details and how things connect, not just the general vibe.
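A common way to implement this kind of "spot the fake" training is a margin loss over hard negatives: the image embedding must score the true phrase higher than the minimally-edited fake by some gap. The sketch below illustrates that general idea in plain Python; it is not the paper's actual code, and the vectors and margin are invented:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def component_loss(img, pos, neg, margin=0.2):
    """Zero when the true phrase beats the fake by at least `margin`;
    otherwise a penalty that pushes the fake further from the image."""
    return max(0.0, margin - (cosine(img, pos) - cosine(img, neg)))

img = [1.0, 0.0]          # stand-in image embedding
true_phrase = [1.0, 0.1]  # e.g. "horse near barn" (close to the image)
easy_fake = [0.0, 1.0]    # unrelated caption (already far away)
hard_fake = [1.0, 0.05]   # e.g. "horse inside barn" (dangerously close)

print(component_loss(img, true_phrase, easy_fake))  # 0.0: no signal needed
print(component_loss(img, true_phrase, hard_fake))  # > 0: training signal
```

Note that the easy fake contributes nothing: all the learning signal comes from hard negatives like "near" vs. "inside", which is why the training data is built from one-word edits rather than random wrong captions.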
The Results: From Gullible to Sharp
After this "detail detective" training, the AI changed its behavior:
- Before (CLIP): Add a fake detail, and the AI thinks the description is better. It is easily fooled.
- After (CS-CLIP): Add a fake detail, and the AI immediately says, "Wait, that doesn't fit. The score should go down."
The Stats in Plain English:
- Old AI: Only caught the lie about 40% of the time. (It was fooled more often than not).
- New AI (CS-CLIP): Catches the lie about 69% of the time.
- The Hardest Part: The AI was terrible at spotting wrong relationships (like "dog on skateboard"). The old AI got this right only 33% of the time. The new AI got it right 65% of the time.
Why Should You Care?
This matters because we want AI to be a reliable assistant, not a "yes-man."
- Search Engines: If you search for "red car," you don't want the AI to show you a "red car with a unicorn on top" just because it matches the words "red" and "car."
- Safety: If a robot is told "The door is open," it needs to know if the door is actually open, not just that the words "door" and "open" are in its database.
Summary
The paper shows that current AI models are too easily tricked by adding extra, fake details to a description. They think "more words = better match." The authors fixed this by teaching the AI to check every single word and relationship individually, turning it from a gullible guesser into a sharp-eyed detective that knows when a story doesn't add up.