Imagine you have a super-smart robot assistant that can look at a picture and describe it perfectly. You might think, "Great! If I show it a photo of a red dog, it will always say 'red dog,' no matter how I ask."
But what if you asked, "Is that a blue cat?" and the robot said, "Yes, that looks like a blue cat to me"? Or what if you described the same red dog as "a canine with crimson fur" and the robot got confused, thinking it was a different picture entirely?
This paper introduces a new way to test these robots, called LGIP (Language-Guided Invariance Probing). Think of it as a "Truth and Consistency Test" for AI.
Here is how it works, broken down into simple concepts:
1. The Two Rules of a Good Robot
The authors say a truly smart vision-language robot needs to follow two golden rules:
Rule #1: The "Same Story" Rule (Invariance)
If you tell the robot the same story but change the words (like saying "a happy dog" vs. "a joyful pup"), the robot should recognize it's the same picture. It shouldn't get confused by fancy wording.
- Analogy: If you wear a red hat or a red scarf, you are still you. The robot shouldn't think you are a different person just because your outfit description changed.
Rule #2: The "Lie Detector" Rule (Sensitivity)
If you tell the robot a lie about the picture (like saying "a blue dog" when it's clearly red), the robot should immediately say, "Nope, that doesn't match." It needs to be sensitive to the truth.
- Analogy: If you point to a banana and say "That's an apple," a good friend should correct you. A bad friend might just nod and agree.
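The two rules above can be turned into simple numbers. Here is a minimal sketch, assuming we already have an image-text matching score (higher = better match); the function names and the score values are illustrative, not taken from the paper:

```python
def invariance_error(score_original: float, score_paraphrase: float) -> float:
    """Rule #1: a good model gives nearly identical scores to
    two wordings of the same true description (small is good)."""
    return abs(score_original - score_paraphrase)

def sensitivity(score_original: float, score_flip: float) -> float:
    """Rule #2: a good model scores the true description well
    above the lie; a larger gap means a better lie detector."""
    return score_original - score_flip

# Hypothetical scores for a photo of a red dog:
s_true = 0.31   # "a happy dog"
s_para = 0.30   # "a joyful pup"  (same meaning, new words)
s_flip = 0.12   # "a blue dog"    (the lie)

print(invariance_error(s_true, s_para))  # small is good
print(sensitivity(s_true, s_flip))       # large is good
```

A robot that passes both tests shows a tiny invariance error and a large, positive sensitivity gap.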
2. The Test: The "Flip" and the "Paraphrase"
The researchers took 40,000 photos (from a famous dataset called MS COCO) and ran them through 9 different famous AI models (like CLIP, SigLIP, and EVA).
They did two things to the text descriptions:
- The Paraphrase (The Reword): They rewrote the sentence to mean the exact same thing but used different words.
- The Semantic Flip (The Lie): They took a key word and swapped it for something wrong.
  - Original: "A cat sits on a red chair."
  - Flip: "A dog sits on a blue chair."
Then, they asked the AI: "Does this new sentence match the picture?"
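The whole test loop can be sketched like this. It is a toy version with a stand-in `match_score` function in place of a real vision-language model; the function, its hard-coded scores, and the captions are illustrative assumptions, not the paper's code:

```python
# Each probe: (image_id, original caption, paraphrase, semantic flip)
probes = [
    ("img_001", "A cat sits on a red chair.",
                "A feline rests on a crimson chair.",
                "A dog sits on a blue chair."),
]

def match_score(image_id: str, caption: str) -> float:
    """Stand-in for a real model's image-text similarity score.
    Hard-coded values purely for illustration."""
    fake_scores = {
        ("img_001", "A cat sits on a red chair."): 0.32,
        ("img_001", "A feline rests on a crimson chair."): 0.30,
        ("img_001", "A dog sits on a blue chair."): 0.14,
    }
    return fake_scores[(image_id, caption)]

invariance_errors, flip_wins = [], 0
for image_id, original, paraphrase, flip in probes:
    s_orig = match_score(image_id, original)
    # Rule #1: the paraphrase should score the same as the original.
    invariance_errors.append(abs(s_orig - match_score(image_id, paraphrase)))
    # Rule #2: count the cases where the lie beat the truth.
    if match_score(image_id, flip) >= s_orig:
        flip_wins += 1

print(sum(invariance_errors) / len(invariance_errors))  # low = consistent
print(flip_wins / len(probes))  # fraction of lies preferred; 0 is ideal
```

A brittle model is one where that second number climbs well above zero, meaning it sometimes ranks the flipped caption above the true one.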
3. The Results: Who Passed and Who Failed?
The test revealed some surprising secrets that standard tests missed.
The "Honor Roll" (CLIP, OpenCLIP, EVA):
These models are like the responsible students.
- They understood that "joyful pup" and "happy dog" are the same (Low Invariance Error).
- They firmly rejected the lie about the blue dog (High Sensitivity).
- Verdict: They are robust and reliable.
The "Confused Students" (SigLIP family):
These models are like students who are good at memorizing facts but bad at understanding context.
- They got confused when the words changed slightly (High Invariance Error).
- The scary part: Sometimes, they actually preferred the lie over the truth! If you showed them a picture of a cat and said "This is a person," the SigLIP model sometimes gave the lie a higher score than the real description.
- Verdict: They are brittle. They might look smart on standard tests, but they fail when you try to trick them with simple word swaps.
4. Why Does This Matter?
You might ask, "Why do we care if an AI gets confused by a synonym?"
Imagine you are using an AI to help a doctor find a specific medical scan, or to help a blind person navigate a room.
- If the AI is inconsistent, it might miss a crucial image just because the doctor used a different medical term.
- If the AI is not sensitive to lies, it might tell a blind person, "There is a clear path ahead," when there is actually a wall, because the AI hallucinated the description.
The Big Takeaway
This paper is like a stress test for the AI's brain. It shows that a high score on a standard exam (like identifying objects) doesn't mean an AI truly understands the relationship between words and pictures.
The authors found that some models (like EVA and OpenCLIP) are built with a stronger "truth detector," while others (like SigLIP) are surprisingly fragile. This new test, LGIP, is a simple, cheap way to check if your AI is actually smart or just good at guessing.