Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning

This study investigates the impact of nine common image augmentations on CLIP's embedding shifts using various similarity and distance metrics, revealing that noise, perspective transforms, and scaling cause the most drastic changes. These findings provide foundational insights for improving Vision Language Model robustness and mechanistic interpretability.

Ashim Dahal, Saydul Akbar Murad, Nick Rahimi

Published 2026-03-24

Imagine CLIP (the AI model at the heart of this study) as a super-smart art critic who has memorized millions of pictures and their descriptions. This critic doesn't just "see" an image; it translates every photo into a unique secret code (an "embedding") that captures the essence of what the picture is about.

The researchers in this paper asked a simple but profound question: "What happens to this secret code if we mess with the picture?"

They took 9 different ways to "mess with" an image—like blurring it, adding static noise, flipping it, or stretching it—and watched how the AI's secret code changed. They wanted to know: Does the AI still recognize the cat, or does it suddenly think it's a dog? Does the code stay the same, or does it drift away?

Here is the breakdown of their findings, using some everyday analogies:

1. The Experiment: The "Makeover" Test

Think of the original image as a portrait of a person. The researchers applied 9 different "makeovers" to this portrait:

  • The "Gritty" Makeovers: Adding random static (noise), blurring the face, or making the colors weird (color jitter).
  • The "Geometric" Makeovers: Stretching the face (elastic), tilting the perspective, or zooming in/out.
  • The "Simple" Makeovers: Flipping the image left-to-right or just making it brighter/darker.

They then asked the AI: "What is the secret code for this new version?" and compared it to the code for the original.
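That augment-and-compare loop can be sketched in a few lines of Python. This is a hedged illustration, not the paper's code: `encode` is a hypothetical toy stand-in for CLIP's image encoder, and the two "makeovers" operate on a flat list of pixel values instead of a real image.

```python
import math
import random

random.seed(0)

def encode(pixels):
    """Toy stand-in for CLIP's image encoder: maps pixels to a
    fixed-length "secret code". A hypothetical placeholder, NOT CLIP."""
    return [sum(pixels[i::4]) / len(pixels) for i in range(4)]

def add_noise(pixels, sigma=0.5):
    """'Gritty' makeover: cover the image in random static."""
    return [p + random.gauss(0, sigma) for p in pixels]

def flip(pixels):
    """'Simple' makeover: mirror the image left-to-right."""
    return pixels[::-1]

def cosine(u, v):
    """How aligned two codes are: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

image = [float(i % 7) for i in range(64)]  # fake 8x8 grayscale image
original = encode(image)

# Apply each makeover, re-encode, and compare to the original code.
for name, aug in [("noise", add_noise), ("flip", flip)]:
    shifted = encode(aug(image))
    print(f"{name}: cosine similarity = {cosine(original, shifted):.3f}")
```

In the actual study the encoder is CLIP's vision tower and the augmentations come from a standard image-processing pipeline, but the measurement logic is the same: encode the original, encode the altered version, and compare the two codes.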

2. The Big Discovery: Not All Makeovers Are Equal

The study found that the AI is very sensitive to some changes but very stubborn about others.

  • The "Traumatic" Changes (High Impact):

    • Noise (Static): Imagine taking a photo and covering it in TV static. This was the most damaging. It scrambled the AI's secret code almost completely. The AI got so confused it barely recognized the original subject.
    • Perspective & Stretching: If you take a photo of a face and stretch it like taffy or view it from a weird angle, the AI's code shifted significantly. It's like looking at a friend through a funhouse mirror; the AI struggles to match the distorted shape to its memory.
    • Blur & Pixel Dropping: Blurring the image or cutting out chunks of it (like a puzzle with missing pieces) also confused the AI's code.
  • The "Harmless" Changes (Low Impact):

    • Flipping & Brightness: If you flip a photo horizontally or just make it a bit brighter, the AI's secret code barely changed at all. It's like the AI saying, "Oh, it's still the same cat, just seen in a mirror or under brighter light." The AI is very good at ignoring these simple tricks.

3. How They Measured the "Drift"

The researchers didn't just guess; they used a toolkit of "rulers" to measure exactly how far the code moved:

  • The "Attention Map" (The AI's Gaze): They looked at where the AI was looking in the picture.
    • Analogy: In the original photo, the AI's "gaze" was focused tightly on the dog's face. When they added noise, the gaze became scattered and confused, looking at random spots. When they blurred the image, the gaze spread out, unable to find a focal point.
  • The "Distance" Ruler: They measured the mathematical distance between the original code and the new code.
    • Analogy: If the original code is "Home," and the new code is "Home + 1 mile," that's a small shift. If the new code is "Home + 100 miles," that's a massive shift. Noise pushed the code 100 miles away; Flipping only moved it a few inches.
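Both "rulers" are easy to write down concretely. The sketch below is an illustrative toy, not the paper's exact metric suite: it computes Euclidean distance and cosine similarity between two embedding vectors, plus an entropy score as one simple way to quantify how "scattered" an attention map is (the specific numbers and 3-dimensional vectors are made up for the example).

```python
import math

def euclidean(u, v):
    """The 'distance ruler': straight-line distance between two codes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Angle-based ruler: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def attention_entropy(weights):
    """Scatter of the 'gaze': low entropy = focused on one spot,
    high entropy = attention spread over many random spots."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

home = [1.0, 0.0, 0.0]   # code for the original image
near = [0.9, 0.1, 0.0]   # e.g. after a flip: barely moved
far  = [0.1, 0.7, 0.7]   # e.g. after heavy noise: a big shift

print("distances:", euclidean(home, near), euclidean(home, far))
print("cosines:  ", cosine_similarity(home, near),
                    cosine_similarity(home, far))

focused   = [0.90, 0.05, 0.05]  # gaze locked on the dog's face
scattered = [0.34, 0.33, 0.33]  # gaze after adding static
print("entropy:  ", attention_entropy(focused),
                    attention_entropy(scattered))
```

The "Home + 1 mile" versus "Home + 100 miles" analogy maps directly onto these numbers: a harmless flip leaves the shifted code close to the original (small distance, cosine near 1.0), while heavy noise pushes it far away (large distance, cosine near 0) and flattens the attention map into near-uniform scatter.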

4. Why Does This Matter?

You might ask, "So what? Why do we care if an AI gets confused by static?"

This is crucial for Safety and Trust:

  • Robustness: If an AI is used to drive a car or diagnose a disease, it needs to be robust. If a little bit of rain (blur) or a dirty windshield (noise) makes the AI's internal logic collapse, that's dangerous. This study tells us exactly which types of mess-ups break the AI's brain.
  • Understanding the Brain: By seeing how the code shifts, we are learning how the AI "thinks." It turns out the AI has learned to ignore simple changes (like brightness) but relies heavily on sharp details and specific shapes.

The Bottom Line

This paper is like a stress test for an AI's brain. It shows us that while the AI is smart, it has specific weak spots.

  • Strong: It handles simple flips and brightness changes easily.
  • Weak: It gets easily thrown off by static noise, heavy blurring, or weird distortions.

By understanding these weaknesses, scientists can build better, safer AI systems that don't get confused when the real world gets a little messy.
