Embedding Shift Dissection on CLIP: Effects of Augmentations on VLM's Representation Learning

This study investigates the impact of nine common image augmentations on CLIP's embedding shifts using various similarity and distance metrics, revealing that noise, perspective transforms, and scaling cause the most drastic changes. These findings provide foundational insights for improving Vision Language Model robustness and mechanistic interpretability.

Ashim Dahal, Saydul Akbar Murad, Nick Rahimi

Published 2026-03-24

Imagine CLIP (the AI model at the heart of this study) as a super-smart art critic who has memorized millions of pictures and their descriptions. This critic doesn't just "see" an image; it translates every photo into a unique secret code (an "embedding") that captures the essence of what the picture is about.

The researchers in this paper asked a simple but profound question: "What happens to this secret code if we mess with the picture?"

They took 9 different ways to "mess with" an image—like blurring it, adding static noise, flipping it, or stretching it—and watched how the AI's secret code changed. They wanted to know: Does the AI still recognize the cat, or does it suddenly think it's a dog? Does the code stay the same, or does it drift away?

Here is the breakdown of their findings, using some everyday analogies:

1. The Experiment: The "Makeover" Test

Think of the original image as a portrait of a person. The researchers applied 9 different "makeovers" to this portrait:

  • The "Gritty" Makeovers: Adding random static (noise), blurring the face, or making the colors weird (color jitter).
  • The "Geometric" Makeovers: Stretching the face (elastic), tilting the perspective, or zooming in/out.
  • The "Simple" Makeovers: Flipping the image left-to-right or just making it brighter/darker.

They then asked the AI: "What is the secret code for this new version?" and compared it to the code for the original.
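That augment-and-compare loop can be sketched in a few lines of Python. This is a hedged illustration, not the paper's code: `encode` is a hypothetical toy stand-in for CLIP's image encoder, and the two "makeovers" operate on a flat list of pixel values instead of a real image.

```python
import math
import random

random.seed(0)

def encode(pixels):
    """Toy stand-in for CLIP's image encoder: maps pixels to a
    fixed-length "secret code". A hypothetical placeholder, NOT CLIP."""
    return [sum(pixels[i::4]) / len(pixels) for i in range(4)]

def add_noise(pixels, sigma=0.5):
    """'Gritty' makeover: cover the image in random static."""
    return [p + random.gauss(0, sigma) for p in pixels]

def flip(pixels):
    """'Simple' makeover: mirror the image left-to-right."""
    return pixels[::-1]

def cosine(u, v):
    """How aligned two codes are: 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

image = [float(i % 7) for i in range(64)]  # fake 8x8 grayscale image
original = encode(image)

# Apply each makeover, re-encode, and compare to the original code.
for name, aug in [("noise", add_noise), ("flip", flip)]:
    shifted = encode(aug(image))
    print(f"{name}: cosine similarity = {cosine(original, shifted):.3f}")
```

In the actual study the encoder is CLIP's vision tower and the augmentations come from a standard image-processing pipeline, but the measurement logic is the same: encode the original, encode the altered version, and compare the two codes.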

2. The Big Discovery: Not All Makeovers Are Equal

The study found that the AI is very sensitive to some changes but very stubborn about others.

  • The "Traumatic" Changes (High Impact):

    • Noise (Static): Imagine taking a photo and covering it in TV static. This was the most damaging. It scrambled the AI's secret code almost completely. The AI got so confused it barely recognized the original subject.
    • Perspective & Stretching: If you take a photo of a face and stretch it like taffy or view it from a weird angle, the AI's code shifted significantly. It's like looking at a friend through a funhouse mirror; the AI struggles to match the distorted shape to its memory.
    • Blur & Pixel Dropping: Blurring the image or cutting out chunks of it (like a puzzle with missing pieces) also confused the AI's code.
  • The "Harmless" Changes (Low Impact):

    • Flipping & Brightness: If you flip a photo horizontally or just make it a bit brighter, the AI's secret code barely changed at all. It's like the AI saying, "Oh, it's still the same cat, just seen in a mirror or under brighter light." The AI is very good at ignoring these simple tricks.

3. How They Measured the "Drift"

The researchers didn't just guess; they used a toolkit of "rulers" to measure exactly how far the code moved:

  • The "Attention Map" (The AI's Gaze): They looked at where the AI was looking in the picture.
    • Analogy: In the original photo, the AI's "gaze" was focused tightly on the dog's face. When they added noise, the gaze became scattered and confused, looking at random spots. When they blurred the image, the gaze spread out, unable to find a focal point.
  • The "Distance" Ruler: They measured the mathematical distance between the original code and the new code.
    • Analogy: If the original code is "Home," and the new code is "Home + 1 mile," that's a small shift. If the new code is "Home + 100 miles," that's a massive shift. Noise pushed the code 100 miles away; Flipping only moved it a few inches.
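Both "rulers" are easy to write down concretely. The sketch below is an illustrative toy, not the paper's exact metric suite: it computes Euclidean distance and cosine similarity between two embedding vectors, plus an entropy score as one simple way to quantify how "scattered" an attention map is (the specific numbers and 3-dimensional vectors are made up for the example).

```python
import math

def euclidean(u, v):
    """The 'distance ruler': straight-line distance between two codes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Angle-based ruler: 1.0 = same direction, near 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def attention_entropy(weights):
    """Scatter of the 'gaze': low entropy = focused on one spot,
    high entropy = attention spread over many random spots."""
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

home = [1.0, 0.0, 0.0]   # code for the original image
near = [0.9, 0.1, 0.0]   # e.g. after a flip: barely moved
far  = [0.1, 0.7, 0.7]   # e.g. after heavy noise: a big shift

print("distances:", euclidean(home, near), euclidean(home, far))
print("cosines:  ", cosine_similarity(home, near),
                    cosine_similarity(home, far))

focused   = [0.90, 0.05, 0.05]  # gaze locked on the dog's face
scattered = [0.34, 0.33, 0.33]  # gaze after adding static
print("entropy:  ", attention_entropy(focused),
                    attention_entropy(scattered))
```

The "Home + 1 mile" versus "Home + 100 miles" analogy maps directly onto these numbers: a harmless flip leaves the shifted code close to the original (small distance, cosine near 1.0), while heavy noise pushes it far away (large distance, cosine near 0) and flattens the attention map into near-uniform scatter.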

4. Why Does This Matter?

You might ask, "So what? Why do we care if an AI gets confused by static?"

This is crucial for Safety and Trust:

  • Robustness: If an AI is used to drive a car or diagnose a disease, it needs to be robust. If a little bit of rain (blur) or a dirty windshield (noise) makes the AI's internal logic collapse, that's dangerous. This study tells us exactly which types of mess-ups break the AI's brain.
  • Understanding the Brain: By seeing how the code shifts, we are learning how the AI "thinks." It turns out the AI has learned to ignore simple changes (like brightness) but relies heavily on sharp details and specific shapes.

The Bottom Line

This paper is like a stress test for an AI's brain. It shows us that while the AI is smart, it has specific weak spots.

  • Strong: It handles simple flips and brightness changes easily.
  • Weak: It gets easily thrown off by static noise, heavy blurring, or weird distortions.

By understanding these weaknesses, scientists can build better, safer AI systems that don't get confused when the real world gets a little messy.
