The Big Picture: The "Over-Confident" Translator
Imagine you have a super-smart translator named CLIP. This translator is amazing at looking at a picture and instantly knowing what it is, even if it's never seen that specific picture before. If you show it a photo of a golden retriever on a beach, it knows exactly what to say.
The Problem:
However, CLIP has a very weak spot. If someone puts a tiny, invisible sticker on the photo (an "adversarial perturbation")—like a few pixels of noise that a human eye can't even see—CLIP suddenly gets confused. It might look at that same golden retriever and scream, "That's a toaster!"
Why does this happen?
Think of CLIP as having two brains: one for Images and one for Text. In a perfect world, the "Image Brain" and the "Text Brain" hold hands tightly. But when an attacker messes with the image, the "Image Brain" gets dizzy and lets go of the "Text Brain." They drift apart, and the translator loses its way.
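The "holding hands" picture above can be sketched in a few lines of code. This is a toy illustration, not CLIP itself: the tiny 4-dimensional vectors and the class prompts are made up (real CLIP embeddings have hundreds of dimensions), but the matching rule—pick the text whose embedding points in the same direction as the image embedding—is the same idea.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length so the dot product is cosine similarity."""
    return v / np.linalg.norm(v)

# Hypothetical text embeddings for two class prompts (made-up 4-d vectors).
text_embeds = {
    "a photo of a dog":     normalize(np.array([0.9, 0.1, 0.0, 0.1])),
    "a photo of a toaster": normalize(np.array([0.0, 0.2, 0.9, 0.1])),
}

def zero_shot_classify(image_embed, text_embeds):
    """Pick the class whose text embedding is most similar to the image embedding."""
    image_embed = normalize(image_embed)
    scores = {label: float(image_embed @ t) for label, t in text_embeds.items()}
    return max(scores, key=scores.get)

# A clean "dog" image embedding points roughly the same way as the dog prompt,
# so the two "brains" agree.
clean = np.array([0.85, 0.15, 0.05, 0.1])
print(zero_shot_classify(clean, text_embeds))  # a photo of a dog
```

An adversarial perturbation works by nudging the image vector until it points toward the wrong text vector—the "Image Brain" letting go of the "Text Brain."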
The Solution: Meet COLA
The researchers created a new method called COLA (Cross-modaLity Alignment). Think of COLA as a GPS and a Translator's Guidebook that helps CLIP find its way back home, even when the road is blocked by noise.
COLA does this in two clever steps:
Step 1: The "Magic Filter" (Subspace Projection)
Imagine the "Image Brain" is looking at a messy room full of clutter (the noise from the attack). It's hard to see the golden dog because of all the junk.
COLA says, "Wait a minute! We know what a 'dog' looks like based on our text descriptions." It creates a safe zone (a subspace) built entirely out of the descriptions of dogs, cats, cars, etc.
It then takes the messy, attacked image and projects it onto this safe zone.
- The Analogy: Imagine you are trying to find a specific book in a library that has been ransacked. Instead of digging through the trash on the floor, you go straight to the "Dog" section of the shelves. You force the messy image to sit only in the "Dog" section.
- The Result: The "trash" (the adversarial noise) gets filtered out because it doesn't fit the "Dog" description. The image is now clean and aligned with the text again.
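The filtering step above can be sketched with plain linear algebra. This is a simplified stand-in for the paper's method, with synthetic random features: we treat the class text embeddings as the columns of a matrix and use a least-squares projection to keep only the part of the attacked image feature that the text subspace can explain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns of T are hypothetical text embeddings, one per class
# (8-d features and 3 classes, purely for illustration).
T = rng.normal(size=(8, 3))

def project_to_text_subspace(image_feat, T):
    """Least-squares projection of image_feat onto span(columns of T)."""
    coeffs, *_ = np.linalg.lstsq(T, image_feat, rcond=None)
    return T @ coeffs

# An "attacked" feature = something inside the text subspace plus
# off-subspace noise (the "trash" in the analogy).
clean_part = T @ np.array([1.0, -0.5, 0.2])
attacked = clean_part + rng.normal(scale=0.3, size=8)

filtered = project_to_text_subspace(attacked, T)

# Whatever the projection throws away is orthogonal to every text
# direction: the part no class description can explain is filtered out.
residual = attacked - filtered
print(np.abs(T.T @ residual).max() < 1e-8)  # True
```

The projection cannot remove noise that happens to lie inside the text subspace, but it strips everything the class descriptions cannot account for—which is the intuition behind "forcing the messy image to sit in the Dog section."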
Step 2: The "Group Hug" (Optimal Transport)
Once the image is cleaned up, COLA doesn't just look at one version of the image. It creates a few slightly different versions (cropped, flipped, resized) to be sure. It also asks a language model to write 50 different sentences describing the same class (e.g., "A golden retriever," "A dog running," "A pet on sand").
Now, instead of matching one image to one sentence, COLA matches the whole group of image variations to the whole group of text descriptions.
- The Analogy: Imagine you are trying to identify a person in a crowd. Instead of just looking at their face once, you look at them from five different angles, and you ask five different witnesses to describe them. If the person matches the descriptions from all angles and all witnesses, you can be far more confident who they are.
- The Result: Even if the attack tries to confuse one angle or one description, the "group consensus" remains strong. The image and text stay locked together.
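The group-to-group matching can be sketched with entropic optimal transport (the Sinkhorn algorithm). Everything here is a toy: the 2-d embeddings are invented, and the scoring rule (pick the class whose prompt group is cheapest to "transport" the image views onto) is a simplified stand-in for the paper's formulation.

```python
import numpy as np

def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic optimal transport with uniform marginals; returns the plan."""
    n, m = cost.shape
    K = np.exp(-cost / reg)               # Gibbs kernel of the cost matrix
    a, b = np.ones(n) / n, np.ones(m) / m  # uniform weights on each group
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):               # alternate scaling updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def class_score(image_views, text_prompts):
    """Lower = better match: total transport cost between the two groups."""
    iv = image_views / np.linalg.norm(image_views, axis=1, keepdims=True)
    tp = text_prompts / np.linalg.norm(text_prompts, axis=1, keepdims=True)
    cost = 1.0 - iv @ tp.T                 # 1 - cosine similarity per pair
    plan = sinkhorn(cost)
    return float((plan * cost).sum())

# Three augmented "views" of a dog photo vs. two prompt groups (made up).
dog_views = np.array([[1.0, 0.0], [0.9, 0.2], [0.95, 0.1]])
dog_prompts = np.array([[0.98, 0.05], [1.0, 0.1]])
toaster_prompts = np.array([[0.05, 1.0], [0.1, 0.9]])

# The group consensus favors "dog": moving the views onto the dog prompts
# is much cheaper than moving them onto the toaster prompts.
print(class_score(dog_views, dog_prompts)
      < class_score(dog_views, toaster_prompts))  # True
```

Because the score aggregates every view against every description, perturbing one view or fooling one prompt barely moves the total—which is the "group consensus" staying strong.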
Why is this a Big Deal?
- No Retraining Needed: Usually, to fix a broken AI, you have to teach it all over again (which takes weeks and huge computers). COLA is like a plug-in tool. You can take any existing CLIP model, plug COLA in, and it works immediately. No new training required.
- It Works on Everything: The researchers tested this on 14 different benchmark datasets (cars, flowers, satellite images, food, and more). In almost every case, COLA stopped the AI from getting tricked by attacks, while still keeping it smart on normal, clean photos.
- It's Fast: Because it doesn't need to retrain, it's actually faster than other defense methods that try to "fight back" against the attack.
The Bottom Line
CLIP is a brilliant but easily confused translator.
Adversarial attacks are like tiny, invisible bugs that make CLIP hallucinate.
COLA is the immune system that filters out the bugs and reminds CLIP of what it actually knows, ensuring that a picture of a dog stays a dog, even when someone tries to trick the computer.
It's a simple, powerful, and training-free way to make AI safer and more reliable for real-world use, like self-driving cars or medical diagnosis, where getting it wrong could be dangerous.