TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models

This paper proposes Test-Time Padding (TTP), a lightweight framework that detects adversarial inputs to Vision-Language Models by measuring how much the image embedding's cosine similarity shifts after spatial padding. Flagged inputs then receive targeted adaptation to restore robustness, yielding strong adversarial defense without sacrificing clean accuracy.

Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li

Published 2026-03-24

Imagine you have a super-smart robot librarian named CLIP. This librarian has read millions of books and looked at millions of photos. If you show it a picture of a dog and ask, "Is this a cat or a dog?" it will almost always get it right, even if it's never seen that specific dog before. It's incredibly fast and smart.

The Problem: The "Magic Trick" Attack
However, this librarian has a weakness. A hacker can perform a tiny, invisible "magic trick" on a photo: adding a faint layer of noise, like static on an old TV, to a picture of a dog. To your human eye, it still looks like a dog. But the "magic trick" confuses the robot's brain so badly that it suddenly thinks, "Oh, that's definitely a cat!"

This is called an adversarial attack. It's like a magician tricking a judge into thinking a rabbit is a hat.

The Old Solutions: Too Slow or Too Clumsy
Previously, to fix this, scientists tried two things:

  1. Retraining: They tried to teach the robot new tricks by showing it thousands of these "tricked" photos. But this takes forever, costs a fortune, and the robot forgets how to do its other jobs.
  2. Test-Time Adaptation: They tried to make the robot "think harder" every time it sees a picture. But the robot had to think harder about every single picture, even the normal ones, which slowed everything down and sometimes made it less accurate on normal photos.

The New Solution: TTP (Test-Time Padding)
The authors of this paper propose a clever, lightweight trick called Test-Time Padding (TTP). Think of it as a "Security Guard + Magic Frame" system.

Here is how it works, step-by-step:

1. The Security Guard (Detection)

Imagine you hand a photo to the robot. Before it looks at the photo, the security guard (TTP) puts a thick, white border (padding) around the image.

  • If the photo is normal: Adding a white border doesn't change what the robot sees. The robot still thinks, "That's a dog." The border didn't confuse it.
  • If the photo is a "tricked" photo: The "magic trick" was very delicate. When the guard adds a big white border, it disrupts the delicate trick. The robot suddenly realizes, "Wait, that border changed things! This photo is suspicious!"

The system measures how much the robot's opinion changed when the border was added.

  • Small change? It's a normal photo. Let it pass.
  • Big change? It's a tricked photo! Stop the line.
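The detection rule above can be sketched in a few lines. This is a minimal, illustrative version: the feature vectors are toy stand-ins for CLIP image embeddings, and the threshold value is made up for the example (the paper's actual criterion and threshold may differ).

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

def is_adversarial(feat_original, feat_padded, threshold=0.15):
    """Flag the input as adversarial when padding shifts the embedding
    by more than the threshold (shift = 1 - cosine similarity)."""
    shift = 1.0 - cosine(feat_original, feat_padded)
    return shift > threshold

# Toy stand-ins for CLIP image embeddings before and after padding.
clean, clean_padded = [0.9, 0.1, 0.2], [0.88, 0.12, 0.21]  # barely moves
adv, adv_padded = [0.9, 0.1, 0.2], [0.1, 0.8, 0.5]         # jumps

print(is_adversarial(clean, clean_padded))  # small shift -> False
print(is_adversarial(adv, adv_padded))      # large shift -> True
```

The key point is that no retraining is involved: detection costs one extra forward pass on the padded image plus a similarity comparison.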

2. The Magic Frame (Adaptation)

If the system catches a "tricked" photo, it doesn't just throw it away. It puts the photo in a custom-made frame.

  • Instead of just using a random white border, the system quickly calculates the perfect border size and color to cancel out the hacker's magic trick.
  • It's like a detective adjusting the lighting in a room until the shadow hiding the criminal disappears.
  • Once the "magic" is neutralized, the robot looks at the photo again and correctly identifies it as a dog.
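The adaptation step can be sketched as a search for the padding that makes the model confident again. The paper optimizes trainable padding; the toy version below substitutes a simple grid search over candidate padding values and picks the one with the lowest prediction entropy. The `toy_classify` model and the candidate values are hypothetical stand-ins for CLIP scoring a padded image against the text prompts.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Lower entropy = more confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adapt_padding(classify, image, pad_values):
    """Pick the padding whose prediction the model is most confident
    about. `classify` maps (image, pad) -> class logits; here it stands
    in for CLIP comparing a padded image to the text prompts."""
    best_pad, best_h = None, float("inf")
    for pad in pad_values:
        h = entropy(softmax(classify(image, pad)))
        if h < best_h:
            best_pad, best_h = pad, h
    return best_pad

# Hypothetical model: padding value 0.5 neutralizes the attack and
# yields confident "dog" logits; the other values stay confused.
def toy_classify(image, pad):
    return [4.0, 0.1] if pad == 0.5 else [1.0, 1.1]

print(adapt_padding(toy_classify, None, [0.0, 0.25, 0.5, 0.75]))  # -> 0.5
```

In the actual method the padding parameters would be updated by gradient descent rather than enumerated, but the objective is the same: adjust the "frame" until the model's prediction stabilizes.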

3. The Panel of Judges (Ensemble)

To be extra sure, the system doesn't just look at the photo once. It creates several slightly different versions of the photo (some with different crops, some with different colors) and asks the robot to look at all of them.

  • It then asks: "Which of these views looks most like the 'real' version of the photo?"
  • It combines the answers from the most reliable views to make the final decision.
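The "panel of judges" can be sketched as confidence-weighted view selection: score each augmented view, keep only the most reliable ones, and average their predictions. The margin-based confidence measure and the 50% keep ratio below are illustrative choices, not necessarily the paper's exact selection rule.

```python
def ensemble_predict(view_logits, keep_ratio=0.5):
    """Average predictions over the most confident augmented views.
    Each entry of `view_logits` is the model's class logits for one
    crop/colour variant; confidence is approximated here by the
    margin between the top two logits."""
    def margin(logits):
        top = sorted(logits, reverse=True)
        return top[0] - top[1]

    ranked = sorted(view_logits, key=margin, reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_ratio))]
    n_classes = len(view_logits[0])
    avg = [sum(v[c] for v in kept) / len(kept) for c in range(n_classes)]
    return avg.index(max(avg))  # index of the winning class

views = [
    [3.0, 0.5],  # confident "dog" view
    [2.8, 0.4],  # confident "dog" view
    [1.0, 1.1],  # ambiguous view (dropped)
    [0.9, 1.0],  # ambiguous view (dropped)
]
print(ensemble_predict(views))  # -> 0 (class 0, "dog")
```

Dropping the ambiguous views prevents a few badly augmented crops from outvoting the reliable ones.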

Why is this a big deal?

  • It's Fast: It doesn't need to retrain the robot. It just works on the fly, like a security guard checking IDs at the door.
  • It's Smart: It knows the difference between a normal photo and a tricked one. It only uses the heavy-duty "magic frame" when it's absolutely necessary.
  • It's Universal: It works on different types of robots (models) and different types of photos (datasets) without needing to be re-tuned.

In Summary:
Think of TTP as a smart bouncer at a club.

  1. He puts a "test sticker" (padding) on everyone's ID.
  2. If the ID looks normal with the sticker, he lets them in immediately (keeping the line moving fast).
  3. If the ID looks weird with the sticker, he knows it's a fake. He then uses a special tool (trainable padding) to peel off the fake layer and reveal the real ID before letting them in.

This keeps the club safe from imposters (adversarial attacks) without slowing down the entry for honest guests (clean data).
