Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP

Imagine you have a super-smart robot assistant named CLIP. This robot is amazing at looking at pictures and understanding what they are. If you show it a picture of a banana, it says, "Banana!" If you show it a gun, it says, "Firearm!" It's used in hospitals, on the internet, and in self-driving cars.

But there's a tricky problem.

The Problem: The "Magic Sticky Note" Trick

Imagine you take a picture of a banana. Then, you stick a bright yellow sticky note on it that says "GUN" in big, bold letters.

If you show this to the robot, it gets confused. Because the robot is so good at reading text, it ignores the banana and screams, "GUN!" It has been tricked.

This is called a Typographic Attack. Bad actors can use this to:

Make a self-driving car think a stop sign is a speed limit sign.
Trick a hospital AI into thinking a harmless skin spot is cancer.
Force a chatbot to say something dangerous (a "jailbreak").

The Old Way: The "Hard-Work" Fix

Scientists tried to fix this before by retraining the robot. They would show it thousands of examples of "fake" pictures and say, "No, that's still a banana!"

The downside: This takes a massive amount of computer power, costs a lot of money, and is slow. It's like trying to fix a leaky faucet by rebuilding the entire house.

The New Solution: Dyslexify

The authors of this paper came up with a clever, "mechanic-style" fix called Dyslexify.

Think of the robot's brain (the neural network) as a giant factory with thousands of workers (called attention heads).

The Investigation: The researchers put on their detective hats and watched the factory. They discovered that when the robot sees text, a specific group of workers in the second half of the factory line suddenly gets very excited. They grab the text, ignore the picture, and shout it to the boss (the final decision-maker).
The Diagnosis: These specific workers are the "typographic specialists." They are the ones causing the robot to get tricked by the sticky notes.
The Cure: Instead of retraining the whole factory, the researchers simply told those specific workers: "Take a break. Ignore the text. Just look at the picture."

They didn't retrain the robot. They didn't teach it new lessons. They just silenced the specific part of the brain that was listening to the text.

Why is this cool?

It's Fast: You don't need a supercomputer. It runs on consumer-grade hardware, even for very large models.
It's Precise: It's like removing a specific bad apple from a basket without throwing away the whole basket. The robot still recognizes bananas, cars, and cats almost as well as before (accuracy on ordinary pictures dropped by less than 1% in nearly all tests).
It's Safe: In the medical tests, they showed that if you put a fake "Malignant" (cancer) label on a harmless skin spot, the normal robot thinks it's cancer. But the Dyslexify robot ignores the fake label and correctly says, "It's just a harmless spot."

The Trade-off (The "Dyslexic" Part)

The paper calls these new robots "Dyslexic." Why?
Because by silencing the text-reading workers, the robot becomes worse at reading text.

If you need a robot to read a street sign or do Optical Character Recognition (OCR), this robot will struggle.
But: That's the point! The researchers say, "If you are using this robot for safety (like in a hospital or a car), you don't want it to be tricked by text. You want it to ignore the text and focus on the real image."

The Analogy: The Security Guard

Imagine a security guard at a museum.

Normal Guard: Sees a painting of a vase. Someone holds up a sign saying "This is a bomb." The guard panics and calls the police.
Old Fix: You spend years training the guard to ignore signs. It takes forever.
Dyslexify Fix: You give the guard glasses that blur out lettering. They can still see the painting clearly, but they literally cannot read the sign being held up in front of them. The painting is safe, and the guard stays calm.

Summary

Dyslexify is a smart, low-effort way to make AI safer. It finds the tiny part of the AI's brain that listens to text, turns it off, and creates a "blind to text" version of the AI. This makes it much harder for hackers to trick it, especially in life-or-death situations like medicine, without needing to spend millions on retraining.

1. Problem Statement

Typographic Attacks pose a significant security risk to Vision-Language Models (VLMs) like CLIP. These attacks involve injecting text into an image (e.g., via stickers, overlays, or printed text) to manipulate the model's output.

Vulnerability: CLIP models, trained on massive image-text datasets, often prioritize textual features over visual ones when text is present. This leads to targeted misclassifications, malicious content generation, and "jailbreaking" of safety filters.
Limitations of Existing Defenses: Current defenses typically rely on gradient-based optimization (e.g., fine-tuning the model or learning a defense prefix). These methods are computationally expensive, require retraining, lack interpretability regarding why the model fails, and do not scale well to billion-parameter models on consumer hardware.

2. Methodology: Dyslexify

The authors propose Dyslexify, a gradient-free, mechanistic defense that intervenes directly in the model's architecture during inference. The approach is based on Mechanistic Interpretability.

A. Mechanistic Analysis

The authors first investigated where and how CLIP processes typographic information:

Layer-wise Probing: They trained linear probes on the cls token embeddings at every layer of OpenCLIP models (ViT-B to ViT-bigG).
- Finding: Object recognition capabilities develop gradually across layers. However, typographic understanding emerges abruptly in the latter half of the model's layers.
Component Analysis: They analyzed the contributions of Attention vs. MLP blocks.
- Finding: Attention layers consistently add linearly decodable information to the cls token, while MLP layers tend to compress or discard it.
Typographic Attention Score ( $T_{i,\ell}$ ): They defined a metric to quantify how much spatial attention a specific attention head $H_{i,\ell}$ dedicates to typographic regions.
- Finding: A small subset of attention heads in the later layers exhibits extremely high scores ( $T_{i,\ell} \geq \mu + 2\sigma$ ), indicating a strong spatial bias toward text. These heads act as "sinks" for typographic information.
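
To make the score and the outlier threshold concrete, here is a minimal sketch in plain Python. It assumes one plausible reading of the metric: the fraction of a head's cls-token attention mass that falls on patches containing text. The paper's exact definition may differ, and the function names, the toy attention values, and the `text_mask` input are hypothetical.

```python
from statistics import mean, pstdev

def typographic_attention_score(cls_attention, text_mask):
    """Fraction of a head's cls-token attention that lands on text patches.

    Assumed definition for illustration only, not necessarily the paper's.
    cls_attention: attention weights from the cls token to each image patch.
    text_mask: booleans, True where a patch overlaps rendered text.
    """
    total = sum(cls_attention)
    if total == 0:
        return 0.0
    on_text = sum(a for a, is_text in zip(cls_attention, text_mask) if is_text)
    return on_text / total

def select_outlier_heads(scores, num_sigmas=2.0):
    """Return the (layer, head) keys whose score is >= mu + num_sigmas * sigma."""
    values = list(scores.values())
    threshold = mean(values) + num_sigmas * pstdev(values)
    return sorted(key for key, s in scores.items() if s >= threshold)

# Toy data: 4 patches, the last two contain text.
text_mask = [False, False, True, True]
diffuse = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly over patches
focused = [0.05, 0.05, 0.45, 0.45]   # attention concentrated on the text

# Eight hypothetical heads in layers 10 and 11; one of them is text-focused.
attention_by_head = {(layer, head): diffuse
                     for layer in (10, 11) for head in range(4)}
attention_by_head[(11, 2)] = focused

scores = {key: typographic_attention_score(attn, text_mask)
          for key, attn in attention_by_head.items()}
outliers = select_outlier_heads(scores)
print(outliers)
```

In this toy setup only the text-focused head clears the $\mu + 2\sigma$ threshold, which mirrors the finding that a small subset of heads stands out.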

B. The Defense Mechanism

Dyslexify constructs a "Typographic Circuit" consisting of these specific high-scoring attention heads and selectively ablates them.

Circuit Construction Algorithm:
1. Rank all attention heads by their Typographic Attention Score ( $T_{i,\ell}$ ).
2. Iteratively add heads to the ablation set $C$ in descending order of their score.
3. Constraint: A head is added only if:
  - It improves robustness on a typographic benchmark ( $\Delta Acc_{typo} > 0$ ).
  - It does not degrade performance on a standard visual benchmark beyond a tolerance threshold $\epsilon$ (e.g., 1%).
4. The process stops if adding a head violates the accuracy constraint or if $k$ consecutive heads are skipped.
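
The steps above can be sketched as a short greedy loop. This is an illustration under stated assumptions rather than the authors' code: it reads step 4 as "stop when a candidate pushes clean accuracy below the tolerance, or after `patience` consecutive candidates fail to improve typographic accuracy". The `evaluate` callback, the `patience` parameter name (standing in for $k$), and the toy effect sizes are all hypothetical.

```python
def build_circuit(scores, evaluate, epsilon=0.01, patience=3):
    """Greedily pick attention heads to ablate.

    scores: dict mapping head id -> typographic attention score.
    evaluate: callable(frozenset of heads) -> (typo_acc, clean_acc),
              measured with those heads ablated (assumed to be provided).
    epsilon: maximum tolerated drop in clean accuracy.
    patience: stop after this many consecutive skipped heads.
    """
    _, base_clean = evaluate(frozenset())
    best_typo, _ = evaluate(frozenset())
    circuit = []
    skipped = 0
    # Visit heads in descending order of their score.
    for head in sorted(scores, key=scores.get, reverse=True):
        typo_acc, clean_acc = evaluate(frozenset(circuit + [head]))
        if base_clean - clean_acc > epsilon:
            break  # accuracy constraint violated: stop the search
        if typo_acc > best_typo:
            circuit.append(head)  # head improves robustness: keep it
            best_typo = typo_acc
            skipped = 0
        else:
            skipped += 1  # no robustness gain: skip this head
            if skipped >= patience:
                break
    return circuit

# Toy evaluator: each head has an additive (typo gain, clean cost).
effects = {"h1": (0.10, 0.002), "h2": (0.00, 0.001),
           "h3": (0.05, 0.003), "h4": (0.02, 0.020)}
scores = {"h1": 0.9, "h2": 0.8, "h3": 0.7, "h4": 0.6}

def toy_evaluate(heads):
    typo = 0.40 + sum(effects[h][0] for h in heads)
    clean = 0.80 - sum(effects[h][1] for h in heads)
    return typo, clean

circuit = build_circuit(scores, toy_evaluate)
print(circuit)
```

Here `h2` is skipped because it brings no robustness gain, and the search stops at `h4` because it would cost more clean accuracy than the tolerance allows. No gradients are involved at any point, only repeated forward-pass evaluations.
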
Ablation Implementation: The method modifies the residual stream of the cls token by zeroing out the contribution of the selected heads ( $H_{i,\ell,cls} \leftarrow 0$ ). This is done without updating model weights or requiring backpropagation.
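
The ablation itself can be illustrated with a small sketch. It relies on the standard fact that a multi-head attention layer's output is a sum of per-head contributions, and it assumes those contributions are already projected into the residual dimension. The function name and the nested-list data layout are hypothetical, chosen so the example runs without any tensor library.

```python
def ablate_cls_heads(head_outputs, ablated_heads, cls_index=0):
    """Sum per-head contributions, dropping ablated heads for the cls token only.

    head_outputs[h][t] is head h's contribution (a list of floats) to the
    residual update of token t. Patch tokens are left untouched.
    """
    num_tokens = len(head_outputs[0])
    dim = len(head_outputs[0][0])
    update = [[0.0] * dim for _ in range(num_tokens)]
    for h, per_token in enumerate(head_outputs):
        for t, vec in enumerate(per_token):
            if t == cls_index and h in ablated_heads:
                continue  # zero this head's contribution to the cls token
            for d, value in enumerate(vec):
                update[t][d] += value
    return update

# Two heads, two tokens (cls first, then one patch), residual width 2.
head_outputs = [
    [[1.0, 0.0], [0.5, 0.5]],  # head 0: contribution to cls, then to patch
    [[0.0, 2.0], [1.0, 1.0]],  # head 1: the hypothetical "typographic" head
]

full = ablate_cls_heads(head_outputs, ablated_heads=set())
ablated = ablate_cls_heads(head_outputs, ablated_heads={1})
print(full[0], ablated[0])  # cls update before and after ablation
```

Only the cls row changes when head 1 is ablated; the patch row is identical in both cases, and no weights are modified.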

3. Key Contributions

Mechanistic Understanding: The paper identifies a sparse set of attention heads in the second half of CLIP's layers that is causally responsible for typographic vulnerability. It introduces the Typographic Attention Score to locate these heads.
Gradient-Free Defense: Dyslexify offers a defense that requires no fine-tuning and no gradients. It scales efficiently to billion-parameter models on consumer-grade hardware.
Causal Validation: Through controlled interventions (manipulating attention sinks), the authors demonstrate a causal link: suppressing these specific heads reduces the model's susceptibility to text-based attacks while preserving object recognition.
Medical Application: The method is validated on a safety-critical medical foundation model (skin lesion diagnosis), demonstrating its utility in a high-stakes domain.
Model Release: The authors release a family of "dyslexic" CLIP models that are robust against typographic attacks.

4. Experimental Results

The authors evaluated Dyslexify across various model sizes (ViT-B, L, H, G, bigG) and datasets.

Robustness Gains:
- On a typographic variant of ImageNet-100, Dyslexify improved accuracy by up to 22.06% (and up to 31% on specific synthetic datasets).
- It consistently outperformed the "Defense-Prefix" baseline on typographic benchmarks.
Preservation of General Capabilities:
- Standard zero-shot classification accuracy on non-typographic datasets (e.g., ImageNet-100, Food-101, Aircraft) dropped by less than 1% in nearly all cases.
- This demonstrates a favorable trade-off: high robustness with minimal loss in general utility.
Medical Use Case:
- In melanoma detection, typographic attacks reduced accuracy by up to 22%. Dyslexify recovered up to 19.3% of this lost accuracy and even improved baseline performance on some non-attacked datasets.
Efficiency:
- Dyslexify is significantly faster than gradient-based defenses (e.g., 3.8x faster on ViT-B) and has a low memory footprint, allowing it to run on models >1B parameters on a single GPU where fine-tuning-based defenses fail due to OOM errors.
Trade-off (OCR):
- As expected, Dyslexify degrades Optical Character Recognition (OCR) capabilities (dropping 8–30% on IIIT5K). The authors argue this is an acceptable trade-off for safety-critical applications where text manipulation is a risk.

5. Significance and Conclusion

Paradigm Shift: The work moves beyond "black-box" adversarial training toward white-box, mechanistic interventions. It shows that a specific model behavior can be surgically suppressed without retraining.
Safety-Critical Deployment: By releasing pre-robustified models, the paper provides a practical "drop-in replacement" for industries (like healthcare) where the risk of adversarial text manipulation outweighs the need for text recognition.
Limitations: The defense focuses on the cls token. Applications relying heavily on spatial tokens (e.g., LLaVA, image generation) might still be vulnerable, as text information could propagate through those pathways. Additionally, the method is not designed for scenarios where OCR is the primary task.

In summary, Dyslexify offers a computationally efficient, interpretable, and highly effective solution to typographic attacks in CLIP, leveraging mechanistic interpretability to surgically remove vulnerability without compromising the model's core visual capabilities.