Imagine you have a very smart, super-advanced robot assistant that can look at pictures and answer questions about them. You ask it, "Is that cup on a table or a hand?" and it confidently says, "On a table." But you know for a fact it's on a hand.
You ask yourself: Why did the robot get that wrong? Did it not see the hand? Did it see the hand but ignore it? Or did it see the hand but think it was a table?
Usually, with these complex AI models, the answer is a mystery. It's like a "black box"—you put data in, and an answer comes out, but you can't see what's happening inside the brain.
This paper introduces VisualScratchpad, a new tool that acts like an X-ray window into these AI models. It lets researchers peek inside the model's "brain" while it's thinking, to see exactly which visual ideas it's grabbing onto and which ones it's ignoring.
Here is how it works, using some simple analogies:
1. The "Dictionary" of Visual Ideas (Sparse Autoencoders)
Inside the AI's brain, information is stored in a messy, tangled way. Imagine a library where every book is a mix of three different stories glued together. It's hard to find just the story about "cats."
The researchers use a tool called a Sparse Autoencoder to untangle this mess. Think of it as a magical librarian who takes those glued-together books and separates them into individual, clean pages. Now, instead of a messy mix, the AI has a neat dictionary where one page is purely "red," another is "striped," and another is "gloves." This makes it much easier to see what specific visual idea the AI is looking at.
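The paper's actual code isn't shown here, but the "magical librarian" idea can be sketched in a few lines of plain NumPy. This is a minimal, untrained sparse autoencoder: a wide encoder turns a tangled activation vector into a larger set of feature activations, and a decoder reconstructs the original from them. All the sizes and weights below are made-up placeholders; in a real system the weights would be trained so that only a few "clean pages" fire per input.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 8, 32  # size of the tangled activation, size of the dictionary
W_enc = rng.normal(size=(d_model, d_dict)) * 0.1
b_enc = np.zeros(d_dict)
W_dec = rng.normal(size=(d_dict, d_model)) * 0.1

def encode(x):
    # ReLU clips negatives; after training with a sparsity penalty,
    # only a handful of dictionary features fire for any one input
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(f):
    # Rebuild the original tangled activation from the clean features
    return f @ W_dec

x = rng.normal(size=d_model)  # one "glued-together book"
f = encode(x)                 # its separated "pages" (feature activations)
x_hat = decode(f)             # reconstruction of the original

# Training minimizes reconstruction error plus an L1 penalty that
# pushes most features to zero -- that penalty is what makes it "sparse"
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
```

The key design point is that the dictionary is much bigger than the original activation (32 vs. 8 here), which gives each individual idea like "red" or "gloves" room to claim its own feature instead of sharing.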
2. The "Spotlight" Connection (Attention Maps)
Once the AI has its clean dictionary of visual ideas, it needs to decide which ones matter for the question you asked.
VisualScratchpad uses a Spotlight (called an attention map). Imagine the AI is reading a question like "Is the cup on a hand?" As it reads the word "hand," the Spotlight shines brightly on the part of the image showing the hand. VisualScratchpad connects the "hand" word in the question to the "gloves" or "skin" pages in the visual dictionary.
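The "Spotlight" is ordinary scaled dot-product attention. Here is a toy sketch, assuming made-up embeddings: a query vector for the word "hand" and keys for four image patches, where patch 2 is deliberately constructed to match the query. The softmax turns the match scores into a spotlight that shines brightest on the matching patch.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 16
q_hand = np.ones(d)                       # hypothetical embedding for the word "hand"
patch_keys = 0.1 * rng.normal(size=(4, d))  # four image-patch keys (mostly noise)
patch_keys[2] = q_hand                    # patch 2 "contains the hand"

scores = patch_keys @ q_hand / np.sqrt(d)        # scaled dot-product similarity
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> the "spotlight"

brightest = weights.argmax()  # patch 2: the word "hand" lights up the hand patch
```

VisualScratchpad's extra step is to then look up which dictionary features (like "gloves" or "skin") are active inside that brightest patch, linking the word to specific visual ideas.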
This tells us: Okay, the AI saw the hand, and it knows the word "hand." So why did it still get it wrong?
3. The "Control Panel" (Causal Analysis)
This is the coolest part. VisualScratchpad isn't just a camera; it's a remote control.
If the AI gets it wrong, researchers can use the tool to say, "Let's turn off the 'gloves' idea" or "Let's turn up the volume on the 'sitting' idea."
- The Experiment: They "ablate" (turn off) a specific visual concept.
- The Result: If the AI suddenly changes its answer from "sitting" to "standing," we know for sure that the "sitting" concept was the culprit. It's like pulling a specific wire in a machine to see which light bulb goes out.
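The two knobs above, ablating a feature and turning one up, can be sketched as simple edits to the sparse feature vector. Everything here is a stand-in: the feature labels, the values, and the one-line `answer` readout are invented for illustration, not the paper's actual model.

```python
import numpy as np

# Hypothetical sparse features for one image region.
# Index 0 = the "sitting" concept, index 1 = the "standing" concept.
features = np.array([0.9, 0.3, 0.0, 0.0])

def answer(f):
    # Toy stand-in for the model's downstream readout:
    # whichever concept is stronger wins
    return "sitting" if f[0] > f[1] else "standing"

original = answer(features)   # "sitting"

ablated = features.copy()
ablated[0] = 0.0              # ablate: turn off the "sitting" feature
after_ablation = answer(ablated)   # flips to "standing"

steered = features.copy()
steered[1] *= 5.0             # steer: turn up the "standing" volume
after_steering = answer(steered)   # also "standing"
```

Because editing exactly one feature flips the answer, the experiment licenses a causal claim, not just a correlation: that feature was doing the work.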
The Three Big Mistakes They Found
Using this tool, the researchers found three common reasons why these smart robots fail:
The "Lost in Translation" Problem: The AI saw the hand and the glove perfectly, but it couldn't connect the word "hand" to the picture of the glove. It's like knowing the word "dog" and seeing a picture of a dog, but your brain just won't click them together.
- Fix: When the researchers asked, "Is the cup on a hand with a mitten?", the AI finally made the connection and got it right.
The "Bad Hunch" Problem: The AI saw a walker (a device for walking) and immediately thought, "Oh, people with walkers sit in wheelchairs!" It ignored the fact that the person was actually standing. It relied on a misleading clue.
- Fix: When they turned off the "wheelchair" idea in the AI's brain, it suddenly realized the person was standing.
The "Hidden Clue" Problem: Sometimes the AI sees two possible answers at once (like an optical illusion that looks like both a duck and a rabbit). It picks one (the duck) but ignores the other (the rabbit) even though it's there.
- Fix: When they turned down the "duck" volume and turned up the "rabbit" volume, the AI changed its answer. This proves the AI knew about the rabbit all along; it just chose to ignore it.
Why This Matters
Before this tool, debugging AI was like trying to fix a car engine by guessing which part is broken. VisualScratchpad gives mechanics a diagnostic computer that shows exactly which wire is loose.
It helps us build AI that is more trustworthy, less prone to silly mistakes, and easier to understand. Instead of just saying "the AI is wrong," we can now say, "The AI saw the hand, but it forgot to connect it to the word 'hand'." That is a huge step toward making AI safe and reliable for everyone.