Imagine you have a super-smart robot assistant that can look at a picture and answer questions about it, like a human. You ask, "How many fingers is that person holding up?" and it says, "Five." But if you show it a hand with six fingers, it might still say "Five," because it expects hands to have five fingers.
For a long time, we've treated these robots (called Vision-Language Models or VLMs) like black boxes. We know what goes in (a photo and a question) and what comes out (an answer), but we have no idea what's happening inside the box. It's like trying to understand how a car engine works by only looking at the steering wheel and the gas pedal.
This paper introduces a new way to take the hood off and see the engine running. The authors built a "circuit tracing" framework that lets us see exactly how the robot connects the dots between what it sees and what it thinks.
Here is how they did it, explained with some everyday analogies:
1. The "Translator" (Transcoders)
Inside the robot's brain, information is stored in a messy, jumbled language that only the machine understands. It's like a giant pile of LEGO bricks where every color is mixed together, and one brick might represent "red," "car," and "fast" all at once. This makes it impossible to tell what the robot is actually thinking.
The researchers built a special tool called a Transcoder. Think of this as a high-tech translator or a sorting machine.
- It takes that messy pile of LEGO bricks.
- It sorts them out so that each brick now has only one clear meaning (e.g., one brick is just "red," another is just "wheel").
- Now, instead of a jumbled mess, we have a clean, organized library of specific ideas.
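The sorting idea above can be sketched in a few lines. This is a minimal toy, not the authors' implementation: a transcoder learns a wide, sparse set of features that reconstruct a layer's behavior, and the sparsity penalty is what pushes each feature toward a single clear meaning. All sizes and weights here are made up (and untrained), just to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 16-dim "messy" activation, sorted into 64 candidate features.
d_model, n_features = 16, 64

# In practice these weights are learned; here they are random placeholders.
W_enc = rng.normal(0, 0.1, (n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(0, 0.1, (d_model, n_features))

def transcode(x):
    """Sort a dense activation into sparse features (the 'sorted bricks')."""
    return np.maximum(W_enc @ x + b_enc, 0.0)  # ReLU keeps only active ideas

def reconstruct(f):
    """Rebuild a dense activation from the sorted features."""
    return W_dec @ f

x = rng.normal(size=d_model)   # a jumbled activation (the 'mixed LEGO pile')
f = transcode(x)               # sparse feature vector, mostly zeros after training
x_hat = reconstruct(f)

# Training minimizes reconstruction error plus an L1 sparsity penalty:
loss = np.sum((x - x_hat) ** 2) + 0.01 * np.sum(np.abs(f))
```

(One nuance: this toy reconstructs the same activation, like a sparse autoencoder; the paper's transcoders predict a layer's output from its input, but the sorting principle is the same.)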
2. The "Family Tree" (Attribution Graphs)
Once the ideas are sorted, the researchers wanted to know: How does the robot get from "seeing a picture of Mars" to "thinking about a Space Shuttle"?
They drew a Family Tree (or a flowchart) called an Attribution Graph.
- Imagine a detective tracing a rumor. They start with the final answer ("Space Shuttle") and work backward.
- They ask: "Who told you that?" The graph shows that a specific "Mars" idea passed a message to a "Red Planet" idea, which then passed a message to a "Space Shuttle" idea.
- This map shows the cause-and-effect chain. It proves that the robot didn't just guess; it followed a specific path of logic.
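The detective's backward trace can be sketched as a tiny two-layer example. Everything here is hypothetical (the feature names, activations, and weights are invented for illustration): the strength of an edge in the graph is roughly "how active the upstream feature was, times how strongly it connects to the downstream feature."

```python
import numpy as np

# Hypothetical early-layer features and how active each one is on a Mars photo.
feature_names_l1 = ["red circle", "white text", "starfield"]
acts_l1 = np.array([0.9, 0.1, 0.4])

# Invented connection strengths from early features to later concept features.
feature_names_l2 = ["Mars", "moon"]
W = np.array([[1.2, 0.0, 0.3],   # row 0: edges into "Mars"
              [0.1, 0.2, 0.8]])  # row 1: edges into "moon"

# Work backward from the final "Mars" feature: who told you that?
target = feature_names_l2.index("Mars")
contrib = acts_l1 * W[target]            # each upstream feature's share
for i in np.argsort(contrib)[::-1]:
    print(f"{feature_names_l1[i]} -> Mars: {contrib[i]:.2f}")
```

Here the trace would report that "red circle" is the dominant cause of the "Mars" feature firing, which is exactly the kind of cause-and-effect edge the attribution graph records.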
3. The "Remote Control" (Intervention)
The most exciting part is that they didn't just watch; they tweaked the system to prove their theory. This is like having a remote control for the robot's brain.
- The Experiment: They found the specific "circuit" (the path of ideas) that the robot uses to recognize a picture of Mars.
- The Switch: They turned off the "Mars" signal and turned on the "Earth" signal in the same spot.
- The Result: Suddenly, the robot stopped talking about Mars and started talking about Earth, even though the picture was still of Mars!
- What this means: This proves that the circuit they found is the real reason the robot gave that answer. If you break the circuit, the behavior breaks.
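The experiment above boils down to clamping one feature off and another on, then watching the answer flip. Here is a minimal sketch with an invented feature dictionary and a toy "readout" that just picks the most active concept; the real intervention edits activations inside the network, but the logic is the same.

```python
# Hypothetical feature activations for a photo of Mars.
features = {"Mars": 2.3, "Earth": 0.0, "planet": 1.1}

def answer(feats):
    """Toy readout: answer with whichever planet concept is most active."""
    concepts = {k: v for k, v in feats.items() if k in ("Mars", "Earth")}
    return max(concepts, key=concepts.get)

before = answer(features)       # the natural behavior: "Mars"

# The switch: turn the Mars signal off and the Earth signal on,
# leaving everything else (including the picture) untouched.
features["Mars"] = 0.0
features["Earth"] = 2.3
after = answer(features)        # the behavior flips to "Earth"

print(before, "->", after)
```

If flipping just that one feature flips the answer, the feature is not a bystander; it is a causal part of the circuit.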
What Did They Discover?
By using this new "X-ray vision," they found some fascinating things about how these robots think:
- The Assembly Line: The robot processes images in steps. At the bottom (early layers), it just sees shapes and colors (like "red circle"). As the information moves up, it starts combining these shapes into concepts (like "planet"). The "magic" of mixing vision and language happens in the middle and top layers.
- The "Six-Finger" Mistake: Why do robots sometimes count fingers wrong? They found that the robot's "eyes" (the vision part) send a strong signal saying "Hand," and the robot's "brain" (the language part) gets so excited about the word "Hand" that it ignores the actual count. It's like a student who blurts out "5" because that's the answer the teacher usually wants, ignoring the six fingers actually in the picture.
- Hidden Associations: The robot has secret connections. If you show it a picture of Mars, it doesn't just think "Planet"; it also lights up the "Space Shuttle" circuit in its brain, even if you didn't ask about a shuttle. It's like seeing a picture of a beach and suddenly thinking about "ice cream" because your brain associates the two.
Why Does This Matter?
Before this, if a robot made a mistake, we had to guess why. Now, we can look at the circuit map and say, "Ah, the robot is confused because it's mixing up the 'Hand' signal with the 'Count' signal."
This is a huge step toward making AI transparent, trustworthy, and safe. It's the difference between blindly trusting a black box and understanding the engine well enough to fix it when it sputters. The authors have even made their tools open-source, so other scientists can start taking apart and understanding these complex machines too.