Circuit Insights: Towards Interpretability Beyond Activations

This paper introduces WeightLens and CircuitLens, two complementary methods that advance mechanistic interpretability by analyzing feature weights and component interactions directly. In doing so, they overcome the scalability and robustness limitations of activation-based approaches and capture circuit-level dynamics without relying on external explainer models or datasets.

Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, Sebastian Lapuschkin

Published 2026-03-05

Imagine a Large Language Model (LLM) like a massive, high-tech kitchen where a giant robot chef is cooking up sentences. We know the robot can make delicious meals (write great text), but we have no idea how it does it. Is it following a secret recipe? Is it guessing? Is it just memorizing?

For a long time, scientists trying to understand this robot chef (a field called Interpretability) had two main problems:

  1. The "Black Box" Problem: They could only watch the robot's hands (the final output) or peek at the ingredients it grabbed at the very last second (activations). They couldn't see the internal wiring.
  2. The "Human Bottleneck": To understand what the robot was thinking, they had to hire a second, even bigger robot (an AI explainer) to look at the data and guess what the first robot was doing. This just pushed the mystery one step back.

This paper, "Circuit Insights," introduces two new tools, WeightLens and CircuitLens, that act like X-ray glasses and a flow-chart detective, allowing us to see exactly how the robot's brain works without needing a second robot to guess for us.

Here is a simple breakdown of how they work:

1. The Problem with the Old Way

Previously, scientists tried to understand the robot by watching which ingredients it grabbed most often when cooking a specific dish.

  • The Flaw: Sometimes the robot grabs "salt" not because it wants salt, but because the recipe called for "salty soup." The ingredient (activation) doesn't tell the whole story; the context matters.
  • The Result: The explanations were often vague, like saying, "The robot is thinking about food," which isn't very helpful.

2. The New Tool: WeightLens (The "Blueprint" Reader)

WeightLens is like reading the robot's permanent blueprint instead of watching it cook.

  • How it works: Instead of waiting to see what the robot grabs in a specific situation, WeightLens looks at the permanent connections (weights) inside the robot's brain. It asks: "If this part of the brain is turned on, what other parts are permanently wired to it?"
  • The Analogy: Imagine a light switch in your house. You don't need to wait for someone to flip it to know what light it controls; you just look at the wiring diagram. WeightLens looks at the wiring.
  • The Benefit: It can instantly tell you what a specific part of the robot's brain is good at (e.g., "This neuron loves the word 'apple'") without needing a massive dataset or a second AI to guess. It's fast, efficient, and works great for simple, clear tasks.
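The "read the wiring diagram" idea can be sketched in a few lines. This is a toy illustration, not the paper's actual implementation: the vocabulary, sizes, and variable names below are made up. The point is that a neuron's outgoing weights are fixed, so projecting them through the model's unembedding matrix reveals which tokens the neuron is permanently wired to promote, with no dataset needed.

```python
import numpy as np

# Toy setup (illustrative only): hidden size 4, vocabulary of 5 words.
rng = np.random.default_rng(0)
vocab = ["apple", "banana", "soup", "salt", "time"]
W_U = rng.normal(size=(4, 5))  # unembedding: hidden state -> vocab logits

# A neuron's outgoing weights are a fixed vector in the network --
# the "wiring", available without running the model on any input.
neuron_out = np.array([1.0, 0.2, -0.5, 0.1])

# Project the neuron's output weights through the unembedding to see
# which tokens this neuron pushes the model toward when it fires.
token_scores = neuron_out @ W_U
top = [vocab[i] for i in np.argsort(token_scores)[::-1][:2]]
print("tokens this neuron promotes:", top)
```

Because this reads static weights rather than recorded activations, it scales to any neuron in the model without collecting a dataset of example prompts.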

3. The New Tool: CircuitLens (The "Flow Detective")

CircuitLens is for the messy, complicated stuff where the blueprint isn't enough because the robot changes its mind based on the conversation.

  • How it works: Sometimes the robot only grabs "salt" if the sentence starts with "The soup is..." CircuitLens traces the entire path of the signal. It follows the signal from the input (the prompt), through the robot's internal "circuits" (attention heads), all the way to the output (the final word).
  • The Analogy: Imagine a game of "Telephone" played in a crowded room.
    • Old way: You just listen to the final message.
    • CircuitLens: It records who whispered to whom, who ignored the message, and who amplified it. It groups similar conversations together.
  • The Benefit: It handles "polysemanticity," when a single neuron or feature represents several unrelated concepts. It recognizes that a neuron may fire for "money" in some contexts and for "time" in others, and it separates those cases into distinct groups so each one gets its own clear explanation.
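The "group similar conversations" step can be sketched as clustering the contexts in which a feature fires. Everything below is an illustrative assumption, not the paper's code: the contexts are fake 2-D vectors standing in for the real circuit-level attributions, and the clustering is a tiny hand-rolled k-means.

```python
import numpy as np

# A hypothetical polysemantic feature fires in two kinds of contexts:
# "money" sentences and "time" sentences. Represent each activating
# context by a toy 2-D vector (stand-in for real attribution features).
rng = np.random.default_rng(1)
money_ctx = rng.normal(loc=[5.0, 0.0], scale=0.3, size=(10, 2))
time_ctx = rng.normal(loc=[0.0, 5.0], scale=0.3, size=(10, 2))
X = np.vstack([money_ctx, time_ctx])

# Tiny k-means with 2 clusters: assign each context to its nearest
# center, then move each center to the mean of its assigned contexts.
centers = X[[0, -1]].copy()
for _ in range(10):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])

# The two senses land in separate clusters, so each cluster can be
# explained on its own ("money" feature vs. "time" feature).
print(labels)
```

Once the contexts are split like this, an explanation is generated per cluster instead of one muddled explanation for the whole feature.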

4. Why This Matters (The Big Picture)

The authors combined these two tools to create a super-powerful way to understand AI:

  • No More Guessing: We don't need to rely on a second, giant AI to explain the first one. We can look at the math and the wiring directly.
  • Scalability: This works for huge models. It's like upgrading from trying to understand a city by walking every street (manual inspection) to using a satellite map (automated analysis).
  • Safety: If we understand exactly how the robot thinks, we can spot if it's about to say something dangerous or lie. It's like having a safety inspector who knows exactly which wire to cut if the robot starts acting up.

Summary

  • WeightLens looks at the wiring diagram to see what a part of the brain is built to do.
  • CircuitLens follows the traffic flow to see what a part of the brain is actually doing in a specific situation.

Together, they let us peek inside the "black box" of AI, turning a mysterious magic trick into a clear, understandable machine. Instead of saying "The AI is smart," we can finally say, "The AI is smart because this specific circuit connects the word 'doctor' to the concept of 'medicine'."