Circuit Insights: Towards Interpretability Beyond Activations

This paper introduces WeightLens and CircuitLens, two complementary methods that advance mechanistic interpretability by analyzing feature weights and component interactions directly. In doing so, they overcome the scalability and robustness limitations of activation-based approaches and capture circuit-level dynamics without relying on external explainer models or datasets.

Elena Golimblevskaia, Aakriti Jain, Bruno Puri, Ammar Ibrahim, Wojciech Samek, Sebastian Lapuschkin

Published 2026-03-05

Imagine a Large Language Model (LLM) like a massive, high-tech kitchen where a giant robot chef is cooking up sentences. We know the robot can make delicious meals (write great text), but we have no idea how it does it. Is it following a secret recipe? Is it guessing? Is it just memorizing?

For a long time, scientists trying to understand this robot chef (a field called Interpretability) had two main problems:

  1. The "Black Box" Problem: They could only watch the robot's hands (the final output) or peek at the ingredients it grabbed at the very last second (activations). They couldn't see the internal wiring.
  2. The "Human Bottleneck": To understand what the robot was thinking, they had to hire a second, even bigger robot (an AI explainer) to look at the data and guess what the first robot was doing. This just pushed the mystery one step back.

This paper, "Circuit Insights," introduces two new tools, WeightLens and CircuitLens, that act like X-ray glasses and a flow-chart detective, allowing us to see exactly how the robot's brain works without needing a second robot to guess for us.

Here is a simple breakdown of how they work:

1. The Problem with the Old Way

Previously, scientists tried to understand the robot by watching which ingredients it grabbed most often when cooking a specific dish.

  • The Flaw: Sometimes the robot grabs "salt" not because it wants salt, but because the recipe called for "salty soup." The ingredient (activation) doesn't tell the whole story; the context matters.
  • The Result: The explanations were often vague, like saying, "The robot is thinking about food," which isn't very helpful.

2. The New Tool: WeightLens (The "Blueprint" Reader)

WeightLens is like reading the robot's permanent blueprint instead of watching it cook.

  • How it works: Instead of waiting to see what the robot grabs in a specific situation, WeightLens looks at the permanent connections (weights) inside the robot's brain. It asks: "If this part of the brain is turned on, what other parts are permanently wired to it?"
  • The Analogy: Imagine a light switch in your house. You don't need to wait for someone to flip it to know what light it controls; you just look at the wiring diagram. WeightLens looks at the wiring.
  • The Benefit: It can instantly tell you what a specific part of the robot's brain is good at (e.g., "This neuron loves the word 'apple'") without needing a massive dataset or a second AI to guess. It's fast, efficient, and works great for simple, clear tasks.
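The "read the wiring diagram" idea can be sketched in a few lines. This is a toy illustration, not the paper's actual implementation: the vocabulary, sizes, and variable names below are made up. The point is that a neuron's outgoing weights are fixed, so projecting them through the model's unembedding matrix reveals which tokens the neuron is permanently wired to promote, with no dataset needed.

```python
import numpy as np

# Toy setup (illustrative only): hidden size 4, vocabulary of 5 words.
rng = np.random.default_rng(0)
vocab = ["apple", "banana", "soup", "salt", "time"]
W_U = rng.normal(size=(4, 5))  # unembedding: hidden state -> vocab logits

# A neuron's outgoing weights are a fixed vector in the network --
# the "wiring", available without running the model on any input.
neuron_out = np.array([1.0, 0.2, -0.5, 0.1])

# Project the neuron's output weights through the unembedding to see
# which tokens this neuron pushes the model toward when it fires.
token_scores = neuron_out @ W_U
top = [vocab[i] for i in np.argsort(token_scores)[::-1][:2]]
print("tokens this neuron promotes:", top)
```

Because this reads static weights rather than recorded activations, it scales to any neuron in the model without collecting a dataset of example prompts.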

3. The New Tool: CircuitLens (The "Flow Detective")

CircuitLens is for the messy, complicated stuff where the blueprint isn't enough because the robot changes its mind based on the conversation.

  • How it works: Sometimes the robot only grabs "salt" if the sentence starts with "The soup is..." CircuitLens traces the entire path of the signal. It follows the signal from the input (the prompt), through the robot's internal "circuits" (attention heads), all the way to the output (the final word).
  • The Analogy: Imagine a game of "Telephone" played in a crowded room.
    • Old way: You just listen to the final message.
    • CircuitLens: It records who whispered to whom, who ignored the message, and who amplified it. It groups similar conversations together.
  • The Benefit: It handles "polysemanticity," when a single neuron or feature represents several unrelated concepts. It recognizes that a neuron may fire for "money" in some contexts and for "time" in others, and it separates those cases into distinct groups so each one gets its own clear explanation.
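The "group similar conversations" step can be sketched as clustering the contexts in which a feature fires. Everything below is an illustrative assumption, not the paper's code: the contexts are fake 2-D vectors standing in for the real circuit-level attributions, and the clustering is a tiny hand-rolled k-means.

```python
import numpy as np

# A hypothetical polysemantic feature fires in two kinds of contexts:
# "money" sentences and "time" sentences. Represent each activating
# context by a toy 2-D vector (stand-in for real attribution features).
rng = np.random.default_rng(1)
money_ctx = rng.normal(loc=[5.0, 0.0], scale=0.3, size=(10, 2))
time_ctx = rng.normal(loc=[0.0, 5.0], scale=0.3, size=(10, 2))
X = np.vstack([money_ctx, time_ctx])

# Tiny k-means with 2 clusters: assign each context to its nearest
# center, then move each center to the mean of its assigned contexts.
centers = X[[0, -1]].copy()
for _ in range(10):
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    centers = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])

# The two senses land in separate clusters, so each cluster can be
# explained on its own ("money" feature vs. "time" feature).
print(labels)
```

Once the contexts are split like this, an explanation is generated per cluster instead of one muddled explanation for the whole feature.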

4. Why This Matters (The Big Picture)

The authors combined these two tools to create a super-powerful way to understand AI:

  • No More Guessing: We don't need to rely on a second, giant AI to explain the first one. We can look at the math and the wiring directly.
  • Scalability: This works for huge models. It's like upgrading from trying to understand a city by walking every street (manual inspection) to using a satellite map (automated analysis).
  • Safety: If we understand exactly how the robot thinks, we can spot if it's about to say something dangerous or lie. It's like having a safety inspector who knows exactly which wire to cut if the robot starts acting up.

Summary

  • WeightLens looks at the wiring diagram to see what a part of the brain is built to do.
  • CircuitLens follows the traffic flow to see what a part of the brain is actually doing in a specific situation.

Together, they let us peek inside the "black box" of AI, turning a mysterious magic trick into a clear, understandable machine. Instead of saying "The AI is smart," we can finally say, "The AI is smart because this specific circuit connects the word 'doctor' to the concept of 'medicine'."