Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

This paper presents a pipeline that bridges mechanistic interpretability and natural language explanations by identifying causally important attention heads in GPT-2 Small, generating high-quality explanations via LLMs, and evaluating their faithfulness to reveal that while explanations can be sufficient, they often lack comprehensiveness due to distributed backup mechanisms.

Ajay Pravin Mahale

Published Thu, 12 Ma

Imagine you have an incredibly smart, but completely opaque, robot chef. This chef can cook a perfect meal (answer a question) every time, but if you ask, "How did you do that?" the chef just stares back with a blank screen. You can see the ingredients it grabbed (the words it paid attention to), but you don't know why it grabbed them, or whether those ingredients actually caused the delicious taste.

This paper is about building a translator that turns the robot's secret internal wiring diagrams into a story a human can understand.

Here is the breakdown of their work using simple analogies:

1. The Problem: The "Black Box" Chef

Large Language Models (like the one in this study) are like giant, complex kitchens. Inside, there are thousands of tiny workers (called "neurons" or "attention heads").

  • Old way: Researchers used to just look at which workers were "looking" at the ingredients (Attention Weights). But the paper says this is like watching a worker glance at a tomato and assuming they used the tomato. They might just be looking at it, but someone else actually chopped it.
  • The Goal: The authors wanted to find the actual workers who did the heavy lifting and then write a plain-English story about what those specific workers did.

2. The Method: The "Surgery" and the "Translator"

The authors built a three-step pipeline to solve this:

  • Step 1: The Surgery (Activation Patching)
    Imagine the robot chef is cooking a dish. The researchers perform "surgery" on the kitchen. They swap the positions of two ingredients (e.g., swapping "Mary" and "John" in a sentence) and see how the robot's brain reacts.

    • If the robot gets confused when they swap the names, they know, "Aha! This specific worker is the one responsible for tracking names."
    • They found 6 specific workers (out of thousands) who were doing 61% of the actual work to get the answer right.
  • Step 2: The Translator (Generating Explanations)
    Now that they know who did the work, they need to explain it to a human. They tried two ways:

    • The Robot Script (Template): A pre-written sentence like, "The robot picked 'Mary' because Worker A looked at her." (Boring, generic, and often missing details).
    • The Storyteller (LLM): They fed the data about the 6 workers into another AI and asked it to write a natural story. "GPT-2 picked 'Mary' because Worker A focused 66% of its energy on her, while ignoring John."
    • Result: The Storyteller was 66% better at writing a clear, accurate explanation than the Robot Script.
  • Step 3: The Lie Detector (Faithfulness Check)
    How do we know the story is true? They used a "Lie Detector" test (called ERASER metrics):

    • Sufficiency Test: "If we only use the workers mentioned in the story, can the robot still cook the meal?"
      • Result: 100% Yes. The story identified the main chefs perfectly.
    • Comprehensiveness Test: "If we remove the workers mentioned in the story, does the robot fail?"
      • Result: Only 22% Yes. This is the big surprise. Even if you fire the 6 main workers, the robot can still cook the meal, just a little worse.
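Step 1's "surgery" can be sketched with a toy stand-in (invented here for illustration; the paper works on GPT-2 Small's real attention heads, not this three-number model). The idea: run the model on a clean prompt and a corrupted one (names swapped), then patch one head's clean activation into the corrupted run and measure how much of the original answer that single head restores.

```python
# Toy sketch of activation patching. The "model" is a crude stand-in:
# its output is just the sum of per-head activation contributions.

def run_model(head_activations):
    # Output logit = sum of per-head contributions (hypothetical toy model).
    return sum(head_activations)

clean = [0.9, 0.1, 0.05]      # per-head activations on the clean prompt
corrupted = [0.2, 0.1, 0.05]  # same heads on the name-swapped prompt

clean_out = run_model(clean)
corrupt_out = run_model(corrupted)

def patching_effect(head_idx):
    """Patch one clean head activation into the corrupted run and
    report what fraction of the clean-vs-corrupted gap it restores."""
    patched = list(corrupted)
    patched[head_idx] = clean[head_idx]
    return (run_model(patched) - corrupt_out) / (clean_out - corrupt_out)

effects = {i: patching_effect(i) for i in range(3)}
# Head 0 restores the full gap (effect 1.0) -> causally important.
# Heads 1 and 2 restore nothing (effect 0.0) -> bystanders.
```

This is exactly the "swap the names and watch which worker gets confused" experiment: a large patching effect means the head is doing real causal work, not just glancing at the tomato.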

3. The Big Surprise: The "Backup Plan"

The most interesting finding is the gap between Sufficiency (100%) and Comprehensiveness (22%).

Think of it like a football team.

  • The explanation says: "The quarterback (Worker A) threw the winning pass." (True! He did it 100% of the time).
  • But when you ask, "What if the quarterback gets injured?" the team doesn't collapse. The backup quarterback, the running back, and the receiver all step up and still manage to score, just not as elegantly.

The robot has distributed backup mechanisms. It doesn't rely on just one path; it has redundant paths. This makes the robot very robust (hard to break), but it makes it very hard to explain simply because there isn't just one reason it got the answer right.
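The sufficiency/comprehensiveness gap falls straight out of redundancy, which a toy sketch makes concrete (the head sets and scoring function below are invented for illustration, not the paper's code):

```python
# Toy sketch of ERASER-style sufficiency vs. comprehensiveness on a
# redundant circuit: several disjoint head subsets can each produce
# the answer on their own (hypothetical "backup paths").

ALL_HEADS = {0, 1, 2, 3, 4, 5, 6, 7}
EXPLAINED = {0, 1}  # the heads named in the explanation

def model_score(active_heads):
    backups = [{0, 1}, {2, 3}, {4, 5}]  # any one complete path suffices
    return 1.0 if any(path <= active_heads for path in backups) else 0.0

full = model_score(ALL_HEADS)

# Sufficiency: keep ONLY the explained heads -- does the answer survive?
sufficiency = model_score(EXPLAINED) / full

# Comprehensiveness: REMOVE the explained heads -- does the answer drop?
comprehensiveness = full - model_score(ALL_HEADS - EXPLAINED)
# Backup paths {2, 3} and {4, 5} keep the model working, so removing the
# explained heads barely hurts: high sufficiency, low comprehensiveness.
```

The explained heads really are enough on their own (sufficiency 1.0), yet firing them costs nothing (comprehensiveness 0.0), because the backup quarterbacks step up, just as in the football analogy.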

4. The Warning: Confidence ≠ Truth

The study found something scary for users: The robot's confidence is a lie.

  • If the robot says, "I am 99% sure of this answer," you might think the explanation is solid.
  • The study found zero correlation between how confident the robot is and how accurate the explanation is.
  • Analogy: It's like a student who guesses the answer on a test with 100% confidence but actually got it right by luck. You can't trust their "I know this!" feeling to tell you if their reasoning is sound.

Summary

This paper built a tool that:

  1. Finds the real "workers" inside the AI using surgery-like experiments.
  2. Writes a human-readable story about those workers using a smart translator.
  3. Warns us that even when the AI is 100% right, the story we tell about why it's right might only capture a small part of the truth because the AI has secret backup plans.

The Takeaway: We can finally explain how AI works, but we must be careful not to oversimplify. The AI is smarter and more complex than a single sentence can describe, and its confidence doesn't guarantee its reasoning is simple.