Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

This paper presents a pipeline that bridges mechanistic interpretability and natural language explanations by identifying causally important attention heads in GPT-2 Small, generating high-quality explanations via LLMs, and evaluating their faithfulness to reveal that while explanations can be sufficient, they often lack comprehensiveness due to distributed backup mechanisms.

Ajay Pravin Mahale

Published Thu, 12 Ma

Imagine you have an incredibly smart, but completely opaque, robot chef. This chef can cook a perfect meal (answer a question) every time, but if you ask, "How did you do that?" the chef just stares back with a blank screen. You can see the ingredients it grabbed (the words it paid attention to), but you don't know why it grabbed them, or whether those ingredients actually caused the delicious taste.

This paper is about building a translator that turns the robot's secret internal wiring diagrams into a story a human can understand.

Here is the breakdown of their work using simple analogies:

1. The Problem: The "Black Box" Chef

Large Language Models (like the one in this study) are like giant, complex kitchens. Inside, there are thousands of tiny workers (called "neurons" or "attention heads").

  • Old way: Researchers used to just look at which workers were "looking" at the ingredients (Attention Weights). But the paper says this is like watching a worker glance at a tomato and assuming they used the tomato. They might just be looking at it, but someone else actually chopped it.
  • The Goal: The authors wanted to find the actual workers who did the heavy lifting and then write a plain-English story about what those specific workers did.

2. The Method: The "Surgery" and the "Translator"

The authors built a three-step pipeline to solve this:

  • Step 1: The Surgery (Activation Patching)
    Imagine the robot chef is cooking a dish. The researchers perform "surgery" on the kitchen. They swap the positions of two ingredients (e.g., swapping "Mary" and "John" in a sentence) and see how the robot's brain reacts.

    • If the robot gets confused when they swap the names, they know, "Aha! This specific worker is the one responsible for tracking names."
    • They found 6 specific workers (out of thousands) who were doing 61% of the actual work to get the answer right.
  • Step 2: The Translator (Generating Explanations)
    Now that they know who did the work, they need to explain it to a human. They tried two ways:

    • The Robot Script (Template): A pre-written sentence like, "The robot picked 'Mary' because Worker A looked at her." (Boring, generic, and often missing details).
    • The Storyteller (LLM): They fed the data about the 6 workers into another AI and asked it to write a natural story. "GPT-2 picked 'Mary' because Worker A focused 66% of its energy on her, while ignoring John."
    • Result: The Storyteller was 66% better at writing a clear, accurate explanation than the Robot Script.
  • Step 3: The Lie Detector (Faithfulness Check)
    How do we know the story is true? They used a "Lie Detector" test (called ERASER metrics):

    • Sufficiency Test: "If we only use the workers mentioned in the story, can the robot still cook the meal?"
      • Result: 100% Yes. The story identified the main chefs perfectly.
    • Comprehensiveness Test: "If we remove the workers mentioned in the story, does the robot fail?"
      • Result: Only 22% Yes. This is the big surprise. Even if you fire the 6 main workers, the robot can still cook the meal, just a little worse.
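Step 1's "surgery" can be sketched with a toy stand-in (invented here for illustration; the paper works on GPT-2 Small's real attention heads, not this three-number model). The idea: run the model on a clean prompt and a corrupted one (names swapped), then patch one head's clean activation into the corrupted run and measure how much of the original answer that single head restores.

```python
# Toy sketch of activation patching. The "model" is a crude stand-in:
# its output is just the sum of per-head activation contributions.

def run_model(head_activations):
    # Output logit = sum of per-head contributions (hypothetical toy model).
    return sum(head_activations)

clean = [0.9, 0.1, 0.05]      # per-head activations on the clean prompt
corrupted = [0.2, 0.1, 0.05]  # same heads on the name-swapped prompt

clean_out = run_model(clean)
corrupt_out = run_model(corrupted)

def patching_effect(head_idx):
    """Patch one clean head activation into the corrupted run and
    report what fraction of the clean-vs-corrupted gap it restores."""
    patched = list(corrupted)
    patched[head_idx] = clean[head_idx]
    return (run_model(patched) - corrupt_out) / (clean_out - corrupt_out)

effects = {i: patching_effect(i) for i in range(3)}
# Head 0 restores the full gap (effect 1.0) -> causally important.
# Heads 1 and 2 restore nothing (effect 0.0) -> bystanders.
```

This is exactly the "swap the names and watch which worker gets confused" experiment: a large patching effect means the head is doing real causal work, not just glancing at the tomato.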

3. The Big Surprise: The "Backup Plan"

The most interesting finding is the gap between Sufficiency (100%) and Comprehensiveness (22%).

Think of it like a football team.

  • The explanation says: "The quarterback (Worker A) threw the winning pass." (True! He did it 100% of the time).
  • But when you ask, "What if the quarterback gets injured?" the team doesn't collapse. The backup quarterback, the running back, and the receiver all step up and still manage to score, just not as elegantly.

The robot has distributed backup mechanisms. It doesn't rely on just one path; it has redundant paths. This makes the robot very robust (hard to break), but it makes it very hard to explain simply because there isn't just one reason it got the answer right.
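The sufficiency/comprehensiveness gap falls straight out of redundancy, which a toy sketch makes concrete (the head sets and scoring function below are invented for illustration, not the paper's code):

```python
# Toy sketch of ERASER-style sufficiency vs. comprehensiveness on a
# redundant circuit: several disjoint head subsets can each produce
# the answer on their own (hypothetical "backup paths").

ALL_HEADS = {0, 1, 2, 3, 4, 5, 6, 7}
EXPLAINED = {0, 1}  # the heads named in the explanation

def model_score(active_heads):
    backups = [{0, 1}, {2, 3}, {4, 5}]  # any one complete path suffices
    return 1.0 if any(path <= active_heads for path in backups) else 0.0

full = model_score(ALL_HEADS)

# Sufficiency: keep ONLY the explained heads -- does the answer survive?
sufficiency = model_score(EXPLAINED) / full

# Comprehensiveness: REMOVE the explained heads -- does the answer drop?
comprehensiveness = full - model_score(ALL_HEADS - EXPLAINED)
# Backup paths {2, 3} and {4, 5} keep the model working, so removing the
# explained heads barely hurts: high sufficiency, low comprehensiveness.
```

The explained heads really are enough on their own (sufficiency 1.0), yet firing them costs nothing (comprehensiveness 0.0), because the backup quarterbacks step up, just as in the football analogy.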

4. The Warning: Confidence ≠ Truth

The study found something scary for users: The robot's confidence is a lie.

  • If the robot says, "I am 99% sure of this answer," you might think the explanation is solid.
  • The study found zero correlation between how confident the robot is and how accurate the explanation is.
  • Analogy: It's like a student who guesses the answer on a test with 100% confidence but actually got it right by luck. You can't trust their "I know this!" feeling to tell you if their reasoning is sound.

Summary

This paper built a tool that:

  1. Finds the real "workers" inside the AI using surgery-like experiments.
  2. Writes a human-readable story about those workers using a smart translator.
  3. Warns us that even when the AI is 100% right, the story we tell about why it's right might only capture a small part of the truth because the AI has secret backup plans.

The Takeaway: We can finally explain how AI works, but we must be careful not to oversimplify. The AI is smarter and more complex than a single sentence can describe, and its confidence doesn't guarantee its reasoning is simple.