Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis

Imagine a giant, super-smart library robot (a Transformer model) that can do two very different things:

Recall: It can instantly recite facts it memorized, like "What is the capital of France?"
Reasoning: It can solve a brand-new puzzle it has never seen before, like figuring out a secret code in a language it doesn't know.

For a long time, scientists wondered: Does this robot use the same "brain cells" to do both tasks, or does it have separate departments for each?

This paper is like a detective story where the researchers put the robot under a microscope to find out. Here is what they discovered, explained simply:

1. The Robot Has a "Factory Floor" Layout

Think of the robot's brain as a multi-story factory with 28 floors (layers).

The Bottom Floors (Early Layers): These are the Librarians. When you ask a simple fact, these floors do the heavy lifting. They are great at grabbing stored information quickly.
The Top Floors (Deep Layers): These are the Architects. When you ask a complex puzzle, the information travels up to these floors. Here, the robot connects the dots, follows rules, and builds new logic.
The Middle Floors: These are a bit of a mix, where the Librarians and Architects sometimes chat.

The Discovery: The researchers found that the robot doesn't mix these jobs randomly. It has a clear hierarchy: Bottom = Memory, Top = Thinking.

2. The Specialized Workers (Heads and Neurons)

Inside each floor, there are thousands of tiny workers (called attention heads and neurons).

Some workers are Fact-Checkers. They only wake up when you ask about history or geography.
Some workers are Logic-Engineers. They only wake up when you ask the robot to solve a riddle or translate a strange language.

The researchers found that about 74% of the attention heads, and roughly a third of the neurons, are specialists. They don't try to do everything; they have a specific job description. It's like a hospital where the surgeons don't also do the accounting, and the accountants don't perform heart surgery.

3. The "Surgery" Experiment (The Proof)

To prove this wasn't just a guess, the researchers performed "brain surgery" on the robot. They temporarily turned off specific parts of the brain to see what happened.

Scenario A: Turning off the "Librarian" floors.
- Result: The robot got noticeably worse at facts (like the capital of France). But it could still solve the logic puzzles about as well as before!
- Analogy: It's like taking away a person's memory of names, but they can still solve a math problem.
Scenario B: Turning off the "Architect" floors.
- Result: The robot could still recite facts about as well as before. But when asked to solve a puzzle, it got confused and failed more often.
- Analogy: It's like a person who knows all the rules of chess but can't actually play the game because they can't plan ahead.

Why Does This Matter?

This is a big deal for making AI trustworthy.

Fewer "Hallucinations": If we know which part of the brain handles facts, we know where to look when it gets history wrong.
Better Reasoning: If we know which part handles logic, we can train that specific part to get smarter at solving problems without messing up its memory.
Transparency: It suggests that AI isn't just a "black box" guessing randomly. It has a structured, logical architecture that we can understand and map out.

The Bottom Line

The paper suggests that AI models are not just one big blob of intelligence. At least in the model studied, remembering and thinking rely on largely separate circuits. By understanding this separation, we can build smarter, safer, and more reliable AI systems in the future.

1. Problem Statement

Large Language Models (LLMs) excel at two distinct cognitive capabilities: recall (retrieving memorized facts) and reasoning (performing multi-step logical inference). However, it remains unclear whether these functions rely on overlapping or separable internal circuits within the transformer architecture.

The Gap: While mechanistic interpretability has revealed modular subsystems (e.g., induction heads, MLP key-value encoders), it is unknown if these modules correspond to functional divisions like "memory retrieval" vs. "inference."
The Risk: Without understanding these mechanisms, it is difficult to distinguish between genuine reasoning and "specification gaming" or hallucination, hindering the development of trustworthy and scientifically grounded AI.
Objective: To causally identify and isolate the specific layers, attention heads, and neurons responsible for recall versus reasoning, moving beyond correlation to causal evidence.

2. Methodology

The authors employed a mechanistic interpretability pipeline using the Qwen2.5-7B-Instruct model. The approach involved five specific hypotheses (H1–H5) regarding layer, head, and neuron specialization.

A. Dataset Design

To control for confounding variables, the authors created a hybrid dataset of 60 tasks (30 recall, 30 reasoning):

Reasoning Tasks: Sourced from the International Linguistic Olympiad (IOL) and related benchmarks (e.g., Lingoly). These require rule induction and compositional inference on low-resource languages (e.g., Chickasaw), ensuring the model cannot rely on pre-trained factual memory.
Recall Tasks: Synthetic counterfactual queries derived from factual triples (Country-Capital-Continent). These test direct factual retrieval while preserving the syntactic structure of the reasoning tasks to ensure a fair comparison.

B. Analysis Pipeline

Activation Tracing: The model was run through the nnsight framework to capture hidden states, attention distributions, and MLP outputs for all 28 layers.
Metric Computation:
- Layer-wise: Hidden-state norms, attention entropy, and MLP magnitude.
- Head/Neuron-level: Task-specific contribution scores and causal relevance rankings.
Statistical Validation:
- Used Mann–Whitney U tests for non-parametric comparison of activation distributions.
- Applied FDR correction (α=0.01) for layers/heads and Bonferroni correction (α=0.0001) for neurons.
- Validated stability via 5-fold cross-validation.
Causal Intervention (The Core Test):
- Activation Patching/Ablation: Selectively zeroing or replacing activations in identified "recall" or "reasoning" circuits to observe the impact on task performance.
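
The layer-wise metrics listed above can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, plain Python lists stand in for the tensors captured during tracing, and entropy is measured in nats as an assumption.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution that sums to 1."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def l2_norm(vector):
    """Euclidean norm of one hidden-state (or MLP output) vector."""
    return math.sqrt(sum(x * x for x in vector))


def layer_metrics(hidden_states, attention_rows):
    """Average both metrics over the tokens of a single layer.

    hidden_states: list of per-token vectors (hypothetical layout).
    attention_rows: list of per-query attention distributions.
    """
    mean_norm = sum(l2_norm(h) for h in hidden_states) / len(hidden_states)
    mean_entropy = sum(attention_entropy(a) for a in attention_rows) / len(attention_rows)
    return {"hidden_norm": mean_norm, "attention_entropy": mean_entropy}


# Toy data: a head that locks onto one token versus one that attends evenly.
focused = [0.97, 0.01, 0.01, 0.01]
diffuse = [0.25, 0.25, 0.25, 0.25]
print(attention_entropy(focused) < attention_entropy(diffuse))  # True
```

Low entropy corresponds to narrowly focused attention, which is the pattern the summary associates with recall in early layers.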

3. Key Findings & Results

A. Layer Specialization (H1)

The study identified a consistent hierarchical stratification across the 28 layers:

Early Layers (0–6): Primarily support factual recall. Characterized by lower entropy and higher activation magnitudes focused on entity tokens.
Middle Layers (7–16): Exhibit mixed activations.
Deep Layers (17–27): Concentrate computations for multi-step inference and compositional reasoning.
Validation: This hierarchy remained stable under cross-validation and normalization changes.

B. Attention Head Specialization (H2)

74% of attention heads showed significant task preferences.
Recall Heads: Cluster in shallower layers; focus narrowly on entity tokens.
Reasoning Heads: Emerge in deeper layers; attend broadly to relational and structural markers.
Selectivity: Many heads exhibit "near-binary" selectivity, functioning almost exclusively for one task type.
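
A figure such as "74% of heads show significant task preferences" could be obtained by testing each head separately and then correcting for multiple comparisons. The sketch below is an illustration under assumptions, not the authors' implementation: it uses a normal approximation to the Mann–Whitney U test (no tie or continuity correction), the Benjamini–Hochberg procedure as the FDR method, and hypothetical per-task head scores.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_p(x, y):
    """Two-sided p-value from the normal approximation to the U statistic."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values, alpha):
    """Return booleans marking which hypotheses are rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff
    return rejected


# Toy example: 30 recall and 30 reasoning scores per head (the 30/30 task split).
recall_scores = [0.1 * i for i in range(30)]
shifted = [s + 5.0 for s in recall_scores]  # a head that clearly prefers one task
p_values = [
    mann_whitney_p(recall_scores, shifted),
    mann_whitney_p(recall_scores, recall_scores),
]
print(benjamini_hochberg(p_values, alpha=0.01))  # [True, False]
```

The fraction of `True` entries across all heads would then play the role of the reported percentage.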

C. MLP Neuron Selectivity (H3)

Approximately one-third of MLP neurons met strict task-specificity criteria ($|d| > 1.0$, $p < 10^{-4}$).
Spatial Clustering:
- Layer 4: Dense hub for factual retrieval neurons.
- Layer 22: Dense hub for high-level compositional reasoning neurons.
Activation Pattern: These neurons show "near-binary" firing: highly active for their specific task and silent for the other.
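
The two-part selectivity criterion can be made concrete with a short sketch. It assumes that $d$ denotes Cohen's d with a pooled standard deviation, which the summary does not state explicitly, and it takes the p-value as an input computed elsewhere. The function names are hypothetical.

```python
import math


def cohens_d(a, b):
    """Cohen's d between two samples, using the pooled sample standard deviation.

    Assumes each sample has at least two values and non-zero pooled variance.
    """
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd


def is_task_specific(d, p_value, d_min=1.0, p_max=1e-4):
    """Apply both thresholds quoted in the text: |d| > 1.0 and p < 1e-4."""
    return abs(d) > d_min and p_value < p_max


# Toy activations of one neuron on recall versus reasoning prompts.
recall_acts = [1.0, 2.0, 3.0]
reasoning_acts = [4.0, 5.0, 6.0]
print(cohens_d(recall_acts, reasoning_acts))  # -3.0
```

The sign of `d` indicates which task the neuron prefers; only its magnitude enters the threshold.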

D. Causal Intervention Results (H5)

The most critical finding is the selective impairment observed during ablation experiments:

Disabling Recall Circuits: Reduced factual recall accuracy by 15% ± 3% while leaving reasoning performance largely unchanged (3% variation).
Disabling Reasoning Circuits: Reduced reasoning accuracy by 14% ± 4% while leaving recall performance largely unchanged.
Control: Random ablations of equal size resulted in negligible changes (~2%).
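
The logic of the ablation comparison can be sketched as follows. This is a toy illustration, not the authors' pipeline. Three things are assumptions: the helper names, the choice to draw control units from outside the targeted circuit, and the reading of the reported drops as percentage points.

```python
import random


def zero_ablate(activations, unit_indices):
    """Return a copy of the activations with the selected units set to 0.0."""
    ablated = list(activations)
    for i in unit_indices:
        ablated[i] = 0.0
    return ablated


def random_control(num_units, k, exclude, seed=0):
    """Sample k units outside the targeted circuit for a size-matched control."""
    excluded = set(exclude)
    pool = [i for i in range(num_units) if i not in excluded]
    return random.Random(seed).sample(pool, k)


def accuracy_drop(baseline_correct, ablated_correct):
    """Accuracy drop in percentage points between two lists of per-task outcomes."""
    baseline = sum(baseline_correct) / len(baseline_correct)
    ablated = sum(ablated_correct) / len(ablated_correct)
    return 100.0 * (baseline - ablated)


# Toy usage: ablate a hypothetical 3-unit "recall circuit" in a 10-unit layer.
layer_output = [0.5, 1.2, -0.3, 0.8, 0.0, 2.1, -1.4, 0.9, 0.7, -0.6]
recall_circuit = [1, 5, 6]
print(zero_ablate(layer_output, recall_circuit))
control_units = random_control(len(layer_output), len(recall_circuit), recall_circuit)
print(accuracy_drop([True, True, True, True], [True, True, False, False]))  # 50.0
```

Selective impairment then means that the drop on the targeted task is clearly larger than both the drop on the other task and the drop under the size-matched random control.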

4. Key Contributions

Causal Disentanglement: Provides what the authors present as the first causal evidence that recall and reasoning are supported by partially distinct but complementary computational circuits in transformer models, rather than a single monolithic process.
Architectural Mapping: Maps specific functional roles to architectural components:
- Recall: Early layers, shallow attention heads, and specific MLP neurons (e.g., Layer 4).
- Reasoning: Deep layers, deep attention heads, and distinct MLP neurons (e.g., Layer 22).
Methodological Framework: Demonstrates a robust pipeline combining controlled linguistic puzzles, synthetic counterfactuals, and causal patching to isolate cognitive processes.
Generalizability: Although the experiments focus on Qwen2.5-7B-Instruct, the patterns (early recall, deep reasoning) are hypothesized to hold across model families (LLaMA, Mistral), as suggested by the authors' broader analysis.

5. Significance

Trustworthiness & Safety: By isolating reasoning circuits, researchers can better distinguish between genuine logical inference and "hallucinated" recall, potentially reducing alignment faking.
Explainable AI (XAI): Moves the field from anecdotal observations to statistically rigorous, causal explanations of how models think.
Model Design: Suggests that future models could be optimized by explicitly strengthening specific layers for reasoning or recall, rather than relying on brute-force scaling.
Scientific Discovery: Enables clearer attribution of model outputs, which is crucial for using LLMs in scientific domains where the distinction between a retrieved fact and a derived conclusion is vital.

Conclusion

The paper argues that the internal mechanics of LLMs are not a "black box" of mixed signals but a hierarchically organized system in which memory retrieval and logical inference are largely, though not completely, separated. This separation allows targeted interventions that selectively impair one capability while leaving the other largely intact, offering a new pathway for building more interpretable and reliable AI systems.