Represented Is Not Computed: A Causal Test of Candidate… — Plain-Language Explanation

Imagine you have a very smart, but mysterious, robot chef. You give it a recipe card with three ingredients: a big number ( $N$ ), a base number ( $B$ ), and a specific "slot" number ( $D$ ). The chef's job is to figure out a specific digit from the big number, but only after converting it into the "base" language.

For example, if the big number is 255, the base is 16, and you ask for the 0th slot, the chef has to work out that 255 is written "FF" in base 16, so the digit in the 0th (rightmost) slot is F, which is 15.

The researchers in this paper wanted to peek inside the chef's kitchen to see how it solves this puzzle. They had a very specific theory about how the chef should be thinking, and they wanted to see if that's actually what was happening.

Here is the story of what they found, broken down into simple steps:

1. The Chef is a Genius at the Task

First, they checked if the robot could actually do the job. They trained it on thousands of examples and then tested it on new, unseen numbers.

The Result: The robot was nearly perfect (99.83% accuracy). It knew exactly what answer to give. So, we know it can solve the problem.

2. The "Official Recipe" Theory (What we thought was happening)

The math problem has a clear, step-by-step solution, like a strict recipe in a cookbook. To get the answer, you theoretically need to follow these steps:

Calculate a helper number ( $B^D$ ).
Divide the big number by that helper.
Round down.
Take the remainder.
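
The four steps above can be sketched in Python. This is only an illustration of the cookbook recipe, not of what happens inside the robot; the function name `staged_digit` and the dictionary keys are made up for this example.

```python
def staged_digit(n: int, base: int, slot: int) -> dict:
    """Follow the "Official Recipe" step by step and keep every intermediate."""
    helper = base ** slot      # step 1: the helper number B^D
    quotient = n / helper      # step 2: divide the big number by the helper
    floored = n // helper      # step 3: round down (integer division avoids float error)
    digit = floored % base     # step 4: take the remainder modulo the base
    return {"helper": helper, "quotient": quotient, "floored": floored, "digit": digit}


print(staged_digit(255, 16, 0)["digit"])  # 255 is "FF" in base 16, so this prints 15
```

The intermediate values kept in the dictionary (`helper`, `quotient`, `floored`) are the kind of "bowls on the counter" that a probe would go looking for.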

The researchers thought the robot was probably following this Official Recipe. They used a tool called a "Linear Probe" (think of it like a head chef peeking into the kitchen) to scan the robot's workspace.

The Finding: The probe looked inside and saw that the robot's kitchen did contain these exact numbers. The "helper number" and the "rounded-down number" were clearly visible sitting in bowls on the counter, just like intermediate dishes in a complex cooking process.
The Trap: Because they found these ingredients on the counter, they assumed the chef was using them to cook the dish. It looked like the robot was following the recipe perfectly.

3. The Reality Check (The Causal Test)

This is where the paper gets interesting. Just because the chef has the ingredients on the counter doesn't mean it's using them to make the final decision.

To find out what the chef was actually using, the researchers performed a "kitchen audit" using two methods:

Method A: The Closed Station (Ablation)
They tried to "close" specific prep stations in the kitchen that were supposed to pass the "helper numbers" to the final dish.
- The Result: Surprisingly, closing the stations that held the complex math didn't hurt the chef much. But when they closed the very first station where the chef looked at the "slot number" ( $D$ ), the chef immediately forgot how to answer. It didn't matter if the complex math ingredients were sitting on the counter or not; the chef ignored them.

Method B: The Swap (Patching)
They took a "guest" chef working from a recipe card that differed from the original in just one ingredient: the slot number ( $D$ ), the big number ( $N$ ), or the base ( $B$ ). They then swapped the guest's signals at the slot-number station into the original robot's kitchen.
- The Result: When the guest differed in the slot number ( $D$ ), the original robot suddenly gave the guest chef's answer. When the guest differed only in the big number ( $N$ ) or the base ( $B$ ), the swap changed nothing.
- The Conclusion: That station was not passing along the results of the complex math (the Official Recipe), which depend on $N$ and $B$ as well. It was passing along the "slot number" ( $D$ ) itself.
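
The logic of the swap can be sketched with a toy stand-in. Everything here is hypothetical: `toy_model` is built on purpose to read three separate signals, so the sketch only illustrates what the swap measures and is not evidence about the real network.

```python
def toy_model(n_signal: int, base_signal: int, slot_signal: int) -> int:
    """Hypothetical stand-in for the robot: it reads three separate signals."""
    return (n_signal // base_signal ** slot_signal) % base_signal


def patch_slot_signal(source: tuple, donor: tuple) -> int:
    """Run the source recipe card, but overwrite its slot signal with the donor's."""
    n, base, _ = source
    _, _, donor_slot = donor
    return toy_model(n, base, donor_slot)


source = (255, 16, 0)                           # unpatched answer: 15
print(patch_slot_signal(source, (255, 16, 2)))  # donor differs only in D: prints 0, the donor's answer
print(patch_slot_signal(source, (100, 16, 0)))  # donor differs only in N: prints 15, the source's answer
```

If the swapped signal had instead carried a finished intermediate such as the rounded-down quotient, a donor with a different big number would also have changed the output.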

4. The "Hidden Path" Discovery

Finally, they mapped out the actual path the information took through the kitchen.

What they expected: A single, organized assembly line where $N$ , $B$ , and $D$ all meet, get mixed together into a complex math formula, and then produce the answer.
What they found: The robot has three separate, small prep stations. One station handles the big number, one handles the base, and one handles the slot number. These stations work independently for almost the entire cooking process. They only combine their ingredients at the very last second, right at plating, just before the answer is written down. The robot didn't build the complex "helper numbers" and pass them along; it just kept the ingredients separate until the very end.

The Big Lesson: "Represented" is not "Computed"

The paper's main title says it all: "Represented Is Not Computed."

Represented: The robot's kitchen contained the complex math numbers. If you looked at the counter, you could see them clearly (like finding a recipe card on the counter).
Computed: The robot did not use those numbers to cook the dish. It took a shortcut.

The Analogy:
Imagine the chef has the official recipe card sitting on the counter, with every step clearly written out (the "represented" math).

The Probe: You walk into the kitchen and see the recipe on the counter and say, "Aha! You're using the recipe!"
The Reality: The chef actually memorized the dish years ago and is cooking on instinct. The recipe is sitting there, but the chef never looks at it. If you took the recipe away, the dish would still come out the same. If you swapped it for a different recipe, the chef wouldn't notice.

Summary:
The robot solved the math problem nearly perfectly, and it even "thought" about the math steps in a way that looked like it was following the rules. But when they tested what actually caused the robot to give the answer, they found that the route they examined was not passing along the complex steps. It carried only which "slot" was asked for, while the big number and the base traveled separately and were combined at the very end.

The paper warns us: Just because we can find a piece of information inside a neural network (like finding a recipe on the counter), it doesn't mean the network is actually using that information to make decisions. We need to test the cause, not just look at the contents.

Technical Summary: Represented Is Not Computed

Problem Statement
Mechanistic interpretability seeks to understand how neural networks integrate task-relevant components to solve structured prompts. In natural language and vision, the internal relations required for this integration are rarely specified precisely enough to define a candidate internal algorithm. This paper addresses this gap by using arithmetic, specifically base-digit extraction, as a cleaner setting where the input-output function is known and candidate algorithms can be explicitly defined. The task involves a Transformer receiving a decimal number $N$ , a base $B$ , and a digit position $D$ , and predicting the coefficient of $B^D$ in the base- $B$ expansion of $N$ . The closed-form solution is $y = \lfloor N/B^D \rfloor \bmod B$ .
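
As a worked instance of the closed form (the numbers are chosen here for illustration and do not come from the paper), take $N = 100$ , $B = 7$ , $D = 1$ :

$$
B^D = 7, \qquad \frac{N}{B^D} = \frac{100}{7} \approx 14.29, \qquad \left\lfloor \frac{N}{B^D} \right\rfloor = 14, \qquad y = 14 \bmod 7 = 0
$$

This agrees with the base-7 expansion $100 = 2 \cdot 7^2 + 0 \cdot 7^1 + 2 \cdot 7^0$ , whose coefficient of $7^1$ is 0.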

The central question is whether the model implements a "staged" algorithmic hypothesis suggested by this closed-form solution: computing $B^D$ , then $N/B^D$ , taking the floor, and finally reducing modulo $B$ . Specifically, the authors investigate three distinct questions often conflated in interpretability: (1) Can the model solve the task? (2) Are the quantities from the closed-form solution represented within the network? (3) Are those quantities the causal intermediates used to produce the answer?

Methodology
The authors trained 10-layer decoder-only Transformers from scratch on the base-digit extraction task using three different random seeds. The training data included $N \in \{0, \dots, 999\}$ , $B \in \{2, \dots, 30\}$ , and various digit positions $D$ . The models were evaluated autoregressively on held-out number–base intersections to ensure robust generalization rather than memorization.

To analyze the internal mechanisms, the study employed a multi-stage approach:

Linear Probing: Linear readouts were trained on frozen activations to test if closed-form quantities ( $B^D$ , $N/B^D$ , $\lfloor N/B^D \rfloor$ , and the final answer) were linearly decodable from residual streams at various layers.
Attention Ablation: The authors performed targeted ablations on attention routes from the $D$ -token stream ( $D_{ones}$ ) to the output streams ( $O[0]$ and $O[1]$ ). They measured performance drops when masking attention from specific layers (both shallow-to-deep and deep-to-shallow sweeps) to identify causal dependencies.
Activation Patching: To determine what information is carried by the causal routes, the authors performed key/value patching. They substituted $D_{ones}$ key/value vectors from a "donor" example into a "source" example. By varying whether the donor differed from the source in $N$ , $B$ , or $D$ , they tested whether the route carries information specific to the digit position or the broader arithmetic intermediates.
Sparse Circuit Search: A greedy right-to-left search was conducted to identify a minimal set of attention routes sufficient for task performance, revealing the overall routing structure of the model.
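
The attention ablation described above can be pictured with a minimal sketch. The summary does not say how the masking was implemented; setting a pre-softmax score to negative infinity is one common way to cut an attention edge, and it is assumed here. The function name, the key ordering, and the scores are all hypothetical.

```python
import math


def attention_weights(scores: list[float], masked_keys: frozenset[int] = frozenset()) -> list[float]:
    """Softmax over key scores, with masked keys forced to zero weight.

    Assumes at least one key is left unmasked.
    """
    # A score of -inf becomes exp(-inf) = 0 after the softmax, which removes that edge.
    masked = [(-math.inf if i in masked_keys else s) for i, s in enumerate(scores)]
    peak = max(masked)                         # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores from one output-position query to keys ordered [N, B, D_ones].
scores = [0.5, 0.1, 2.0]
intact = attention_weights(scores)
ablated = attention_weights(scores, masked_keys=frozenset({2}))  # cut the D_ones -> O edge
print(ablated[2])  # 0.0
```

Comparing task accuracy with `intact` versus `ablated` weights, one layer range at a time, is the shape of the shallow-to-deep and deep-to-shallow sweeps described in the list above.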

Key Results

Task Competence: The models achieved near-perfect performance on held-out test sets, with a mean exact-answer accuracy of 99.83% across three seeds. This establishes that the models reliably learned the task mapping.
Representation (Probing): Linear probes showed that the closed-form intermediates were decodable from the residual streams in an order that mirrors the staged algorithm: $B^D$ became linearly decodable at shallower (earlier) layers than $N/B^D$ (and quotient-like quantities), which in turn appeared earlier than the final answer. The fact that the layer-wise order of appearance matched the algorithmic order of operations was the central reason the staged hypothesis appeared representationally plausible. (Some of this decodability was present even at initialization, indicating it is partly an artifact of architecture/data geometry rather than purely learned computation.)
Causal Use (Ablation & Patching): Despite the strong representation of staged intermediates, causal tests revealed a different mechanism.
- Early Sensitivity: Output behavior was most sensitive to early $D_{ones} \to O$ communication (specifically layers 0–1). Masking these early layers caused a drastic performance drop, whereas masking later layers had minimal effect.
- Selective Information Transfer: Patching experiments showed that the $D_{ones} \to O$ route carries behaviorally effective information that is highly selective for $D$ . When the donor differed only in $N$ or $B$ , the patched model's output remained unchanged (matching the source). When the donor differed only in $D$ , the output flipped to match the donor.
- Factorized Routing: The sparse circuit search revealed that $N$ , $B$ , and $D$ are routed through mostly separate local scaffolds that converge late at the output streams. There is no evidence of a single, unified closed-form intermediate being transmitted from the prompt side to the output.

Key Contributions and Claims
The paper's primary contribution is a dissociative observation: the model represents the quantities that make the staged algorithmic solution plausible (they are linearly decodable), yet the identified causal route does not transmit these quantities to the output.

The authors claim that "represented is not computed." In this context, "computed" refers to the causal intermediates actually used to form the answer. The study demonstrates that:

Probes can diverge from causal reality: Linear probes successfully identified the presence of algorithmic intermediates, but causal interventions (ablation and patching) showed that the identified causal route does not transmit these intermediates to the output.
Decodability $\neq$ Causal Usage: High decodability of a quantity does not guarantee it is a learned causal intermediate; it may reflect accessibility supplied by the architecture or tokenization that is later sculpted by training but not utilized in the specific causal path to the output.
Mechanism of Base-Digit Extraction: The model solves the task by routing $N$ , $B$ , and $D$ through separate pathways and integrating them late, relying on early $D$ -selective communication rather than a staged transmission of quotient-like values.

Significance
The paper serves as a direct, testable warning against relying solely on linear probes for mechanistic interpretation. Even in a setting with an explicit, known algorithm and near-perfect task performance, the internal causal mechanism can differ significantly from the intuitive algorithmic hypothesis. The authors argue that mechanistic explanation requires demonstrating how quantities are used causally, not just that they are present. This work complements existing research on Transformer circuits and arithmetic mechanisms by showing that heuristic or non-algorithmic routes can solve tasks where clean algorithmic intermediates are clearly representable but not causally utilized.

Represented Is Not Computed: A Causal Test of Candidate Algorithmic Intermediates in a Transformer