The Big Picture: The "Grokking" Mystery
Imagine you are teaching a robot to do math. You show it thousands of examples. For a long time, the robot seems to be failing miserably. It gets almost everything wrong. You start to think, "This robot isn't learning anything."
Then, suddenly, after tens of thousands of examples, the robot snaps into place. It goes from 0% to 99% accuracy in a flash. This phenomenon is called "Grokking."
The big question this paper asks is: What was the robot doing during those tens of thousands of steps of apparent failure? Was it actually learning nothing? Or was it secretly building a brilliant internal understanding that it just couldn't show us yet?
The Experiment: The "Collatz" Puzzle
The researchers used a specific math puzzle based on the Collatz function (specifically, predicting the result of a single step of the rule, not the full conjecture).
- The Rule: If a number is even, divide it by 2. If it's odd, multiply by 3 and add 1.
- The Setup: They used a "Translator" robot (an Encoder-Decoder model).
- The Encoder (The Reader): Reads the number and understands its properties.
- The Decoder (The Speaker): Takes that understanding and writes down the answer.
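The rule itself fits in a few lines. Here is a minimal sketch of the one-step map the model is asked to predict (the function name is ours, not the paper's):

```python
def collatz_step(n: int) -> int:
    """One application of the Collatz rule: halve evens, triple-plus-one odds."""
    if n % 2 == 0:
        return n // 2      # even: divide by 2
    return 3 * n + 1       # odd: multiply by 3 and add 1

print(collatz_step(6))   # → 3
print(collatz_step(7))   # → 22
```

The model's job is to read `n` written out as digits and write out `collatz_step(n)` as digits.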
The Discovery: The "Shadow Knowledge" Gap
The researchers found that the robot was not failing to learn. It was failing to speak.
- The Reader (Encoder) was a genius early on: Within the first few thousand steps, the Encoder figured out the secret math rules. If you asked it, "Is this number even or odd?" it could answer correctly 99% of the time. It had the knowledge.
- The Speaker (Decoder) was stuck: Even though the Reader knew the answer, the Speaker kept guessing randomly for tens of thousands more steps.
The Analogy: Imagine a brilliant professor (the Encoder) who knows the entire history of the world. But they are stuck in a room with a nervous, stuttering student (the Decoder) who has to write the essay. The professor knows the facts, but the student is so bad at writing that the essay looks like gibberish. The "Grokking" moment happens only when the student finally learns how to listen to the professor and write the words down correctly.
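The "you can ask the Encoder if a number is even or odd" finding comes from a linear probe: fit a simple linear classifier on the Encoder's hidden states and check whether it can read off the property. Here is a toy sketch of the idea using synthetic stand-in embeddings (the random vectors and dimensions are our invention, purely for illustration; the paper probes real model activations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for encoder hidden states: 64-dim vectors in which one direction
# linearly encodes parity, plus noise (hypothetical, for illustration only).
n = np.arange(1000)
parity = n % 2                               # the "secret" property
direction = rng.normal(size=64)
X = np.outer(parity * 2 - 1, direction) + 0.3 * rng.normal(size=(1000, 64))

# Linear probe: least-squares fit of parity from the states, threshold at 0.5.
A = np.c_[X, np.ones(len(X))]                # add a bias column
w, *_ = np.linalg.lstsq(A, parity, rcond=None)
pred = (A @ w > 0.5).astype(int)
accuracy = (pred == parity).mean()
print(f"probe accuracy: {accuracy:.2%}")
```

If the probe hits near-perfect accuracy, the information is linearly present in the states, regardless of whether the Decoder can use it. That is exactly the gap the paper measures.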
The Proof: The "Organ Transplant" Test
To prove that the problem was the Speaker and not the Reader, the researchers did something crazy: they swapped parts.
- The "Fresh Speaker" Test: They took a robot that had already learned the math (a trained Encoder) and gave it a brand new, untrained Speaker.
- Result: The new Speaker learned the math 2.75 times faster than a robot starting from scratch. The "Grokking" delay vanished almost entirely.
- The "Fresh Reader" Test: They took a robot that had learned how to speak (a trained Decoder) and gave it a brand new, untrained Reader.
- Result: The robot got worse. It couldn't figure out the math at all.
Conclusion: The bottleneck wasn't learning the math; it was accessing the math to produce the answer. The delay was a communication problem, not a knowledge problem.
The Twist: The "Language" Matters
The researchers then changed the "language" the robot used to write numbers. Instead of Base 10 (our normal 0-9), they tried Base 2 (binary), Base 8, Base 12, etc.
- The Magic of Base 24: When they used Base 24, the robot learned incredibly fast and got nearly perfect scores.
- The Disaster of Base 2 (Binary): When they used Base 2, the robot completely failed. It memorized the training data, then crashed, and never recovered.
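You can see the intuition by writing the same Collatz step out in different bases. A minimal sketch (helper names are ours):

```python
def to_base(n: int, base: int) -> list[int]:
    """Most-significant-first digits of n in the given base."""
    if n == 0:
        return [0]
    digits = []
    while n:
        digits.append(n % base)
        n //= base
    return digits[::-1]

def collatz_step(n: int) -> int:
    return n // 2 if n % 2 == 0 else 3 * n + 1

n = 27  # odd, so the step gives 3*27 + 1 = 82
for base in (2, 10, 24):
    print(f"base {base}: {to_base(n, base)} -> {to_base(collatz_step(n), base)}")
```

In base 2, 27 is `[1, 1, 0, 1, 1]` and 82 is `[1, 0, 1, 0, 0, 1, 0]`: almost every digit changes and the number grows longer, because the `3n + 1` step sends carries rippling across the whole string. In base 24, 27 is `[1, 3]` and 82 is `[3, 10]`: the same step touches only a couple of digits. (This is a rough illustration of the "local vs. global" contrast, not the paper's formal analysis.)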
The Analogy: Imagine the Reader is trying to explain a recipe to the Speaker.
- In Base 24, the instructions are simple: "Take a big chunk, split it, and add a pinch." The Speaker can easily follow these simple steps.
- In Base 2, the instructions are a nightmare: "Take a tiny crumb, split it, carry a crumb to the next step, split that, carry another..." The instructions are so messy and tangled that the Speaker gets confused and gives up.
The "Base" acts like a lens. Some lenses make the math look simple and local (easy to see); others make it look complex and global (hard to see).
Why Does This Matter?
This paper changes how we think about AI learning.
- Don't give up too soon: Just because an AI looks like it's failing for a long time doesn't mean it's not learning. It might be building a complex internal map that it just hasn't figured out how to use yet.
- The "Output" is the hard part: Sometimes, the smartest part of the AI is already there, but the part that generates the answer is the weak link.
- How we format data matters: The way we represent numbers (or any data) can make a task 100x easier or impossible. It's not just about the math; it's about the "inductive bias" (the mental shortcut) provided by the format.
Summary
The paper is about a robot that learned the math rules quickly but took a very long time to show us the answer. The delay wasn't because it was stupid; it was because the "speaker" part of the robot was slow to catch up to the "thinker" part. And depending on how you asked the robot to speak (which number base you used), the task could be a breeze or a complete disaster.