Grokking as a Falsifiable Finite-Size Transition

The Big Picture: The "Aha!" Moment

Imagine you are teaching a robot to do math (specifically, modular arithmetic, which works like a clock whose hours wrap back to zero after a fixed number).

Phase 1 (Memorization): The robot starts by just memorizing the answers. It gets perfect scores on the practice problems it sees, but it's just rote learning. It's like a student who memorized the answer key but doesn't understand the math.
Phase 2 (Grokking): Suddenly, after a long time of seemingly no progress, the robot has an "Aha!" moment. It stops memorizing and starts understanding the pattern. It can now solve problems it has never seen before.

This sudden shift is called Grokking.

The Problem: Is it Magic or Physics?

Scientists have been arguing about what Grokking actually is.

Some say it's just a smooth curve: The robot gets a little better, then a little better, until it finally clicks.
Others say it's a Phase Transition: Like water suddenly turning into ice. It's a sharp, dramatic switch where the system completely reorganizes itself.

The problem is that most people just looked at one robot doing the task and said, "Wow, that curve looks sharp, it must be a phase transition!" But in science, you can't just look at one thing and declare it a law of physics. You need to prove it holds up under different conditions.

The Solution: The "Crowd Control" Experiment

The authors of this paper decided to treat the robot learning process like a physics experiment. They wanted to test whether Grokking is a real "phase transition" or just a smooth slide, using checks that could have come out against them.

To do this, they used two clever tricks:

1. Changing the Size of the Puzzle (The "Group Order")

In physics, to test whether something is a phase transition, you have to change the size of the system.

The Analogy: Imagine trying to figure out if a crowd is acting like a single organism. If you only look at 5 people, it's hard to tell. If you look at 500, it's easier.
The Experiment: Instead of changing the robot's brain size (which would be messy), they changed the size of the math puzzle. They used clocks with different prime numbers of hours, starting at 53 and going up into the hundreds.
The Result: As the puzzle got larger (more hours on the clock), the moment the robot "clicked" became sharper and sharper. It wasn't a blurry slide; it looked more and more like a cliff edge. This suggests a real transition is happening.

2. Looking Inside the Robot's Brain (The "Order Parameter")

Usually, we judge learning by looking at the robot's test score. But the authors said, "No, that's just the surface. We need to look at the internal geometry of the robot's brain."

The Analogy: Imagine a chaotic party where everyone is talking over each other (memorization). Suddenly, everyone stops, forms a perfect circle, and starts singing in harmony (generalization). The score (how loud they are) might not change much, but the structure of the room has completely changed.
The Tool: They invented a metric called HTC (Head-Tail Contrast). It measures how "organized" the robot's internal thoughts are.
- Low HTC: The brain is a messy soup of random numbers (memorizing).
- High HTC: The brain has organized itself into a clean, efficient structure (understanding).
The Result: When they tracked this "internal organization," they saw it jump from messy to organized at the exact same moment the robot started solving new problems.

The "Crossing" Proof (The Smoking Gun)

In physics, there is a famous test called a Binder Crossing.

The Analogy: Imagine you have 10 different sized buckets of water. You heat them all up. If they are just getting warmer smoothly, their temperature curves will never touch. But if they are all freezing into ice at the exact same temperature, the lines on your graph will cross at a single point.
The Result: The authors plotted their data for all the different puzzle sizes. The lines crossed at a specific point. This is the "smoking gun." It is strong evidence that the system is undergoing a genuine, sharp transition, not just a smooth slide.

The Verdict

The paper concludes that Grokking on this task behaves like a genuine phase transition, in the same family as water freezing into ice, rather than like a gradual improvement.

It is a sudden reorganization of the robot's internal brain structure.
It is not just a smooth improvement; it is a sharp "cliff" where the system flips from one state to another.
They could not determine what kind of transition it is (a continuous one, where the change builds up gradually at the critical point, or a first-order one, where the system jumps abruptly), but their tests favor a real transition over a smooth crossover.

Why This Matters

Before this paper, saying "Grokking is a phase transition" was mostly a cool metaphor. Now, the authors have turned that metaphor into a testable claim with a checklist of diagnostics, any of which could have failed. They showed us how to measure the "size" of a learning problem and the "structure" of a brain to test when a system truly "gets it."

In short: They took a mysterious "Aha!" moment in AI and showed that it passes the same tests physicists use to identify phase transitions, using math puzzles of different sizes and a special way of looking inside the robot's brain.

1. Problem Statement

Grokking is a phenomenon in neural networks where a model rapidly memorizes training data (often achieving near-perfect training accuracy) but only generalizes to test data after a significantly delayed period of continued optimization. While this phenomenon is frequently described using the language of phase transitions (analogous to physical systems shifting from a disordered to an ordered state), existing literature lacks falsifiable finite-size diagnostics.

Current claims of phase transitions in machine learning often rely on:

Fitting sigmoidal curves to a single training run.
Observing sharp transitions at a single system size.
Using readout metrics (e.g., test accuracy) that do not probe the internal geometry of representations.

Without a controlled extensive size variable and a valid order parameter, the "phase transition" claim remains a descriptive analogy rather than a diagnostic, quantitative claim. The paper aims to supply these missing inputs to rigorously test whether grokking is a genuine finite-size transition or merely a smooth crossover.

2. Methodology

The authors apply the Finite-Size Scaling (FSS) protocol from condensed matter physics to the canonical modular arithmetic task (specifically modular addition).
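As a concrete picture of the task, the following is a minimal sketch of a modular addition dataset for a given group order $p$ and training fraction $f$. The function name `modular_addition_split`, the uniform random split, and the integer-pair encoding are assumptions made for illustration; the summary does not specify how the authors tokenize inputs or draw the split.

```python
import numpy as np

def modular_addition_split(p, train_fraction, seed=0):
    """Enumerate all (a, b) pairs in Z_p and split them into train/test sets.

    Hypothetical helper: the uniform random split is an assumption.
    """
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    inputs = np.stack([a.ravel(), b.ravel()], axis=1)   # shape (p*p, 2)
    labels = (inputs[:, 0] + inputs[:, 1]) % p          # (a + b) mod p

    rng = np.random.default_rng(seed)
    order = rng.permutation(p * p)                      # shuffle all p^2 pairs
    n_train = int(round(train_fraction * p * p))
    train_idx, test_idx = order[:n_train], order[n_train:]
    return (inputs[train_idx], labels[train_idx]), (inputs[test_idx], labels[test_idx])

(train_x, train_y), (test_x, test_y) = modular_addition_split(p=53, train_fraction=0.4)
```

Because the full table has $p^2$ entries, increasing $p$ enlarges the task while the split rule stays the same, which is the sense in which $p$ can act as a size variable.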

A. Key Identifications

To apply FSS, two critical components must be defined:

Extensive Size Variable ( $p$ ):
- Instead of varying model width, depth, or parameter count (which changes the model class), the authors use the group order $p$ of the cyclic group $\mathbb{Z}_p$ as the size variable.
- Varying $p$ enlarges the algebraic task family while keeping the architecture (Transformer), optimizer, and hyperparameters fixed. This isolates the "system size" as the complexity of the task space (number of distinct group elements).
Order Parameter ( $m_{HTC}$ ):
- Instead of using test accuracy, the authors define a Spectral Head-Tail Contrast (HTC) based on the internal geometry of the hidden representations.
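- Definition: $m_{HTC}(t) = \log \left( \frac{\sum_{j=1}^5 p_j(t)}{\sum_{j=6}^d p_j(t)} \right)$ , where $p_j(t)$ are the eigenvalues of the covariance matrix of hidden representations, sorted in descending order and normalized to sum to one. The subscripted $p_j$ is a spectral weight and should not be confused with the group order $p$ .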
- Rationale: This metric measures whether the spectral mass of the representation is concentrated in a few leading modes (ordered/generalizing) or diffuse across the bulk (disordered/memorizing). It is basis-agnostic and sensitive to the reorganization of internal geometry.
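The definition above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: the function name `head_tail_contrast`, the mean-centering of the representations, and the small `eps` guard against an empty tail are all choices made here, and the summary does not say which layer's representations are used.

```python
import numpy as np

def head_tail_contrast(hidden, n_head=5, eps=1e-12):
    """Log ratio of spectral mass in the top n_head modes to the remaining modes.

    hidden: array of shape (n_samples, d) holding hidden representations.
    eps is an assumed numerical guard, not part of the stated definition.
    """
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (hidden.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # eigvalsh is ascending; reverse it
    eigvals = np.clip(eigvals, 0.0, None)       # drop tiny negative round-off
    weights = eigvals / eigvals.sum()           # normalized spectrum p_j
    head = weights[:n_head].sum()
    tail = weights[n_head:].sum()
    return float(np.log((head + eps) / (tail + eps)))

# Synthetic illustration only: a diffuse cloud versus a nearly rank-5 cloud.
rng = np.random.default_rng(0)
diffuse = rng.standard_normal((2000, 64))
concentrated = rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 64))
concentrated += 0.01 * rng.standard_normal((2000, 64))
htc_diffuse = head_tail_contrast(diffuse)
htc_concentrated = head_tail_contrast(concentrated)
```

On this synthetic data the diffuse cloud gives a negative value and the nearly rank-5 cloud a positive one, matching the low-HTC and high-HTC regimes described in the rationale. Because only eigenvalues enter, rotating the hidden basis leaves the result unchanged.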

B. Experimental Protocol

Model: Fixed Transformer architecture ( $d_{model}=128$ , 2 encoder layers, 4 heads).
Task: Modular addition ( $\mathbb{Z}_p$ ).
Sweeps:
- Coarse Grid: 13 primes ( $p \in [53, 251]$ ), 10 training fractions ( $f$ ), 50 seeds per condition.
- Near-Critical Audit: 6 larger primes ( $p$ up to 397) with densely spaced training fractions around the suspected transition point to stress-test the diagnostics.
Diagnostics Chain:
1. Raw Sharpening: Observing if the transition window narrows as $p$ increases.
2. Binder-like Crossing: Checking if the dimensionless cumulant $U_4$ curves for different $p$ cross at a common control parameter value ( $f_c$ ).
3. Susceptibility Comparison: Testing if the peak susceptibility ( $\chi_{max}$ ) scales as a power law (indicating a transition) or saturates (indicating a smooth crossover).
4. Transition Order Assessment: Analyzing the behavior of the Binder minimum and seed-level distributions to distinguish between continuous and first-order transitions.
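The second and third diagnostics can be sketched from seed-level values of the order parameter at one $(p, f)$ condition. The summary calls the cumulant "Binder-like" without giving its formula, so the standard form $U_4 = 1 - \langle m^4 \rangle / (3 \langle m^2 \rangle^2)$ used below is an assumption, as are the susceptibility scaling $\chi = p \cdot \mathrm{Var}(m)$ and the helper names.

```python
import numpy as np

def binder_cumulant(m):
    """Standard Binder form U4 = 1 - <m^4> / (3 <m^2>^2) over seed-level samples.

    Assumed form: the paper's exact "Binder-like" definition may differ.
    """
    m = np.asarray(m, dtype=float)
    return 1.0 - np.mean(m**4) / (3.0 * np.mean(m**2) ** 2)

def susceptibility(m, p):
    """Size-scaled seed variance chi = p * Var(m); the factor p is an assumption."""
    return p * np.var(np.asarray(m, dtype=float), ddof=1)

def crossing_point(f_grid, u4_small, u4_large):
    """Locate the first crossing of two U4 curves by linear interpolation.

    Returns the training fraction where u4_large - u4_small changes sign,
    or None if the curves do not cross on this grid.
    """
    f_grid = np.asarray(f_grid, dtype=float)
    diff = np.asarray(u4_large, dtype=float) - np.asarray(u4_small, dtype=float)
    for i in range(len(diff) - 1):
        if diff[i] == 0.0:
            return float(f_grid[i])
        if diff[i] * diff[i + 1] < 0.0:
            t = diff[i] / (diff[i] - diff[i + 1])
            return float(f_grid[i] + t * (f_grid[i + 1] - f_grid[i]))
    if diff[-1] == 0.0:
        return float(f_grid[-1])
    return None
```

In use, one would compute `binder_cumulant` at every training fraction for each prime, then call `crossing_point` on pairs of sizes; a crossing estimate that stays put as the sizes grow is what the protocol treats as a shared $f_c$, while estimates that drift or fail to exist count against the transition claim.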

3. Key Contributions

Formalization of FSS for ML: The paper establishes a rigorous, falsifiable framework for testing phase transitions in learning systems, moving beyond curve-fitting analogies.
Novel Size Variable: It identifies the group order $p$ in modular arithmetic as a valid extensive variable for finite-size scaling, preserving the task family while varying complexity.
Representation-Level Order Parameter: It introduces the Spectral Head-Tail Contrast (HTC) as a robust, basis-agnostic order parameter that captures the internal geometric reorganization of the network, distinct from readout accuracy.
Diagnostic Chain: It implements a sequential diagnostic protocol (Binder crossings, susceptibility scaling, near-critical audits) that can be rejected at any step, ensuring the claim is not just a "best fit."

4. Key Results

The study provides strong evidence that grokking in modular addition behaves as a finite-size transition rather than a smooth crossover.

Raw Sharpening: As $p$ increases, the transition from low to high spectral concentration sharpens and localizes around a common training fraction ( $f \approx 0.39$ ), consistent with finite-size scaling precursors.
Binder Crossings: The Binder-like cumulant curves ( $U_4$ ) for different primes cross at a common point ( $f_c \approx 0.39$ ) with no statistically significant drift as $1/p \to 0$ . This indicates a shared organizing boundary across system sizes.
Rejection of Smooth Crossover:
- The peak susceptibility ( $\chi_{max}$ ) was tested against two models: a power-law growth (transition) vs. a saturating form (crossover).
- The power-law model was strongly preferred, with a difference in Akaike Information Criterion ( $\Delta AIC$ ) of 16.8 in the near-critical audit. This strongly disfavors the smooth-crossover interpretation.
Transition Order (Unresolved):
- The near-critical audit revealed negative Binder minima at the largest sizes ( $U_{4,min} \approx -0.67$ ), a feature usually read as a signature of a first-order transition.
- However, seed-level distributions remained unimodal (no bimodality/coexistence), which a first-order transition would normally produce, so no definitive verdict on the transition order is possible. The authors conclude that the order is "unresolved" while the evidence for a transition, as opposed to a crossover, stands.
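The model comparison behind the $\Delta AIC$ figure can be illustrated as follows. This is a sketch under assumptions: the summary does not give the saturating functional form, so $\chi = \chi_\infty - c/p$ is a stand-in chosen here, the AIC is computed from residual sums of squares (which assumes Gaussian errors), and the data are synthetic, generated from a power law purely to exercise the code. The sizes listed are illustrative and are not the paper's grid.

```python
import numpy as np

def aic_least_squares(residuals, n_params):
    """AIC for a Gaussian least-squares fit, up to an additive constant."""
    n = len(residuals)
    rss = float(np.sum(residuals**2))
    return n * np.log(rss / n) + 2 * n_params

def fit_power_law(p, chi, exponents=np.linspace(0.0, 3.0, 601)):
    """Fit chi = A * p**b by scanning b and solving for A in closed form."""
    best = None
    for b in exponents:
        basis = p**b
        amp = np.dot(chi, basis) / np.dot(basis, basis)
        rss = np.sum((chi - amp * basis) ** 2)
        if best is None or rss < best[0]:
            best = (rss, amp, b)
    _, amp, b = best
    return chi - amp * p**b, (amp, b)

def fit_saturating(p, chi):
    """Fit chi = chi_inf - c / p (assumed crossover form) by linear least squares."""
    design = np.column_stack([np.ones_like(p), -1.0 / p])
    coef, *_ = np.linalg.lstsq(design, chi, rcond=None)
    return chi - design @ coef, tuple(coef)

# Synthetic peak susceptibilities, NOT the paper's measurements.
rng = np.random.default_rng(0)
p = np.array([53, 97, 151, 199, 251, 307, 397], dtype=float)
chi = 0.5 * p**1.2 * (1.0 + 0.02 * rng.standard_normal(p.size))

res_pow, (amp, b) = fit_power_law(p, chi)
res_sat, _ = fit_saturating(p, chi)
# Positive values favor the power law; both models have two free parameters.
delta_aic = aic_least_squares(res_sat, 2) - aic_least_squares(res_pow, 2)
```

Both fits are done on the same untransformed $\chi_{max}$ values so that their AIC scores are comparable; fitting one model in log space and the other in linear space would make the difference meaningless.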

5. Significance

Quantitative Rigor: The paper transforms the metaphor of "phase transitions" in deep learning into a quantitative, testable scientific claim. It demonstrates that grokking admits finite-size organization that cannot be explained by simple smooth crossovers.
Methodological Shift: It argues that future claims of phase transitions in ML must rely on admissible size variables, representation-level observables, and explicit failure criteria (like Binder crossings), rather than just observing sharp curves in single runs.
Theoretical Implications: The results suggest that the delayed generalization in grokking is driven by a collective reorganization of the representation space, organized by finite-size logic similar to many-body physical systems.
Scope: While the specific transition order (continuous vs. first-order) remains open, the study successfully isolates the phenomenon as a genuine transition-like event, providing a foundation for future work on universality classes and asymptotic exponents in learning systems.