Grokking as a Phase Transition between Competing Basins: a Singular Learning Theory Approach

The Big Picture: What is "Grokking"?

Imagine you are teaching a student to solve a math puzzle (specifically, modular arithmetic, like figuring out what time it is on a 12-hour clock).

Phase 1: Rote Memorization. At first, the student gets really good at the practice test. They memorize the answers to every single question they've seen. If you ask them a question they've practiced, they get it right 100% of the time. But if you give them a new question they haven't seen, they fail miserably. They are just reciting a list, not understanding the rules.
The Long Wait. You keep training them. They keep getting perfect scores on the practice test, but they still fail the new questions. It feels like they aren't learning anything new.
Phase 2: The "Aha!" Moment (Grokking). Suddenly, after a long time of seemingly no progress, the student has a breakthrough. They stop memorizing and start understanding the underlying pattern. Now, they can solve any new question instantly.

This sudden jump from "memorizing" to "understanding" is called Grokking.

The Mystery: Why does this happen?

For a long time, scientists didn't know why this sudden switch happened. They knew the student (the AI model) was stuck in a "memorization mode" and then suddenly switched to a "generalization mode," but they couldn't explain the mechanics of the switch.

This paper argues that the AI isn't just slowly getting smarter. Instead, it's like a hiker walking through a mountain range who suddenly finds a hidden valley.

The New Lens: Singular Learning Theory (SLT)

The authors use a mathematical framework called Singular Learning Theory (SLT). To understand this, imagine the "Loss Landscape" (the terrain the AI walks on) not as a smooth hill, but as a complex mountain range with two types of valleys:

The "Sharp" Valley (Memorization): Imagine a narrow, deep canyon. It's very easy to fall into this canyon and stay there. If you are in this canyon, you can perfectly match the practice test data (you have zero error). However, this canyon is very narrow. If the wind blows (new data), you fall out. This represents memorization.
The "Flat" Valley (Generalization): Imagine a wide, gentle meadow. It's harder to get into because you have to walk a specific path to find the entrance. But once you are there, the ground is flat and wide. No matter which way the wind blows, you stay safe. This represents generalization.

The Key Concept: The "Local Learning Coefficient" (LLC)

The paper uses a tool called the Local Learning Coefficient (LLC). Think of the LLC as a "Flatness Meter" that reads backwards: the lower the reading, the flatter and wider the terrain.

High LLC: The terrain is sharp and narrow (like the canyon). The AI is memorizing.
Low LLC: The terrain is flat and wide (like the meadow). The AI is generalizing.

In the world of AI, the "Bayesian" view suggests that nature (or the math of probability) prefers the flat, wide valleys. Even if the narrow canyon looks just as good initially, the wide valley is statistically more likely to be the "correct" solution in the long run.

The Story of the Paper: The Phase Transition

The authors studied a specific type of AI (a "Quadratic Network") solving math puzzles. They derived a mathematical formula to calculate the "Flatness Meter" (LLC) exactly.

Here is what they found:

The Race: During training, the AI is essentially racing between two basins (valleys).
- Basin A (Memorization): The AI falls in quickly. It has a high "Flatness Meter" reading (it's sharp). It fits the training data perfectly but is fragile.
- Basin B (Generalization): This basin is harder to reach. It has a low "Flatness Meter" reading (it's flat).
The Switch: As the AI trains longer, the math says it must eventually prefer the flat valley because it is more "degenerate" (meaning there are many, many ways to be in that valley, making it statistically dominant).
The Grokking Moment: The "Grokking" moment isn't magic. It's a Phase Transition. It's the exact moment the AI realizes, "Hey, this wide, flat valley is actually a better place to live than this narrow canyon," and it physically moves its weights to settle there.

The Experiments: Tracking the Switch

The authors didn't just do math; they ran experiments to test it.

The Tracker: They built a tool that measures the "Flatness Meter" (LLC) while the AI is training.
The Result: They found that the "Flatness Meter" curve closely tracks when the AI starts generalizing.
- When the LLC is high, the AI is memorizing.
- When the LLC drops, the AI has found the "generalization valley."
The Surprise: They found that even though the "Flatness Meter" is calculated using only the practice test data, it closely tracks how the AI will do on new questions. It's like looking at a map of a single room and being able to guess the layout of the whole house.

Why Does This Matter?

It Explains the "Aha!" Moment: It tells us that grokking isn't a glitch; it's a natural phase transition where the AI swaps a fragile solution for a robust one.
It Gives Us a Crystal Ball: By watching the "Flatness Meter" (LLC), we can predict when an AI is about to learn the rules, even before it starts getting better at the test.
It Helps Tune the AI: They found that changing the "learning rate" (how big of a step the AI takes) changes how long it takes to find the flat valley. A bigger step size helps the AI jump over the narrow canyon and land directly in the wide meadow, making the "Grokking" happen faster.

Summary Analogy

Imagine you are trying to park a car in a crowded lot.

Memorization is squeezing your car into a tiny, tight spot between two other cars. It fits perfectly right now, but if anyone moves even an inch, you get scratched.
Generalization is finding a big, open spot in the middle of the lot. It's harder to get to, and you might drive past it a few times, but once you park there, you are safe no matter what happens around you.

This paper explains that the AI is a driver who keeps trying to squeeze into the tiny spots (memorizing) for a long time, until suddenly, the math of the universe forces them to realize, "Wait, I should just drive to that big open spot." The "Flatness Meter" is the GPS that tells the driver exactly when that big open spot is within reach.

1. Problem Statement

Grokking is a phenomenon observed in deep learning, particularly on algorithmic tasks like modular arithmetic, where a model achieves near-zero training loss (memorization) but fails to generalize for an extended period. Eventually, after continued training, the model undergoes an abrupt transition to high test accuracy (generalization).
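
To make the task concrete, here is a minimal sketch of a modular arithmetic dataset. It assumes the operation is addition modulo a prime $p$ and that each operand is one-hot encoded and concatenated into a single input vector; the function names and the `train_fraction` parameter are illustrative and not taken from the paper.

```python
import random


def one_hot(index, size):
    """Return a list of `size` floats with a single 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def modular_addition_dataset(p, train_fraction=0.5, seed=0):
    """Build (input, label) pairs for (a + b) mod p and split them.

    Assumption: inputs are the concatenation of two one-hot vectors,
    so every input has length 2 * p.
    """
    pairs = [(a, b) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_train = int(train_fraction * len(pairs))

    def encode(a, b):
        return one_hot(a, p) + one_hot(b, p), (a + b) % p

    train = [encode(a, b) for a, b in pairs[:n_train]]
    test = [encode(a, b) for a, b in pairs[n_train:]]
    return train, test


if __name__ == "__main__":
    train, test = modular_addition_dataset(p=7)
    print(len(train), len(test), len(train[0][0]))
```

Because the full table of $p^2$ pairs is finite, a model can fit the training split by rote, and the held-out split then measures whether it has learned the rule.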

The central question addressed by the paper is: What determines which solution basin a model settles into when multiple basins fit the training data equally well? Specifically, why does the model eventually switch from a "memorization" basin to a "generalization" basin?

While empirical hypotheses suggest that "flatter" minima generalize better, the theoretical foundations remain incomplete, especially for singular models (like neural networks) where standard regularity assumptions (e.g., positive definite Fisher information) do not hold.

2. Methodology: Singular Learning Theory (SLT)

The authors analyze grokking through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape for singular models.

Local Learning Coefficient (LLC): The core metric used is the LLC ($\lambda$), which measures the local degeneracy (effective dimension) of the loss surface near a minimum.
- Geometric Interpretation: A lower $\lambda$ corresponds to a "flatter" or more degenerate basin with a larger volume of parameters yielding low loss.
- Bayesian Implication: In the Bayesian framework, the posterior mass concentrates in basins with lower LLC as the sample size $n$ increases.
- Generalization: The asymptotic expected Bayes generalization error is proportional to $\lambda$ (to leading order it is $\lambda / n$). Thus, lower LLC implies better generalization.
Phase Transition Mechanism: The paper posits that grokking is a first-order Bayesian phase transition. Initially, the optimizer finds a basin with low training loss but high LLC (memorization). As training continues (effectively increasing the "effective sample size" or allowing the optimizer to explore the landscape), the system transitions to a basin with lower LLC (generalization) because it eventually dominates the marginal likelihood.
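
The points above can be written out using the standard asymptotic formulas of SLT. This is a sketch and not the paper's own derivation: the basin labels $A$ (memorization) and $B$ (generalization) and the symbols $V$, $F_n$, and $L_n$ are notation assumed here, and lower-order terms are dropped.

```latex
\begin{align*}
% Volume of near-optimal parameters around a minimum w^*, up to log factors:
V(\epsilon) &= \mathrm{Vol}\{\, w \text{ near } w^* : L(w) - L(w^*) < \epsilon \,\}
  \;\approx\; c\,\epsilon^{\lambda} \qquad (\epsilon \to 0) \\
% Local free energy of a basin with minimum w^* and LLC \lambda:
F_n &\approx n\,L_n(w^*) + \lambda \log n \\
% Comparing basin A (memorization) with basin B (generalization):
F_n^{A} - F_n^{B} &\approx n\,\bigl(L_n(w_A^*) - L_n(w_B^*)\bigr)
  + (\lambda_A - \lambda_B)\,\log n \\
% If both basins fit the training data equally well, the loss terms cancel:
\frac{\text{posterior mass of } A}{\text{posterior mass of } B}
  &\approx e^{-(F_n^{A} - F_n^{B})} = n^{-(\lambda_A - \lambda_B)} \\
% Expected Bayes generalization error, to leading order:
\mathbb{E}[G_n] &\approx \frac{\lambda}{n}
\end{align*}
```

With equal training loss and $\lambda_A > \lambda_B$, the ratio in the fourth line shrinks as $n$ grows, which is the sense in which the lower-LLC basin comes to dominate.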

3. Key Contributions

The paper makes two primary contributions, bridging the gap between theoretical SLT and empirical deep learning dynamics:

A. Closed-Form Derivation of LLC for Quadratic Networks

The authors derive exact, closed-form expressions for the LLC in quadratic neural networks trained on modular arithmetic tasks. This is significant because most SLT applications rely on numerical estimation; here, the geometry is analytically tractable.
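
One way to see where symmetric matrices and per-neuron counts enter is to write the network out. This is a heuristic sketch: it assumes a two-layer network without biases, activation $z \mapsto z^2$, input dimension $d$, hidden width $K$, and $p$ outputs. The symbols $w_k$, $a_{jk}$, and $M_j$ are notation introduced here, and the counting argument is not claimed to be the paper's proof.

```latex
\begin{align*}
% Each output is a quadratic form in the input x:
f_j(x) &= \sum_{k=1}^{K} a_{jk}\,(w_k^\top x)^2 = x^\top M_j\, x,
\qquad M_j = \sum_{k=1}^{K} a_{jk}\, w_k w_k^\top, \quad j = 1,\dots,p \\
% Each M_j is a symmetric d x d matrix, so across p outputs:
\#\{\text{free entries of } M_1,\dots,M_p\} &= p \cdot \frac{d(d+1)}{2} \\
% Rescaling symmetry of one hidden neuron (leaves every f_j unchanged):
(w_k,\ a_{jk}) &\mapsto \bigl(c\,w_k,\ a_{jk}/c^{2}\bigr), \qquad c \neq 0 \\
% So the d + p raw parameters of a neuron carry one redundant direction:
\#\{\text{effective parameters per neuron}\} &= d + p - 1
\end{align*}
```

For a regular (non-degenerate) parameterization the learning coefficient is half the number of parameters, so halving these two counts gives $p \cdot d(d+1)/4$ and $K(d+p-1)/2$, the same forms as the two regime expressions in this section.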

They distinguish between two regimes based on the hidden width $K$ relative to the input dimension $d$ (where $d = 2p$ for modular arithmetic with prime $p$):

Over-parameterized Regime ($K \geq \frac{d(d+1)}{2}$):
The LLC is determined by the dimension of the space of symmetric matrices:
$\lambda = p \cdot \frac{d(d+1)}{4}$
(Note: This assumes the solution spans the full symmetric space.)
Under-parameterized Regime ($K < \frac{d(d+1)}{2}$):
The LLC depends on the number of active neurons and the effective dimension of the feature space:
$\lambda = \frac{K(d + p - 1)}{2}$
(Note: In the specific modular arithmetic context with quadratic activation, bounds are derived based on the rank of the feature matrix.)
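
The two expressions can be evaluated with a small helper. This is a sketch that transcribes the formulas exactly as quoted in this summary, with $d = 2p$; the function name is hypothetical and the formulas have not been checked against the paper here. As quoted, the two expressions need not agree at the boundary $K = d(d+1)/2$, so the helper should be read as a transcription and not as a verified result.

```python
from fractions import Fraction


def closed_form_llc(p, K):
    """Evaluate the LLC expressions quoted above for modulus p and width K.

    Assumption: input dimension d = 2 * p, as stated in the text.
    Returns an exact Fraction so half-integer values are not rounded.
    """
    if p < 1 or K < 1:
        raise ValueError("p and K must be positive integers")
    d = 2 * p
    symmetric_dim = d * (d + 1) // 2  # dimension of symmetric d x d matrices
    if K >= symmetric_dim:
        # Over-parameterized regime: lambda = p * d(d+1) / 4
        return Fraction(p * d * (d + 1), 4)
    # Under-parameterized regime: lambda = K * (d + p - 1) / 2
    return Fraction(K * (d + p - 1), 2)


if __name__ == "__main__":
    for width in (10, 60):
        print(width, float(closed_form_llc(p=5, K=width)))
```

For $p = 5$ the threshold is $d(d+1)/2 = 55$, so $K = 10$ falls in the under-parameterized branch and $K = 60$ in the over-parameterized one.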

B. Empirical Validation and Dynamics Tracking

The authors empirically verify these theoretical scaling laws and demonstrate that LLC trajectories can track the emergence of generalization.

They show that the LLC, calculated solely from training data, mirrors the validation loss curve.
They demonstrate that hyperparameters (like learning rate) modulate the "severity" of grokking by influencing the optimization path through the loss landscape.
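
The summary does not say how the LLC is estimated during training. A common estimator in the SLT literature is $\hat{\lambda} = n\beta\,(\mathbb{E}_w[L_n(w)] - L_n(w^*))$, with $\beta = 1/\log n$ and the expectation taken over a posterior localized around $w^*$ and sampled with Langevin dynamics. The sketch below applies it to one-dimensional toy losses with exact gradients (so it is plain unadjusted Langevin, not minibatch SGLD). The function name, the toy losses, and all default hyperparameters are assumptions made for illustration, not the authors' settings.

```python
import math
import random


def estimate_llc(loss, grad, w_star, n, steps=100_000, burn_in=2_000,
                 step_size=2e-4, gamma=1.0, seed=0):
    """Estimate the LLC at w_star for a scalar-parameter loss.

    Samples from exp(-n*beta*loss(w) - gamma/2 * (w - w_star)**2) using
    unadjusted Langevin dynamics, then returns
    n * beta * (mean sampled loss - loss(w_star)).
    """
    if n <= 1:
        raise ValueError("n must be greater than 1 so that log(n) > 0")
    rng = random.Random(seed)
    beta = 1.0 / math.log(n)  # inverse temperature
    noise_scale = math.sqrt(step_size)
    w = w_star
    total, kept = 0.0, 0
    for t in range(steps):
        # Gradient of the localized negative log-posterior.
        drift = n * beta * grad(w) + gamma * (w - w_star)
        w = w - 0.5 * step_size * drift + noise_scale * rng.gauss(0.0, 1.0)
        if t >= burn_in:
            total += loss(w)
            kept += 1
    return n * beta * (total / kept - loss(w_star))


if __name__ == "__main__":
    # Regular basin: L(w) = w^2, exact learning coefficient 1/2.
    quad = estimate_llc(lambda w: w * w, lambda w: 2.0 * w, 0.0, n=1000)
    # Degenerate (flatter) basin: L(w) = w^4, exact learning coefficient 1/4.
    quart = estimate_llc(lambda w: w ** 4, lambda w: 4.0 * w ** 3, 0.0, n=1000)
    print(f"quadratic: {quad:.3f}  quartic: {quart:.3f}")
```

Only the training loss enters the estimate, which is why a quantity of this kind can be tracked without a held-out set. The estimates are noisy and slightly biased by the step size and by $\gamma$, so they should land near, not exactly on, the exact values.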

4. Key Results

Theoretical Scaling Laws

Linear Relationship with Dimension: Experiments confirm that the final LLC scales linearly with the modulus $p$ (recall that the input dimension is $d = 2p$) and the hidden layer width $K$, matching the derived closed-form expressions.
Memorization vs. Generalization:
- Early Stage (Memorization): The model operates in a "lazy" or "NTK" regime where weights change minimally. The LLC is higher (sharper basin), corresponding to overfitting the training data without learning the underlying structure.
- Late Stage (Generalization): As the model enters a "feature learning" regime, the LLC drops. This drop coincides with the abrupt improvement in test accuracy. The model has found a basin with lower effective dimension (higher degeneracy) that generalizes.

Grokking Severity and Hyperparameters

Learning Rate: The paper introduces a Grokking Severity Measure (GSM) to quantify the delay between memorization and generalization.
- Finding: There is a negative correlation between the learning rate and GSM. Larger learning rates lead to faster generalization (lower severity).
- Mechanism: Larger learning rates allow the optimizer to escape sharp, high-LLC valleys (memorization) and land directly in or traverse toward high-degeneracy, low-LLC basins (generalization) more quickly.
LLC as a Predictor: The trajectory of the LLC serves as a reliable early warning signal for generalization, even before test accuracy improves. A decreasing LLC indicates the model is moving toward a generalizing solution.
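
The summary does not spell out how the GSM is computed. As a stand-in, the delay between memorization and generalization can be measured as the gap between the first step at which training accuracy and test accuracy each cross a threshold. The sketch below is a hypothetical proxy, not the paper's GSM; the function names and the 0.99 threshold are assumptions.

```python
def first_step_reaching(accuracies, threshold):
    """Return the first index whose accuracy is >= threshold, or None."""
    for step, acc in enumerate(accuracies):
        if acc >= threshold:
            return step
    return None


def grokking_delay(train_acc, test_acc, threshold=0.99):
    """Steps between fitting the training set and generalizing.

    Hypothetical proxy for grokking severity: larger values mean a longer
    plateau. Returns None if either curve never reaches the threshold.
    """
    t_train = first_step_reaching(train_acc, threshold)
    t_test = first_step_reaching(test_acc, threshold)
    if t_train is None or t_test is None:
        return None
    return t_test - t_train


if __name__ == "__main__":
    train_curve = [0.2, 0.9, 1.0, 1.0, 1.0, 1.0]
    test_curve = [0.1, 0.1, 0.1, 0.2, 0.6, 1.0]
    print(grokking_delay(train_curve, test_curve))  # prints 3
```

Computing this delay across runs with different learning rates is one simple way to look for the negative correlation described above.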

5. Significance and Implications

Theoretical Unification: The paper provides a rigorous mathematical explanation for grokking, framing it not as a mysterious algorithmic quirk but as a natural consequence of Bayesian model selection in singular spaces. It unifies the "flatness hypothesis" with SLT.
Diagnostic Tool: The Local Learning Coefficient is proposed as a practical, theoretically grounded metric for monitoring training dynamics. Unlike validation loss, which requires a test set, LLC can be estimated from training data to predict when a model will generalize.
Tractable Toy Model: By using quadratic networks (a simplified but non-trivial architecture), the authors provide a "toy model" with exact solutions for testing hypotheses that may also apply to more complex deep learning systems.
Optimization Insights: The findings suggest that optimization hyperparameters (like learning rate) should be tuned not just for convergence speed, but to facilitate the transition from high-complexity (memorization) basins to low-complexity (generalization) basins.

In summary, the paper establishes that grokking is a phase transition driven by the competition between solution basins of differing statistical complexity (LLC). The model eventually abandons the memorization basin for the generalization basin because the latter offers a superior trade-off between fit and complexity in the Bayesian posterior.