Marginals Before Conditionals

The Big Idea: Learning to Guess Before Learning to Know

Imagine you are trying to teach a robot how to predict the future. The researchers built a specific game to see how the robot learns. They discovered that the robot doesn't learn the "perfect" answer immediately. Instead, it goes through two distinct phases:

The "Safe Guess" Phase: The robot learns to make a good average guess, ignoring specific clues. It gets stuck here for a long time.
The "Aha!" Moment: Suddenly, the robot figures out how to use a specific clue to get the exact right answer. This happens all at once, like a light switch flipping.

The paper is about understanding why the robot gets stuck in the first phase and what finally pushes it into the second.

The Game: The "Magic Box" Analogy

To test this, the researchers created a simple puzzle:

The Setup: Imagine a Magic Box (let's call it B) that contains a secret code (A).
The Problem: One Magic Box actually holds many different codes inside it. If you just look at the Box, you can't know which code is inside. It's like a vending machine where a single button covers 10 different slots, each holding a different snack. If you press the button alone, you get a random snack.
The Clue: There is a special Selector Token (let's call it Z). If you tell the robot, "I want the snack from the red slot," the robot can pick the exact right one.
The Goal: The robot needs to learn that Box + Selector = Exact Snack.

Phase 1: The "Plateau" (The Long Wait)

When the training starts, the robot is smart enough to realize: "I don't know which snack is in the red slot yet, but I know the box usually contains snacks."

So, the robot stops trying to guess the specific snack. Instead, it learns to say: "I'll just guess that any snack from that box is equally likely."

The Result: The robot's loss (its measure of how wrong it is) drops to a specific level (mathematically, $\log K$, where $K$ is the number of snacks per box) and then stops moving. It hits a flat "plateau."
The Analogy: Imagine you are trying to find a specific friend in a crowded stadium. You don't know which section they are in. So, you just stand in the middle of the stadium and shout, "I'm guessing they are somewhere in here!" You aren't wrong (you are technically in the stadium), but you aren't finding them either. You are stuck in a "safe" position.

Key Discovery 1: The Length of the Wait
The researchers found that how long the robot stays stuck depends on how many total examples it has to learn, not how confusing the puzzle is.

Analogy: If you have 1,000 box-and-selector combinations to learn in total, it takes a long time to figure out the trick for all of them. If you have 10,000 combinations, it takes even longer. For a fixed total, it doesn't matter whether each box has 3 snacks or 36 snacks inside; the time it takes to learn the trick is determined by the total volume of work (the dataset size), not the complexity of the individual boxes.

Phase 2: The "Snap" (The Collective Leap)

After thousands of steps of being stuck, something striking happens. The robot doesn't slowly get better at one box at a time. Instead, nearly all the boxes get solved within the same short window.

The Analogy: Imagine a room full of people trying to solve a puzzle. For a long time, everyone is guessing randomly. Then, suddenly, a whisper goes through the room, and everyone figures out the solution in the same second. It's a "collective snap."
The Internal Change: Inside the robot's brain (the neural network), a specific part of the circuit (a "selector-routing head") starts building up before the robot actually gets the answer right. It's like the robot is building the ladder while it's still standing on the ground, and only when the ladder is finished does it climb up to the solution.

Why Does It Get Stuck? (The "Entropic Force")

Why doesn't the robot just figure it out faster? The paper suggests that noise (randomness in the learning process) actually traps the robot.

The Analogy: Imagine the robot is a ball sitting in a very wide, flat valley (the "marginal solution"). To get to the "perfect answer," it needs to roll up a tiny, shallow hill to get to a deeper valley on the other side.
The Trap: Because the hill is so flat, the random shaking (noise) of the ground makes the ball wobble back and forth. It's hard for the ball to find the tiny path up the hill because the shaking keeps pushing it back into the flat valley.
The Surprise: Usually, we think more noise helps you escape a trap. Here, more noise actually makes it take longer to escape. The randomness keeps the robot comfortable in its "safe guess" mode.

The "Arrow of Time" (Forward vs. Backward)

The paper also looked at learning in reverse.

The Backward Task (Hard): "Given the Box and the Selector, what is the Snack?" (This is the puzzle we just solved).
The Forward Task (Easy but Slow): "Given the Snack, what was the Box?"
The Twist: Even though the Forward task is logically simpler (no ambiguity), the robot learns it more slowly.
Why? The Backward task has a structure (the Box groups the snacks) that helps the robot build a "shortcut" (a circuit) to solve it. The Forward task is just a list of random pairs with no structure, so the robot has to memorize every single one individually. It's like learning a song with a chorus (easy to remember the pattern) vs. memorizing a phone book (harder because there's no pattern).

Summary: What Did We Learn?

Staged Learning: AI doesn't learn everything at once. It learns the "average" first, gets stuck, and then suddenly learns the "specifics."
Volume Matters: The time it takes to break out of the "stuck" phase depends on how much data you feed the model, not how hard the puzzle is.
Noise is a Double-Edged Sword: Randomness in the training process can actually keep the AI stuck in a "good enough" state for a long time.
The "Snap": When the AI finally learns, it happens all at once across the whole system, not piece by piece.

In plain English: The AI is like a student who is afraid to guess the specific answer, so they just give the "average" answer for a long time. They only start guessing the specific answer when they have seen enough examples to feel confident, and once they get the confidence, they suddenly get everything right at once.

1. Problem Statement

The paper investigates the phenomenon of staged learning in neural networks, specifically focusing on the transition from learning marginal distributions to learning conditional distributions. While previous work (e.g., "grokking") has studied the delay between memorization and generalization, this work isolates a distinct transition: the delay between learning a partial solution (ignoring a selector token) and the full conditional solution (using the selector).

The core question is: Why do models learn to predict the marginal $P(A|B)$ first, stalling at a specific loss plateau, before suddenly "snapping" to the conditional solution $P(A|B, z)$? The authors aim to characterize the dynamics of this plateau, the factors determining its duration, and the internal mechanisms driving the transition.

2. Methodology

The Task: Surjective Map with $K$-fold Ambiguity

The authors constructed a minimal, controlled task ("wind tunnel") to isolate conditional learning:

Inputs: A base string $B$ (6 characters) and a selector token $z$ (2 characters).
Targets: A target string $A$ (4 characters).
Mapping: There are $n_b$ base strings. Each $B$ maps to $K$ distinct targets $A$. The selector $z$ indexes into these $K$ targets, making the mapping $(B, z) \to A$ one-to-one.
Information Theory:
- Marginal Entropy: $H(A|B) = \log K$ (uncertainty without $z$).
- Conditional Entropy: $H(A|B, z) = 0$ (perfect prediction with $z$).
Model Behavior: A model ignoring $z$ converges to a loss of $\log K$ (uniform distribution over candidates). A model utilizing $z$ achieves zero loss.
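The construction above can be sketched in a few lines. This is a minimal illustration, not the authors' generator: the lowercase alphabet, the two-digit selector tokens shared across base strings, and the helper names `make_dataset` and `conditional_entropy` are all assumptions made for the example. It builds the $(B, z, A)$ triples and checks empirically that $H(A|B) = \log K$ while $H(A|B, z) = 0$.

```python
import math
import random
import string
from collections import defaultdict

def make_dataset(n_b, K, seed=0):
    """Build n_b * K triples (B, z, A): each 6-char base B has K distinct 4-char targets."""
    rng = random.Random(seed)

    def unique_strings(count, length):
        # Draw distinct random lowercase strings (alphabet is an assumption).
        seen, out = set(), []
        while len(out) < count:
            s = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            if s not in seen:
                seen.add(s)
                out.append(s)
        return out

    bases = unique_strings(n_b, 6)
    targets = unique_strings(n_b * K, 4)          # globally distinct targets
    selectors = [f"{k:02d}" for k in range(K)]    # hypothetical 2-char selector tokens
    return [(B, selectors[k], targets[i * K + k])
            for i, B in enumerate(bases) for k in range(K)]

def conditional_entropy(pairs):
    """Empirical H(Y|X) in nats from a list of (x, y) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, y in pairs:
        counts[x][y] += 1
    n = len(pairs)
    h = 0.0
    for ys in counts.values():
        total = sum(ys.values())
        for c in ys.values():
            h -= (c / n) * math.log(c / total)    # -p(x, y) * log p(y | x)
    return h

data = make_dataset(n_b=50, K=8)
h_marginal = conditional_entropy([(B, A) for B, z, A in data])
h_conditional = conditional_entropy([((B, z), A) for B, z, A in data])
print(h_marginal, math.log(8), h_conditional)
```

Because every base string appears with exactly $K$ equally frequent targets, the first quantity matches $\log K$ and the second vanishes, which is the gap the selector has to close.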

Experimental Setup

Architecture: 4-layer Transformer ($d=128$, 4 heads, $\sim$600K parameters) trained with AdamW.
Metrics:
- Loss: Cross-entropy loss.
- $\Delta_z$ (z-shuffle gap): The difference in loss between using original $z$ tokens and shuffled $z$ tokens within a batch. $\Delta_z = 0$ indicates the model ignores $z$; $\Delta_z > 0$ indicates $z$ is being used.
- Waiting Time ($\tau$): The first training step at which the loss drops below 50% of the plateau height ($\log K$).
Variables: Systematic sweeps over dataset size ($D$), ambiguity ($K$), batch size ($B$, not to be confused with the base string $B$), learning rate ($\eta$), and label noise.
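The two diagnostic metrics can be written down compactly. The sketch below is an assumption-laden illustration rather than the paper's evaluation code: `predict` stands in for any model that returns class probabilities, the sign convention (shuffled loss minus original loss) is chosen so that $\Delta_z > 0$ means $z$ is used, and the two toy "models" exist only to exercise the functions.

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-likelihood; probs is (N, C), targets is (N,) of class indices."""
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked)))

def z_shuffle_gap(predict, b, z, a, rng):
    """Delta_z: loss with z permuted within the batch minus loss with the original z."""
    original = cross_entropy(predict(b, z), a)
    shuffled = cross_entropy(predict(b, rng.permutation(z)), a)
    return shuffled - original

def waiting_time(losses, K, frac=0.5):
    """First step whose loss is below frac * log K, or None if that never happens."""
    below = np.flatnonzero(np.asarray(losses) < frac * np.log(K))
    return int(below[0]) if below.size else None

# Toy setting (hypothetical): the selector directly names the target index.
K = 4
rng = np.random.default_rng(0)
b = rng.integers(0, 10, size=256)
z = rng.integers(0, K, size=256)
a = z.copy()

def ignores_z(b, z):
    # Plateau-like model: uniform over the K candidates regardless of z.
    return np.full((len(z), K), 1.0 / K)

def uses_z(b, z):
    # Post-transition-like model: almost all mass on the candidate named by z.
    probs = np.full((len(z), K), 0.01 / (K - 1))
    probs[np.arange(len(z)), z] = 0.99
    return probs

gap_plateau = z_shuffle_gap(ignores_z, b, z, a, rng)
gap_routed = z_shuffle_gap(uses_z, b, z, a, rng)
tau = waiting_time([1.40, 1.39, 1.38, 0.90, 0.50, 0.01], K=K)
```

In this toy, the model that ignores $z$ has a gap of exactly zero, while the routing model is penalized heavily once its selectors are scrambled.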

3. Key Contributions & Results

A. The Two-Stage Learning Dynamic

Every run with $K > 1$ exhibits two distinct regimes:

Marginal Plateau: Within a few hundred steps, the model converges to a loss of exactly $\log K$. During this phase, $\Delta_z \approx 0$, meaning the model ignores the selector token and predicts a uniform distribution over the $K$ candidates.
Sharp Transition: After a waiting period, the loss drops sharply to near zero, and $\Delta_z$ becomes positive, indicating the model has learned to route based on $z$.
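The plateau value follows directly from the cross-entropy of a uniform guess. As a short worked step, assuming the loss is measured in nats per target and the model spreads its probability evenly over the $K$ valid candidates for a given $B$:

$$
\mathcal{L}_{\text{plateau}} = -\log \frac{1}{K} = \log K = H(A|B), \qquad \text{e.g. } K = 8 \;\Rightarrow\; \log 8 \approx 2.08 .
$$

This is the lowest loss any predictor can reach without reading $z$, which is why the plateau sits at that height rather than somewhere nearby.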

B. Determinants of Plateau Duration

The most significant finding is that the duration of the plateau ($\tau$) depends on the dataset size ($D$), not the ambiguity ($K$).

Control Experiment: When $D$ is held constant (by adjusting $n_b = D/K$) while varying $K$, $\tau$ remains flat.
Scaling Law: When $D$ varies, $\tau$ scales super-linearly: $\tau \propto D^{1.19}$.
Implication: The model must learn to route $z$ across all $D$ examples. The complexity of the ambiguity ($K$) does not independently slow down learning; only the total volume of data to be processed matters.
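An exponent like the one above is typically estimated with a straight-line fit in log-log space. The sketch below shows that procedure together with the $n_b = D/K$ bookkeeping used in the control experiment. The helper names are hypothetical, and the $\tau$ values are synthetic placeholders generated from the stated exponent, not measurements from the paper.

```python
import numpy as np

def fit_power_law(D, tau):
    """Least-squares fit of log(tau) = alpha * log(D) + log(c); returns (alpha, c)."""
    alpha, log_c = np.polyfit(np.log(D), np.log(tau), deg=1)
    return float(alpha), float(np.exp(log_c))

def n_bases_for(D, K):
    """Number of base strings that keeps the dataset size at D = n_b * K."""
    if D % K:
        raise ValueError("D must be divisible by K")
    return D // K

# Synthetic placeholder data: tau generated from the reported exponent,
# with an arbitrary prefactor, purely to exercise the fit.
D = np.array([1_000, 2_000, 5_000, 10_000, 20_000])
tau = 0.3 * D ** 1.19

alpha, c = fit_power_law(D, tau)
print(f"fitted exponent: {alpha:.2f}")

# Holding D fixed while varying K, as in the control experiment.
for K in (3, 12, 36):
    print(K, n_bases_for(3_600, K))
```

Under an exponent of 1.19, doubling $D$ multiplies the waiting time by $2^{1.19} \approx 2.28$, slightly more than the factor of 2 that linear scaling would give.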

C. Collective Transition (The "Snap")

The transition is collective, not incremental.

At $\tau/2$ , 0% of sampled groups have achieved high accuracy.
At $\tau$ , nearly 100% of groups transition simultaneously within a narrow window (approx. $0.5\tau$).
This suggests a shared internal circuit forms and becomes operational for all groups at once, rather than groups learning individually over time.
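One way to make "collective versus incremental" measurable is to track, per checkpoint, the fraction of groups whose accuracy exceeds a threshold, and to record when the first and last groups cross it. The sketch below assumes a per-group accuracy array is already available; the 0.9 threshold and the function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def fraction_transitioned(group_acc, threshold=0.9):
    """group_acc: (checkpoints, groups) accuracies -> fraction of groups above threshold."""
    return (np.asarray(group_acc) >= threshold).mean(axis=1)

def transition_window(group_acc, threshold=0.9):
    """Checkpoint indices at which the earliest and latest groups first cross the threshold."""
    crossed = np.asarray(group_acc) >= threshold
    if not crossed.any(axis=0).all():
        raise ValueError("some groups never cross the threshold")
    first_cross = crossed.argmax(axis=0)   # index of the first True per group
    return int(first_cross.min()), int(first_cross.max())

# Toy accuracies for 3 groups over 4 checkpoints (made-up numbers).
acc = np.array([
    [0.10, 0.10, 0.10],
    [0.20, 0.10, 0.20],
    [0.95, 0.92, 0.50],
    [1.00, 1.00, 0.97],
])
frac = fraction_transitioned(acc)
window = transition_window(acc)
```

A collective transition shows up as a fraction curve that stays at zero and then jumps, with a short window between the earliest and latest crossing; incremental learning would instead give a slow ramp and a wide window.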

D. Entropic Stabilization

The plateau is not a local minimum but a metastable saddle point stabilized by gradient noise.

Gradient Noise Effect: Increasing noise (via smaller batch sizes or higher learning rates) delays the transition.
- Higher $\eta$: $3.6\times$ slower transition (at fixed throughput).
- Smaller Batch Size: $1.8\times$ slower transition (at fixed throughput).
Mechanism: The marginal solution sits at a saddle with extreme anisotropy ($\lambda_{max} \approx 2.8$, $\lambda_{min} \approx -0.005$). The escape direction is $\sim500\times$ shallower than the dominant curvature. Gradient noise tends to project onto the high-curvature (non-escape) directions, acting as an "entropic force" that pushes the model back into the marginal solution, stabilizing it.
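The scale of the anisotropy is easy to check from the two quoted eigenvalues. The snippet below computes the curvature ratio and, as a rough intuition aid only, the number of steps an idealized noiseless gradient descent would need to grow a displacement along the escape direction by a factor of $e$. That second quantity rests on assumptions the source does not make: a local quadratic model, plain gradient descent rather than AdamW, and a hypothetical learning rate.

```python
import math

lambda_max = 2.8     # dominant positive curvature, as quoted above
lambda_min = -0.005  # negative curvature along the escape direction, as quoted above

# Ratio of the steepest curvature to the magnitude of the escape curvature.
anisotropy = lambda_max / abs(lambda_min)

def efold_steps(eta, lam):
    """Steps for a displacement along a negative-curvature direction to grow by e
    under noiseless gradient descent on a quadratic: x <- (1 + eta * |lam|) * x."""
    return 1.0 / math.log1p(eta * abs(lam))

eta = 1e-3  # hypothetical learning rate, chosen only for illustration
print(f"anisotropy ~ {anisotropy:.0f}x")
print(f"e-folding steps along the escape direction ~ {efold_steps(eta, lambda_min):,.0f}")
```

The ratio comes out at about 560, consistent with the order-of-magnitude figure quoted above. The e-folding estimate illustrates why even an undisturbed optimizer would leave such a flat direction slowly; it does not model the noise-driven stabilization itself.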

E. Internal Mechanism & Circuit Formation

Selector-Routing Head: Causal ablation identifies a specific head in the first layer (L0H3) as the critical component.
Cascade Timing: The onset of $z$-dependence ($\Delta_z > 0$) precedes the loss drop by $\sim50\%$ of the waiting time. The internal circuit assembles before the global loss transition occurs.
Geometry: During the plateau, weight updates are random walks (cosine similarity $\approx 0$). At the transition, updates become coherent (cosine similarity $\approx 0.8$).
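The geometric diagnostic can be sketched as the cosine similarity between consecutive weight updates. The source does not spell out exactly which updates are compared, so treating it as consecutive checkpoint differences of the flattened weights is an assumption, as are the function name and the synthetic trajectories below.

```python
import numpy as np

def update_cosines(checkpoints):
    """Cosine similarity between consecutive weight updates.

    checkpoints: (T, P) array of flattened weights at T successive checkpoints.
    Returns an array of length T - 2.
    """
    deltas = np.diff(np.asarray(checkpoints, dtype=float), axis=0)
    prev, nxt = deltas[:-1], deltas[1:]
    dots = np.sum(prev * nxt, axis=1)
    norms = np.linalg.norm(prev, axis=1) * np.linalg.norm(nxt, axis=1)
    return dots / norms

rng = np.random.default_rng(0)
T, P = 50, 10_000

# Random walk: independent Gaussian steps, so successive updates are nearly orthogonal.
walk = np.cumsum(rng.normal(size=(T, P)), axis=0)

# Coherent drift: a fixed unit direction plus small noise on every step.
direction = np.ones(P) / np.sqrt(P)
drift_steps = direction + 0.1 * rng.normal(size=(T, P)) / np.sqrt(P)
drift = np.cumsum(drift_steps, axis=0)

cos_walk = update_cosines(walk).mean()
cos_drift = update_cosines(drift).mean()
```

In high dimensions, independent steps give cosines close to zero, while steps that share a direction give cosines close to one, which is the contrast the plateau-versus-transition measurement relies on.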

F. Directional Asymmetry (Connection to Reversal Curse)

The paper connects this to the "Reversal Curse" (models trained on $A \to B$ failing to infer $B \to A$).

Backward Task ($(B, z) \to A$): Structured, shared groups. Fast learning.
Forward Task ($A \to B$): Unambiguous but requires memorizing independent pairs without shared structure.
Result: The forward task is $1.7$–$4.4\times$ slower than the backward task. The shared group structure in the backward task scaffolds circuit formation, whereas the forward task requires independent memorization.

4. Significance and Implications

Mechanistic Understanding of "Grokking": This work provides a precise, information-theoretic decomposition of delayed generalization. It separates the "height" of the plateau (set by the ambiguity, $\log K$) from its "duration" (set by the dataset size $D$).
Role of Noise: Contrary to the intuition that noise helps escape local minima, this paper demonstrates that in high-dimensional anisotropic landscapes, gradient noise can stabilize suboptimal solutions (entropic stabilization) by preventing alignment with shallow escape directions.
Circuit Formation: It offers empirical evidence that internal circuits (like the selector-routing head) form and stabilize before they manifest in global performance metrics, suggesting that "hidden progress" is a measurable, dynamic process.
Reversal Curse Explanation: It provides a structural explanation for the reversal curse: learning directions that lack shared group structure (requiring independent memorization) are significantly slower than those with structured ambiguity.
Falsification of Alternatives: The authors systematically tested and falsified seven alternative hypotheses (e.g., incremental group coverage, barrier crossing, linear sufficiency), narrowing the viable explanations to the entropic stabilization of a saddle point.

Conclusion

The paper establishes that neural networks learn marginals before conditionals due to the entropic stabilization of a saddle point by gradient noise. The transition to the conditional solution is a collective, sharp event triggered when the optimizer aligns with a shallow escape direction, a process heavily dependent on dataset size rather than task ambiguity. This framework bridges information theory, optimization dynamics, and mechanistic interpretability to explain staged learning phenomena.