On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD

This paper analyzes the learning dynamics of two-layer over-parameterized linear networks trained with label noise SGD. It reveals a two-phase process in which the noise drives a transition from the lazy regime to the rich regime, improving generalization, a mechanism that also extends to Sharpness-Aware Minimization (SAM).

Tongcheng Zhang, Zhanpeng Zhou, Mingze Wang, Andi Han, Wei Huang, Taiji Suzuki, Junchi Yan

Published 2026-03-12

Here is an explanation of the paper "On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD," told in simple language with creative analogies.

The Big Idea: Why "Mistakes" Make AI Smarter

Imagine you are teaching a child to recognize animals. Usually, you want perfect examples: a picture of a cat labeled "cat," a dog labeled "dog."

But what if, every now and then, you accidentally told the child, "This is a dog" when it was actually a cat? You might think this would confuse them. Surprisingly, recent research shows that making these small, random mistakes actually helps the child learn better. They end up recognizing animals more accurately and remembering the essential features (like "has whiskers") rather than memorizing every single photo.

This paper explains why this happens mathematically. It looks at how Artificial Intelligence (AI) learns when we intentionally add "noise" (mistakes) to the labels during training.
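Concretely, in the paper's regression setting, "label noise" means training on y + ε instead of the true label y, with a fresh Gaussian ε drawn at every SGD step. A minimal sketch (the function name and noise scale are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_label(y_true, sigma=0.5, rng=rng):
    """Return the label with a small, fresh Gaussian 'mistake' added.

    sigma controls how loud the mistakes are; a new draw is made
    every time, so the noise averages out to zero over training.
    """
    return y_true + sigma * rng.standard_normal()

# The noise is unbiased: over many draws, noisy labels average
# back to the true label.
mean_noisy = np.mean([noisy_label(2.0) for _ in range(5000)])
```

Because the noise is zero-mean, it does not bias what the network fits on average; its effect, as the paper shows, is entirely in how it reshapes the training dynamics.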


The Two-Phase Dance of Learning

The authors discovered that when an AI learns with these noisy labels, training unfolds in two distinct "phases" or stages, like a dancer moving from a stiff rehearsal to a fluid performance.

Phase 1: The "Shrinking" Phase (Escaping the Lazy Zone)

The Analogy: Imagine a group of stiff, wooden mannequins (the AI's neurons) standing in a room. At the start, they are frozen in place. They are "lazy." They haven't really moved or learned anything new; they are just mimicking the starting position. This is called the Lazy Regime.

When we introduce label noise (the mistakes), it's like shaking the room.

  • What happens: The noise causes the "second layer" of the mannequins (the ones holding the signs) to vibrate or oscillate wildly.
  • The Result: Because of this shaking, the "first layer" (the legs and bodies) starts to shrink. The weights (the strength of the connections) get smaller and smaller.
  • Why it matters: As the weights shrink, the mannequins stop being stiff. They break free from their frozen starting positions. They enter the Rich Regime. Now, they are flexible and can actually start learning complex patterns instead of just copying the starting setup.

Key Takeaway: The noise acts like a "shaker." It forces the AI to stop being lazy and start moving, shrinking away the unnecessary bulk.

Phase 2: The "Alignment" Phase (Finding the Truth)

The Analogy: Now that the mannequins are flexible and moving, they need to find the right pose. Imagine there is a "Ghost of Truth" (the perfect solution) standing in the center of the room.

  • What happens: Because the first layer has shrunk and become flexible, the mannequins can now easily rotate and align themselves. They all start pointing in the same direction as the "Ghost of Truth."
  • The Result: The AI stops guessing randomly. It locks onto the correct pattern. It becomes sparse, meaning it only keeps the most important connections and discards the rest.
  • Why it matters: This is when the AI actually learns the "real" rules of the game. It converges to a solution that is simple, efficient, and works well on new, unseen data.

Key Takeaway: Once the AI is flexible, the noise helps it snap into the perfect, simple shape that solves the problem.
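Both phases can be watched in a toy simulation: a two-layer linear network f(x) = a·(Wx) trained by SGD on labels corrupted with fresh Gaussian noise. All sizes, scales, and step counts below are illustrative choices, not the paper's; the sketch only demonstrates the qualitative story (first-layer norm shrinks, end-to-end weights align with the sparse ground truth).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: sparse ground truth, over-parameterized two-layer linear net.
d, m, n = 10, 20, 200                 # input dim, hidden width, samples
w_star = np.zeros(d)
w_star[0] = 1.0                       # sparse "Ghost of Truth"
X = rng.standard_normal((n, d))
y = X @ w_star                        # clean labels

W = 0.3 * rng.standard_normal((m, d))  # first layer (the "legs and bodies")
a = 0.3 * rng.standard_normal(m)       # second layer (the "sign holders")
norm_W_init = np.linalg.norm(W)

lr, sigma, steps = 0.01, 0.8, 15000
for _ in range(steps):
    i = rng.integers(n)
    x = X[i]
    y_noisy = y[i] + sigma * rng.standard_normal()  # fresh label noise
    h = W @ x
    err = a @ h - y_noisy
    # Simultaneous SGD update of both layers.
    a, W = a - lr * err * h, W - lr * err * np.outer(a, x)

# Phase 1: the first layer's norm has shrunk from its initialization.
norm_W_final = np.linalg.norm(W)
# Phase 2: the end-to-end linear map W.T @ a aligns with w_star.
w_end = W.T @ a
alignment = w_end @ w_star / (np.linalg.norm(w_end) * np.linalg.norm(w_star))
print(f"||W||: {norm_W_init:.2f} -> {norm_W_final:.2f}, "
      f"cosine with truth: {alignment:.3f}")
```

Running this, the Frobenius norm of W ends up noticeably below its starting value while the end-to-end weight points almost exactly at w_star, which is the shrink-then-align picture described above.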


Why Does This Matter?

For a long time, noise was treated as the enemy: if you wanted a smart AI, you needed perfect, clean data. This paper makes the mathematical case for the opposite view, at least for the networks it studies: Noise is a feature, not a bug.

  1. It prevents overthinking: Without noise, AI might get stuck in a "lazy" state where it just memorizes the training data (like a student memorizing answers without understanding). Noise forces it to understand the underlying logic.
  2. It creates simpler models: The "shrinking" phase naturally removes unnecessary parts of the AI. This leads to models that are smaller, faster, and easier to run on phones or laptops.
  3. It applies to other tools: The authors also showed that this same "shaking" effect happens with another popular AI tool called SAM (Sharpness-Aware Minimization). So, the principle of "shake it to make it better" is a general rule for modern AI.
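For reference, a single SAM update has the same "shake" structure: take a small ascent step along the normalized gradient, then descend using the gradient evaluated at that perturbed point. Below is a minimal sketch for a plain linear model (the specific learning rate, radius rho, and demo data are illustrative assumptions, not from the paper):

```python
import numpy as np

def sam_step(w, x, y, lr=0.1, rho=0.05):
    """One SAM update for the scalar linear model pred = w @ x."""
    err = w @ x - y                               # residual
    g = err * x                                   # gradient of 0.5 * err**2
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)  # ascent: worst nearby point
    g_adv = (w_adv @ x - y) * x                   # gradient at perturbed weights
    return w - lr * g_adv                         # descend from the original w

# Usage: fit a single data point with repeated SAM steps.
w = np.zeros(2)
x_demo, y_demo = np.array([1.0, 2.0]), 3.0
for _ in range(200):
    w = sam_step(w, x_demo, y_demo)
residual = abs(w @ x_demo - y_demo)
```

Note that with a fixed rho the residual hovers near, rather than at, zero; that persistent "wobble" around the minimum is exactly the shaking the paper connects to label noise.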

The "Secret Sauce" Explained Simply

Think of the AI as a team of rowers in a boat.

  • Without Noise: They all row in perfect, stiff unison, but they are rowing in a straight line that doesn't go anywhere useful. They are "lazy."
  • With Noise: The coach (the algorithm) occasionally yells "Row left!" when they should row right. This confuses them for a second, causing them to wobble.
  • The Magic: That wobble breaks their stiff formation. They start adjusting their oars, dropping the heavy, useless ones (shrinking weights), and eventually, they all find the perfect rhythm to row straight toward the finish line (the ground truth).

Conclusion

This paper provides the mathematical proof for a counter-intuitive idea: To build a smarter, more generalizable AI, we should sometimes let it make mistakes.

The noise acts as a catalyst. It first forces the AI to stop being rigid (Phase 1), and then guides it to find the simplest, most accurate solution (Phase 2). It turns out that a little bit of chaos is exactly what order needs to emerge.