Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

This paper provides a quantitative large-deviations theory for the "catapult phase" in stochastic gradient descent training of shallow networks under NTK scaling, establishing an explicit criterion, based on a function G of the learning rate, data, and kernel, that determines whether large NTK-flattening spikes occur with high probability or have rapidly decaying probability.

Benjamin Gess, Daniel Heydecker

Published Thu, 12 Ma

Imagine you are trying to find the lowest point in a vast, foggy mountain range. This is what machine learning does: it tries to find the "perfect" set of settings (parameters) for a neural network that minimizes errors. The tool it uses to walk down the mountain is called Stochastic Gradient Descent (SGD).

Usually, we think of this process as a careful, steady walk. But in reality, especially when the network is huge and the "step size" (learning rate) is big, the walker doesn't just stroll; they sometimes take wild, giant leaps.

This paper, "Large Spikes in Stochastic Gradient Descent: A Large-Deviations View," by Benjamin Gess and Daniel Heydecker, explains why these wild leaps happen, when they happen, and why they are actually a good thing.

Here is the breakdown using simple analogies:

1. The "Catapult" Mechanism

Imagine you are walking down a hill, but the ground is bumpy. Sometimes, you step on a loose rock. Instead of just stumbling, the rock launches you high into the air.

  • The Spike: In the math world, this is a "spike." The error (loss) of the network suddenly shoots up to a massive number.
  • The Landing: You don't crash and burn. You land in a completely different spot on the mountain—often a spot that is much flatter and more stable than where you started.
  • The Paper's Insight: The authors prove that these spikes aren't just random accidents. They are a specific, predictable phase of the training process, which they call the "Catapult Phase."
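
The mechanism can be sketched in a standard two-parameter toy model (an illustration under my own simplifying assumptions, not the paper's shallow-network setting): fit f = w1·w2 to the target y = 1 by gradient descent. Near a minimum the loss curvature is roughly w1² + w2², and plain gradient descent is unstable once the learning rate exceeds 2/curvature, so starting near a sharp minimum produces exactly the spike-and-land behavior described above.

```python
# Toy "catapult" in the two-parameter model f = w1 * w2 fit to target y = 1.
# Illustrative sketch only, not the paper's setting:
#   loss L = (w1*w2 - 1)^2 / 2, curvature near a minimum ~ w1^2 + w2^2.
# Gradient descent is unstable when lr > 2 / curvature, so starting near a
# sharp minimum the loss spikes, then the iterate lands somewhere flatter.

def catapult(w1, w2, lr, steps=2000):
    losses, sharpness = [], []
    for _ in range(steps):
        r = w1 * w2 - 1.0                   # residual f - y
        losses.append(0.5 * r * r)
        sharpness.append(w1**2 + w2**2)     # proxy for loss curvature
        w1, w2 = w1 - lr * r * w2, w2 - lr * r * w1
    return losses, sharpness

# Start just off the sharp minimum (2, 0.5): curvature ~ 4.26 > 2/lr = 4.
losses, sharp = catapult(2.0, 0.51, lr=0.5)
print(f"initial loss {losses[0]:.2e}, peak loss {max(losses):.2e}")
print(f"initial sharpness {sharp[0]:.2f}, final sharpness {sharp[-1]:.2f}")
```

Plotting `losses` over time shows the signature shape: near-zero, a sudden hump, then a descent to a flatter minimum than the starting one.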

2. The Two Types of Weather (Inflationary vs. Deflationary)

The paper discovers that the mountain has two different "weather patterns" depending on how sharply curved the terrain is (curvature) and how big your steps are (learning rate). They use a special formula (let's call it the G-Function) to predict the weather.

  • 🌪️ The Inflationary Storm (G > 0):

    • What happens: If the conditions are right, the "wind" pushes you upward. You are guaranteed to take a giant leap.
    • The Result: You will almost certainly fly high, land in a new spot, and reduce the "sharpness" of the mountain (make the solution smoother). This is great! It helps the network learn better.
    • Analogy: It's like a rollercoaster that is guaranteed to launch you over a hill.
  • 🌧️ The Deflationary Drizzle (G < 0):

    • What happens: The wind is blowing against you. You shouldn't be able to fly. However, sometimes, by pure luck (random chance), you get a series of lucky steps that push you up anyway.
    • The Result: These leaps are rare, but not impossible. The paper calculates exactly how rare they are.
    • The Surprise: In the real world, we use massive networks (millions of parameters). Even if a leap is "rare" (say, 1 in a million), if you have a billion chances to take a step, you will eventually see it. The paper shows that these "lucky" leaps happen often enough in practice to be useful.
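
The paper's G-Function is an explicit but technical formula; here is a deliberately simplified caricature (my own stand-in, not the paper's G) that captures the weather metaphor for plain gradient descent on a quadratic hill. Each step multiplies the error by (1 − learning rate × curvature), so the sign of log|1 − η·λ| decides whether errors inflate or deflate:

```python
import math

def growth_exponent(lr, curvature):
    """Caricature of an inflation/deflation criterion for GD on a quadratic
    with curvature `curvature`: each step multiplies the error by
    (1 - lr*curvature), so log|1 - lr*curvature| > 0 means errors inflate."""
    return math.log(abs(1.0 - lr * curvature))

# Same hill (curvature 8), two step sizes:
print(growth_exponent(0.1, 8.0))   # negative: "deflationary", errors shrink
print(growth_exponent(0.4, 8.0))   # positive: "inflationary", guaranteed spike
```

The crossover sits at learning rate = 2/curvature, which is why bigger steps or sharper terrain tip the weather from drizzle to storm.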

3. Why "Spikes" Are Actually Good

You might think, "If the error goes up, that's bad!"

  • The Old View: We used to think we should avoid spikes at all costs.
  • The New View (This Paper): Spikes are the only way to escape "Lazy Training."
    • Lazy Training: Imagine the network is stuck in a deep, narrow valley (a "sharp minimum"). It's stable, but it's a bad spot because it doesn't generalize well to new data.
    • The Escape: To get out of this narrow valley, you need a huge push. A small step won't do it. You need a spike. The spike acts like a catapult that throws the network out of the narrow valley and into a wide, flat plain (a "flat minimum").
    • Flat Minima: These are the "good" spots where the network is robust and works well on new data.

4. The "Large Deviations" Secret

The title mentions "Large-Deviations." In simple terms, this is the math of unlikely events.

  • Usually, we ignore things that are "too unlikely" to happen.
  • The authors show that in the world of massive AI, "unlikely" doesn't mean "impossible." It just means "it takes a specific amount of time."
  • They provide a calculator (the formulas in the paper) that tells you:
    1. Will a spike happen for sure?
    2. If not, what are the odds?
    3. How big will the spike be?
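
To make "exactly how rare" concrete, here is a textbook large-deviations computation (a generic illustration, not a formula from the paper): the probability that the average of n standard Gaussian variables exceeds 1 decays like exp(−n·I) with rate I = 1/2, and −log(P)/n approaches that rate as n grows.

```python
import math

def tail_prob(n, a=1.0):
    """P(mean of n iid standard normals > a) = P(Z > a*sqrt(n))."""
    return 0.5 * math.erfc(a * math.sqrt(n) / math.sqrt(2.0))

# -log(P)/n should approach the large-deviations rate I(1) = 1/2.
rates = []
for n in (4, 25, 100, 400):
    p = tail_prob(n)
    rates.append(-math.log(p) / n)
    print(n, p, rates[-1])
```

This is the sense in which "unlikely" is quantified: not a vague "almost never," but an exponential decay rate you can compute and compare against how many chances the training run gets.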

5. The "ReLU" Twist

The paper also looks at networks that use a specific activation function called ReLU (a switch that passes positive numbers through unchanged and outputs zero for negative ones).

  • They found that with ReLU, the network splits into two independent "channels" (positive and negative).
  • The catapult mechanism works on these channels separately. If either channel gets a lucky spike, the whole system benefits.
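
The channel-splitting idea can be illustrated with a toy one-dimensional input (my sketch of the intuition, not the paper's construction): a hidden neuron relu(w·x) responds to positive inputs only if w > 0 and to negative inputs only if w < 0, so the two signs of the data train disjoint groups of neurons.

```python
def active_neurons(weights, x):
    """Indices of hidden neurons relu(w * x) that respond to input x:
    relu is nonzero exactly when w * x > 0."""
    return {j for j, w in enumerate(weights) if w * x > 0}

w = [0.7, -1.2, 0.3, -0.5]           # first-layer weights, 1-D input
pos = active_neurons(w, +1.0)        # neurons seeing positive inputs
neg = active_neurons(w, -1.0)        # neurons seeing negative inputs
print(pos, neg)                      # disjoint groups: two independent channels
```

Gradients for an input of one sign only touch the neurons in that sign's group, which is why the catapult mechanism can fire in each channel separately.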

Summary: The Takeaway

This paper is like a weather forecast for AI training.

  • Before: Engineers thought spikes were dangerous glitches to be avoided.
  • Now: We know spikes are a feature, not a bug. They are the mechanism that allows AI to escape bad, sharp solutions and find good, flat ones.
  • The Magic: The authors give us the exact math to predict when these "catapults" will fire. This explains why modern AI, which uses large learning rates and small batches, is so successful at finding high-quality solutions.

In a nutshell: Sometimes, to get to the bottom of the mountain, you have to be thrown into the air first. This paper explains the physics of that throw.