Imagine you are trying to find the lowest point in a vast, foggy mountain range (this represents training a neural network to make fewer mistakes). You have a map, but it's a bit blurry, and you can only see the ground immediately under your feet.
For decades, the standard advice for navigating this terrain was: "Take small, careful steps. If you step too fast, you'll overshoot the bottom and start bouncing up and down, never settling." In math terms, this meant your step size had to be smaller than a specific limit based on how "steep" or "curvy" the ground was.
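That old rule can be seen concretely on the simplest possible terrain: a one-dimensional quadratic bowl $f(x) = (L/2)\,x^2$, whose "curviness" is the constant $L$. This is a toy of our own making (not from the paper), but it shows exactly why the classical limit is $2/L$: below it, gradient descent settles into the bottom; above it, every step overshoots by more than it gained.

```python
# Toy illustration (not from the paper): gradient descent on the
# quadratic bowl f(x) = (L/2) * x**2, whose curvature is the constant
# L_curv. Classical theory says GD converges only if the step size
# (learning rate) stays below 2 / L_curv.
L_curv = 10.0              # curvature of the bowl
threshold = 2.0 / L_curv   # the classical step-size limit (0.2 here)

def run_gd(lr, steps=100, x=1.0):
    for _ in range(steps):
        x = x - lr * (L_curv * x)   # the gradient of f is L_curv * x
    return x

print(abs(run_gd(0.19)))  # just below the limit: shrinks toward 0
print(abs(run_gd(0.21)))  # just above the limit: blows up
```

Each step multiplies $x$ by $(1 - \text{lr} \cdot L)$; the run diverges exactly when that factor exceeds 1 in magnitude, which happens at step size $2/L$.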
But recently, researchers noticed something weird happening in modern AI training. When they took steps that were too big (violating the old rules), the system didn't crash. Instead, it found a strange, rhythmic dance. It would climb up a little, slide down a little, and hover right on the very edge of a cliff, never falling off but never fully stopping. This phenomenon is called the "Edge of Stability" (EoS).
This paper asks a big question: Does this "Edge of Stability" only happen when we walk in a straight line (Euclidean space), or does it happen even when we change the rules of how we measure distance?
Here is the breakdown of their discovery using simple analogies:
1. The Old Way vs. The New Way
- The Old Way (Euclidean GD): Imagine walking on a flat, grid-like city street. You measure distance by counting blocks (North/South, East/West). This is the standard way AI models usually learn.
- The New Way (Non-Euclidean GD): Imagine walking through a dense jungle or a city with weird, winding canals. Here, "distance" isn't just about blocks; it might be about how much energy it takes to push through the mud, or how many bridges you have to cross.
- The paper looks at methods like $\ell_\infty$-descent (where the "length" of a step is just its single biggest coordinate, ignoring the small ones) and Spectral GD (where you look at the whole shape of the terrain, like a matrix).
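To make the "walking styles" concrete, here is a minimal numpy sketch of what a single update step looks like in each geometry. The function names are ours, but the update rules follow the standard constructions: $\ell_\infty$-style (sign) descent moves every coordinate by the same amount, and spectral descent steps in the direction obtained by flattening the gradient matrix's singular values to 1.

```python
import numpy as np

def euclidean_step(w, grad, lr):
    # Standard GD: move opposite the raw gradient vector.
    return w - lr * grad

def linf_step(w, grad, lr):
    # l_inf-style (sign) descent: every coordinate moves by exactly
    # lr; only the sign of each gradient entry matters.
    return w - lr * np.sign(grad)

def spectral_step(W, G, lr):
    # Spectral descent for a matrix parameter W: take the SVD of the
    # gradient G = U S V^T and step along U V^T, i.e. keep the
    # gradient's "directions" but set all its singular values to 1.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)
```

The point of the contrast: Euclidean GD scales the step by how big the gradient is, sign descent only asks which way each coordinate points, and spectral descent only asks which directions in matrix space the gradient points.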
2. The "Sharpness" Meter
To understand if you are about to fall off a cliff, you need a "Sharpness Meter."
- In the old days, this meter measured the curvature of the ground. If the ground was too curvy (sharp), you had to take tiny steps.
- The authors realized that for these new, weird ways of walking, the old meter didn't work. So, they invented a Generalized Sharpness Meter.
- Analogy: If the old meter was a ruler, the new meter is a flexible tape measure that can stretch to fit the weird shape of the jungle or the grid. It measures "how curvy the ground feels" specifically for the way you are walking.
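For the old (Euclidean) meter, "how curvy the ground is" has a precise meaning: the largest eigenvalue of the loss's Hessian, which can be estimated by power iteration. Here is a small sketch on a hand-built Hessian, a toy of our own; the paper's generalized meter replaces this quantity with a norm-dependent analogue.

```python
import numpy as np

def sharpness(H, iters=100):
    # Estimate the largest eigenvalue of a symmetric Hessian H by
    # power iteration -- the classical Euclidean "sharpness meter".
    v = np.ones(H.shape[0])
    for _ in range(iters):
        v = H @ v
        v /= np.linalg.norm(v)
    return float(v @ H @ v)   # Rayleigh quotient at the converged v

H = np.diag([1.0, 3.0, 10.0])   # toy Hessian with curvatures 1, 3, 10
lam = sharpness(H)              # close to 10.0, the sharpest direction
lr = 0.21
print(lam > 2.0 / lr)           # classical theory predicts instability
```

In real training the Hessian is never built explicitly; the same power iteration runs on Hessian-vector products. The comparison at the end is the classical stability check: trouble is predicted once sharpness exceeds 2/step-size.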
3. The Big Discovery: The Edge is Everywhere
The team ran experiments on different types of AI models (like image recognizers and language models) using these new walking styles.
What they found:
No matter how weird the walking style was (whether it was the "jungle" style or the "grid" style), the AI always ended up doing the same thing:
- Progressive Sharpening: At first, the "Sharpness Meter" goes up. The ground gets curvier.
- The Edge of Stability: The meter hits a specific ceiling (mathematically, $2/\text{step-size}$). It doesn't go much higher, and it doesn't drop much lower. It hovers right there.
- The Dance: The AI starts oscillating (wiggling back and forth) right at that ceiling, but it keeps making progress overall.
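You can isolate the "dance" piece on a frozen quadratic (our toy, not the paper's setup): set the step size exactly at the ceiling, so sharpness equals 2/step-size, and the iterate hops from one wall of the bowl to the other forever, neither converging nor exploding. In real networks the loss is not a fixed quadratic, which is why training keeps making progress during the oscillation instead of stalling like this toy does.

```python
# Toy: GD on f(x) = (L/2) * x**2 with step size exactly 2/L.
# The update is x -> x - lr * L * x = -x: a perfect two-step "dance"
# right at the stability ceiling.
L_curv = 10.0
lr = 2.0 / L_curv
x, traj = 1.0, []
for _ in range(6):
    x = x - lr * (L_curv * x)
    traj.append(x)
print(traj)   # hops between -1.0 and 1.0, never settling, never diverging
```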
The Takeaway: The "Edge of Stability" isn't a fluke of standard AI training. It's a fundamental law of optimization. Whether you walk in a straight line or a zig-zag, if you take steps that are "just right" (or slightly too big), you will naturally settle into this rhythmic dance on the edge of the cliff.
4. Why Does This Matter?
Think of it like a tightrope walker.
- Old Theory: "If you walk too fast, you fall. So, walk slowly."
- New Reality: "Actually, if you walk at a specific speed, you enter a state of 'flow' where you wobble but don't fall. You can actually go faster and still stay balanced."
This paper shows, across many different types of optimizers (the algorithms that teach AI), that this "flow state" reliably appears. It suggests that we don't need to be as scared of taking big steps as we thought. As long as we understand the "shape" of the problem (the geometry), the AI will naturally self-correct and settle onto this stable edge, even if the math gets a little wild.
Summary in One Sentence
The paper shows that the strange, rhythmic "dancing" behavior AI models exhibit when trained fast isn't a bug or a fluke of standard methods; it's a universal rule that persists even when we change the fundamental geometry of how the AI moves, meaning these models naturally find the "edge of stability" no matter how they walk.