Imagine you are trying to roll a heavy ball down a hill to reach the very bottom (the "optimal solution" in machine learning). This is what optimization algorithms like Adam and Gradient Descent do.
For a long time, scientists knew that Adam was a superstar in deep learning, beating older methods like Gradient Descent (GD). But nobody knew exactly why or when it was so good. It was like knowing a specific car is faster on a racetrack but not understanding the engine mechanics.
This paper finally opens the hood and explains the secret: Adam is a master at navigating "flat, slippery valleys," while other methods get stuck.
Here is the breakdown using simple analogies:
1. The Terrain: The "Flat Valley" vs. The "Steep Hill"
Most math textbooks teach us about "steep hills" (Strongly Convex functions). If you drop a ball on a steep hill, it rolls down fast and predictably.
- Gradient Descent (GD): Like a hiker who takes small, careful steps. On a steep hill, they get there quickly.
- The Problem: In real-world Deep Learning, the "hills" are often actually flat valleys (degenerate losses, such as polynomials like x⁴ whose curvature vanishes at the minimum). Imagine a valley where the ground is perfectly flat for miles.
- If you are a hiker (GD) on a flat valley, your steps get smaller and smaller because the ground feels flat. You barely move. You get stuck in "slow motion."
- The Paper's Discovery: Adam doesn't get stuck. It zooms through these flat valleys.
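The flat-valley slowdown is easy to see on a one-dimensional toy problem. Here is a minimal sketch (my own toy example, not the paper's experiment) comparing gradient descent on a steep bowl, f(x) = x², against a flat valley, f(x) = x⁴:

```python
# Toy comparison: gradient descent on a steep bowl f(x) = x^2
# versus a flat valley f(x) = x^4, both starting at x = 1.

def gd(grad, x0, lr, steps):
    """Plain gradient descent: x <- x - lr * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

steep = gd(lambda x: 2 * x, x0=1.0, lr=0.1, steps=200)     # gradient of x^2
flat = gd(lambda x: 4 * x**3, x0=1.0, lr=0.1, steps=200)   # gradient of x^4

print(f"steep bowl (x^2) after 200 steps: {steep:.1e}")  # essentially zero
print(f"flat valley (x^4) after 200 steps: {flat:.1e}")  # still far from zero
```

On x² the error shrinks by a constant factor every step, but on x⁴ the gradient 4x³ vanishes as the hiker approaches the minimum, so the steps collapse and progress crawls.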
2. The Secret Weapon: The "Self-Adjusting Snowshoe"
Why does Adam work on flat ground? The paper identifies a clever mechanism involving two internal "memories" that Adam keeps:
- Memory A (The Direction): an average of recent gradients, i.e., which way is down (the first moment).
- Memory B (The Scale): an average of recent squared gradients, i.e., how steep the ground has typically been (the second moment).
The Magic Trick (Decoupling):
On a flat valley, the ground changes very slowly.
- Old methods (GD): They look at the ground, see it's flat, and take a tiny step. Tiny step = slow progress.
- Adam: It has a "snowshoe" (the second memory). As the ground gets flatter, Adam realizes, "Hey, the ground is so flat that my usual step size is too small!" So, it automatically makes its steps HUGE without anyone telling it to.
- The Analogy: Imagine you are walking on ice. If you slip (gradient is small), a normal walker stops. Adam, however, puts on giant snowshoes that instantly amplify your stride, allowing you to glide across the ice at high speed.
The paper proves mathematically that Adam's "snowshoes" (its effective step size) grow exponentially as the ground gets flatter. This turns a slow, sub-linear crawl into a fast, linear sprint.
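The decoupling can be sketched in a few lines. Adam's step is lr · m̂ / √v̂, a ratio of two gradient averages: scale every gradient down by a constant and the ratio (hence the step) is unchanged, whereas plain GD's step shrinks by that same constant. The helper below is my own illustration, not the paper's code:

```python
import math

def adam_steps(grads, lr=0.01, b1=0.9, b2=0.999):
    """Return the sequence of Adam step sizes for a given gradient stream."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g          # Memory A: direction (1st moment)
        v = b2 * v + (1 - b2) * g * g      # Memory B: scale (2nd moment)
        m_hat = m / (1 - b1**t)            # bias corrections
        v_hat = v / (1 - b2**t)
        steps.append(lr * m_hat / math.sqrt(v_hat))  # eps omitted for clarity
    return steps

big = adam_steps([1.0, 0.9, 0.8])
tiny = adam_steps([1e-6, 0.9e-6, 0.8e-6])  # a million times flatter ground
print(big[0], tiny[0])  # both ~0.01: Adam's stride survives the flat valley
# Plain GD with the same lr would take a 1e-8 step on the tiny gradients.
```

Because m and √v shrink together on flat ground, the walker's stride never collapses; the step size is set by the knobs, not by the raw gradient magnitude.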
3. The Three "Moods" of Adam (The Phase Diagram)
The authors also discovered that Adam isn't always perfect. Depending on how you tune its "personality knobs" (the hyperparameters β₁ and β₂), it behaves in three distinct ways:
Mood 1: The Smooth Operator (Stable Convergence)
- What happens: The snowshoes are sized perfectly. Adam glides smoothly to the bottom and stops exactly at the goal.
- When: When the knobs are tuned just right.
Mood 2: The Rollercoaster (Spikes)
- What happens: Adam zooms down fast, but then overshoots the bottom, flies up the other side, and crashes back down. It looks like a "spike" in the error graph.
- Why: It got too excited. The snowshoes were too big, and it couldn't stop in time. It eventually settles, but it makes a mess first.
Mood 3: The Shaky Dancer (Oscillation)
- What happens: Adam never really speeds up. It just wiggles back and forth in place, like a dog shaking off water.
- Why: The knobs are set so that Adam forgets its "snowshoes" too quickly. It stays coupled to the tiny gradients and can't amplify its steps. It acts like a different, slower algorithm entirely.
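You can poke at these moods yourself with the same toy flat valley. The harness below sweeps a few knob settings and reports where Adam ends up and how far it overshot; the specific settings are illustrative guesses of mine, not the paper's phase-diagram boundaries:

```python
import math

def adam_trajectory(b1, b2, x0=1.0, lr=0.01, eps=1e-8, steps=500):
    """Run Adam on the flat valley f(x) = x^4 and record every position."""
    x, m, v = x0, 0.0, 0.0
    traj = [x0]
    for t in range(1, steps + 1):
        g = 4 * x**3                       # gradient of x^4
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        x -= lr * (m / (1 - b1**t)) / (math.sqrt(v / (1 - b2**t)) + eps)
        traj.append(x)
    return traj

# Illustrative knob settings, not the paper's regime boundaries.
for b1, b2 in [(0.9, 0.999), (0.5, 0.999), (0.0, 0.9)]:
    traj = adam_trajectory(b1, b2)
    print(f"b1={b1}, b2={b2}: final x = {traj[-1]:+.3f}, "
          f"lowest point reached = {min(traj):+.3f}")
```

A lowest point well below zero signals the rollercoaster overshoot, while a final x that never gets close to zero signals the shaky dancer stuck wiggling in place.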
4. Why This Matters for AI
Deep learning models (like the ones powering Chatbots or Image Generators) live in these "flat valleys."
- Old Theory: We thought we needed to manually slow down the learning rate over time (a "scheduler") to make things work.
- New Insight: The paper shows that Adam has a natural, built-in ability to handle these flat spots automatically. It doesn't need a manual "slow down" instruction; its internal mechanics naturally speed it up when the path gets flat.
Summary
Think of Gradient Descent as a cautious hiker who gets lost on flat plains.
Think of Adam as a smart drone that detects the flatness and instantly switches to "turbo mode," gliding over the obstacles.
This paper explains the physics of that "turbo mode," proving that Adam's speed comes from a clever trick where it stops looking at the immediate ground and starts trusting its momentum to take giant, efficient leaps. This explains why Adam is the king of training modern AI, especially in complex, flat landscapes.