Adapt or Forget: Provable Tradeoffs Between Adam and… — Plain-Language Explanation

Original authors: Sharan Sahu, Abir Sarkar, Cameron J. Hogan, Martin T. Wells

Published 2026-05-07

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to follow a moving target in a foggy field. The target (the "optimal solution") is constantly shifting its position, and you can only see it through a blurry, noisy lens. Your goal is to stay as close to the target as possible.

This paper is a theoretical investigation into two different strategies for following this moving target: SGD (Stochastic Gradient Descent) and Adam (Adaptive Moment Estimation). While Adam is the "go-to" tool for training modern AI, this paper asks: Does Adam actually help when the world is changing, or does it sometimes make things worse?

Here is the breakdown of their findings using simple analogies.

The Two Runners

SGD (The Sprinter): This runner takes a step based only on what they see right now. If the ground looks like it slopes down, they step that way. They don't remember where they were five seconds ago.
- Strength: Because they don't carry baggage, they can react instantly when the target suddenly changes direction.
- Weakness: If the view is foggy (noisy data), they might take a wrong step based on a glitch in the fog.
Adam (The Marathoner with a Backpack): This runner carries a "backpack" of memory with two compartments.
- First-Moment Memory (The Compass): They remember the average direction they've been going. If the path is bumpy, they smooth out their steps by averaging past directions.
- Second-Moment Memory (The Terrain Map): They remember how steep the ground has been in the past. If a path was steep before, they take smaller steps there; if it was flat, they take bigger steps.
- Strength: In a foggy, bumpy environment, this memory helps them stay steady and not get knocked off course by random noise.
- Weakness: If the target suddenly sprints in a new direction, the runner's memory (the compass and map) is now "stale." They are still trying to follow the old path, causing them to lag behind.
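For readers who want to see the two runners as update rules, here is a minimal sketch on a single scalar parameter. It follows the textbook form of SGD and of Adam with bias correction; the function names `sgd_step` and `adam_step`, the `state` dictionary, and the default values are illustrative assumptions, not code from the paper.

```python
import math

def sgd_step(theta, grad, lr=0.01):
    """Plain SGD: step against the current gradient only, with no memory."""
    return theta - lr * grad

def adam_step(theta, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam step on a scalar parameter (hypothetical helper).

    `state` holds the step count t, the first moment m (the "compass"),
    and the second moment v (the "terrain map").
    """
    state["t"] += 1
    # The compass: running average of past gradients.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # The terrain map: running average of past squared gradients.
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for both averages starting at zero.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
theta_sgd = sgd_step(0.0, grad=5.0)
theta_adam = adam_step(0.0, grad=5.0, state=state)
```

On the very first step, the SGD move scales with the gradient (here 0.01 × 5), while the Adam move is roughly `lr` in size whatever the gradient's magnitude, because the gradient is divided by its own running scale.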

The Big Discovery: The "Noise vs. Drift" Tradeoff

The paper proves mathematically that there is a fundamental tradeoff. You cannot win in both scenarios with the same strategy.

Scenario A: The "Drift-Dominated" World (The Target is Running Fast)

Imagine the target is sprinting across the field, changing direction rapidly.

What happens: Adam's "backpack" becomes a liability. The runner is looking at an old map and following an old compass. By the time they adjust their memory to the new direction, the target has moved again.
The Result: SGD wins. The sprinter who ignores the past and reacts only to the present can keep up with the fast-moving target better than the runner burdened by memory.
Paper's Claim: In high-drift regimes, the "stale" information in Adam actually hurts performance, creating a larger gap between you and the target.
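The "old compass" effect can be made concrete with a small, noise-free sketch. Assume the gradient has been +1 for a long time and then flips to -1; the code counts how many steps the first-moment average needs before it points the new way. The helper `steps_until_ema_flips` is a hypothetical name, and this toy setup is an illustration rather than the paper's analysis.

```python
def steps_until_ema_flips(beta, max_steps=10_000):
    """Count steps until an exponential moving average changes sign.

    Assumption: the average has fully settled on the old gradient (+1),
    and every new gradient is -1 from now on.
    """
    m = 1.0
    for k in range(1, max_steps + 1):
        m = beta * m + (1 - beta) * (-1.0)
        if m < 0:
            return k
    return None

# beta = 0.0 behaves like SGD (no memory); larger beta means a heavier backpack.
lags = {beta: steps_until_ema_flips(beta) for beta in (0.0, 0.9, 0.99)}
print(lags)  # {0.0: 1, 0.9: 7, 0.99: 69}
```

With no memory the direction corrects in one step; with `beta = 0.99` the averaged direction keeps pointing the wrong way for dozens of steps, which is the lag the analogy describes.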

Scenario B: The "Noise-Dominated" World (The Target is Standing Still, but the Fog is Thick)

Imagine the target is standing still, but the wind is blowing debris everywhere, making it hard to see the ground.

What happens: SGD, the sprinter, gets confused by every gust of wind and stumbles around. Adam, the marathoner, uses its memory to say, "Okay, that gust of wind was just noise; the general trend is still here."
The Result: Adam wins. The adaptive memory smooths out the chaos, allowing the runner to stay closer to the target than the jittery sprinter.
Paper's Claim: In high-noise regimes, Adam's ability to average out the noise makes it superior to SGD.
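The smoothing benefit can also be quantified in a simple case. Assuming the gradient noise is independent from step to step with unit variance (an assumption made here for illustration), the variance of an exponential moving average equals the sum of its squared weights, which has the closed form (1 - β)/(1 + β). The sketch below checks that identity numerically; the function names are hypothetical.

```python
def ema_noise_variance_factor(beta, n_terms=2_000):
    """Sum of squared EMA weights (1 - beta) * beta**k.

    For independent unit-variance noise, this is the variance that
    remains after averaging (truncated at n_terms past samples).
    """
    return sum(((1 - beta) * beta ** k) ** 2 for k in range(n_terms))

def closed_form(beta):
    """Standard closed form of the same sum: (1 - beta) / (1 + beta)."""
    return (1 - beta) / (1 + beta)

for beta in (0.0, 0.9, 0.99):
    print(beta, ema_noise_variance_factor(beta), closed_form(beta))
```

With `beta = 0.9` only about 5% of the noise variance survives, and with `beta = 0.99` about 0.5%. The same memory that causes lag in Scenario A is what buys this reduction.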

The "Burn-In" and the "Floor"

The paper also explains why Adam sometimes takes a long time to get going (the "burn-in" period) and why it never gets perfectly close to the target (the "floor").

The Burn-In: When Adam starts, its "backpack" is empty. It has to fill it up with data before it can use its memory effectively. During this time, it might actually perform worse than SGD.
The Floor: Even after a long time, Adam can't get perfectly close to a moving target. The paper splits the distance to the target into four parts:
1. Starting Position: Where you began. This part fades as the run goes on, so the floor comes from the other three.
2. Target Speed: How fast the target is running (drift).
3. Memory Lag: How much the "backpack" holds onto the past (controlled by a setting called $\beta_1$).
4. Map Instability: How much the "terrain map" fluctuates (controlled by settings called $\beta_2$ and $\epsilon$).
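To see why target speed alone creates a floor, consider the simplest possible case: plain gradient descent (not Adam), no noise, and a one-dimensional quadratic whose minimizer moves a fixed distance every step. In this toy setup, chosen here for illustration and not taken from the paper, the runner settles at a constant distance of `drift_per_step / eta` behind the target. The function name `steady_state_lag` is hypothetical.

```python
def steady_state_lag(eta, drift_per_step, n_steps=2_000):
    """Distance behind a linearly drifting target after many steps.

    Objective at step t: 0.5 * (theta - target_t)**2 with
    target_t = drift_per_step * t, so the gradient is theta - target_t.
    """
    theta = 0.0
    for t in range(n_steps):
        target = drift_per_step * t
        theta -= eta * (theta - target)  # noiseless gradient step
    # Compare against where the target stands after the last step.
    return drift_per_step * n_steps - theta

print(steady_state_lag(eta=0.1, drift_per_step=0.01))  # approximately 0.1
print(steady_state_lag(eta=0.5, drift_per_step=0.01))  # approximately 0.02
```

The starting position no longer matters after enough steps, but the gap never closes: it shrinks only if the target slows down or the step size grows, and a larger step size would amplify noise in a noisy setting.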

The "Stabilizer" Knob ( $\epsilon$ )

One of the most practical findings is about a specific setting in Adam called $\epsilon$ (epsilon).

The Analogy: Think of $\epsilon$ as a "shock absorber" or a "dampener" on the runner's shoes.
The Finding: The paper explains why increasing $\epsilon$ helps Adam when the world is changing (drift).
- A small $\epsilon$ makes the runner very sensitive to the "terrain map." If the map glitches, the runner stumbles.
- A large $\epsilon$ acts as a buffer. It stops the runner from overreacting to small, noisy changes in the map. This makes the runner more stable when the target is moving, preventing them from getting thrown off balance by the adaptive mechanism itself.
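A small numeric sketch shows the buffering effect. In textbook Adam the step is scaled by 1 / (sqrt(v) + ε), where v is the second-moment estimate. The code compares that scale for two values of v, one tiny and one moderate, under a small and a large ε. The helper names and the specific numbers are illustrative assumptions, not recommended settings.

```python
import math

def step_scale(v, eps):
    """Adam's step multiplier 1 / (sqrt(v) + eps) for a second moment v."""
    return 1.0 / (math.sqrt(v) + eps)

def scale_spread(v_low, v_high, eps):
    """How many times larger the step scale is at v_low than at v_high."""
    return step_scale(v_low, eps) / step_scale(v_high, eps)

# Hypothetical case: the "terrain map" swings between 1e-8 and 1e-2.
spread_small_eps = scale_spread(1e-8, 1e-2, eps=1e-8)
spread_large_eps = scale_spread(1e-8, 1e-2, eps=1e-1)
print(spread_small_eps)  # close to 1000
print(spread_large_eps)  # close to 2
```

With a tiny ε, a swing in the map changes the step scale by a factor of about a thousand; with a large ε the same swing changes it by about a factor of two. The price is that a large ε also mutes the adaptivity that makes Adam different from SGD.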

Summary

The paper provides a mathematical "rulebook" for when to use which runner:

If your data is changing rapidly (high drift): Don't use Adam's heavy memory. Use SGD (or a version of Adam with less memory) so you can react quickly.
If your data is noisy but stable (high noise): Use Adam. Its memory will help you ignore the noise and find the true path.
If you must use Adam in a changing world: You might need to tweak the "shock absorber" ( $\epsilon$ ) to stop the algorithm from getting too jittery.

The authors conclude that Adam isn't "bad"; it's just that its superpower (memory) becomes a weakness when the environment changes too fast for that memory to keep up.

Technical Summary: Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

Problem Statement
This paper addresses the theoretical behavior of adaptive gradient methods, specifically Adam, under non-stationary stochastic objectives. Unlike the standard stationary setting where the goal is convergence to a fixed minimizer, this work considers a time-varying optimization problem where the objective function $G_t(\theta)$ changes over time due to a drifting distribution $\Pi_t$ . The central question is: When does Adam's adaptive preconditioning improve tracking of a moving minimizer compared to vanilla Stochastic Gradient Descent (SGD), and when does its momentum-based memory become detrimental?

While empirical evidence suggests Adam can suffer from "plasticity loss" or instability under distribution shifts, a precise theoretical characterization of these failure modes and the specific role of Adam's hyperparameters ( $\beta_1, \beta_2, \epsilon$ ) in non-stationary regimes has been lacking.

Methodology and Framework
The authors analyze the Adam algorithm within a stochastic predictability framework, where the target minimizer $\theta^*_t$ is a predictable process adapted to the filtration $\mathcal{F}_t$ . The analysis is divided into two primary regimes:

Euclidean Tracking under Adaptive Strong Monotonicity: The authors derive finite-time bounds on the tracking error $\|\theta_t - \theta^*_t\|$ by imposing a strong monotonicity condition on the predictable proxy of the Adam-preconditioned mean-gradient operator. This approach separates the predictable geometry of the problem from the stochastic fluctuations of the realized preconditioner.
Projected Stationarity under General Preconditioning: Without assuming strong monotonicity, the authors establish high-probability bounds on the average projected stationarity gap. This generalizes the analysis to non-convex settings and constrained optimization, reducing to standard gradient-norm guarantees when constraints are inactive.

Key technical innovations include:

Predictable Proxy Construction: To handle the fact that the Adam preconditioner $P_{t+1}$ depends on the fresh sample $X_{t+1}$ (making it non-predictable), the authors construct a predictable proxy $\tilde{P}_{t+1}$ using the conditional expectation of the second moment. This allows the derivation of contraction conditions that do not rely on optional stopping arguments.
Error Decomposition: The tracking error is rigorously decomposed into four distinct components: initialization decay, objective drift, first-moment tracking error (governed by $\beta_1$ ), and preconditioner perturbation (governed by $\beta_2$ and $\epsilon$ ).
Concentration Inequalities: The analysis utilizes conditional $\Psi_\alpha$ -Orlicz norms and Freedman-type martingale inequalities to derive high-probability bounds that hold uniformly over the time horizon.

Key Contributions and Results

Finite-Time Tracking Bounds: The paper provides explicit high-probability bounds for Adam that decompose the error into interpretable terms. The bounds reveal that the tracking floor is determined by a tradeoff between the noise reduction provided by momentum and the lag introduced by stale gradient information.
The Noise–Drift Tradeoff: The central theoretical finding is a sharp tradeoff between noise-dominated and drift-dominated regimes:
- Noise-Dominated Regimes: When stochastic gradient noise is high, Adam's first-moment averaging (controlled by $\beta_1$ ) and adaptive preconditioning reduce the high-probability tracking floor compared to SGD.
- Drift-Dominated Regimes: When the objective drifts rapidly, the memory bias induced by $\beta_1$ and the perturbations in the second-moment preconditioner (induced by $\beta_2$ ) compound the cost of non-stationarity. In these regimes, vanilla SGD, which lacks this memory, achieves a smaller tracking floor by adapting more quickly to the moving target.
Hyperparameter Characterization: The bounds explicitly delineate the roles of Adam's hyperparameters:
- $\beta_1$ (First Moment): Controls a bias-variance tradeoff. Large $\beta_1$ suppresses noise but amplifies memory bias, making it harmful under rapid drift.
- $\beta_2$ (Second Moment): Governs a transient-floor tradeoff. Large $\beta_2$ lowers the asymptotic preconditioner perturbation floor but lengthens the transient "burn-in" period.
- $\epsilon$ (Stabilization): The analysis provides a theoretical mechanism for the empirical observation that increasing $\epsilon$ stabilizes Adam under task changes. Larger $\epsilon$ dampens the variability of the adaptive second-moment process, reducing the preconditioner perturbation term at the cost of slower adaptation to drift.
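The transient side of the $\beta_2$ tradeoff can be illustrated with the basic arithmetic of exponential moving averages. An average with decay $\beta$ weights a sample from k steps ago by a factor proportional to $\beta^k$, so its effective window is about 1/(1 - β) steps. The sketch below computes that window and the number of steps until an old value's weight has decayed to 1%. The function names and the 1% threshold are assumptions for illustration; the paper's burn-in time may be defined differently.

```python
import math

def ema_memory_length(beta):
    """Effective averaging window of an EMA, about 1 / (1 - beta) steps."""
    return 1.0 / (1.0 - beta)

def steps_to_forget(beta, tol=0.01):
    """Smallest k such that beta**k <= tol (hypothetical 1% threshold)."""
    return math.ceil(math.log(tol) / math.log(beta))

for beta2 in (0.9, 0.999):
    print(beta2, ema_memory_length(beta2), steps_to_forget(beta2))
```

For `beta2 = 0.9` an old value is essentially forgotten after 44 steps, while for the common default `beta2 = 0.999` it takes 4603 steps. The longer window gives a steadier second-moment estimate but a slower transient.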
Projected Stationarity Guarantees: The authors extend these insights to general non-convex, constrained settings, proving that the same qualitative error structure (drift, first-moment bias, second-moment perturbation) persists even without strong monotonicity.

Significance and Claims
The paper claims to provide the first finite-time theoretical analysis of Adam under non-stationary stochastic objectives. Its significance lies in:

Resolving Empirical Instability: It offers a theoretical explanation for why Adam degrades under distribution shift (e.g., in continual learning) and why specific hyperparameter adjustments (like increasing $\epsilon$ ) stabilize it.
Optimizer Selection: It delineates precise conditions under which adaptive methods are provably superior to SGD versus when they are provably suboptimal, moving beyond heuristic advice.
Bridging Theory and Practice: The theoretical bounds align with numerical experiments across strongly convex least squares, MLP regression, phase retrieval, and matrix factorization, confirming that SGD outperforms Adam in high-drift settings while Adam excels in high-noise settings.

The authors note limitations, specifically the reliance on bounded-gradient assumptions to control preconditioner perturbations pathwise and the lack of minimax lower bounds for Adam in this setting, suggesting these as directions for future work. However, the current work establishes a rigorous framework for understanding the "adapt or forget" dilemma in adaptive optimization.

Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization