Adam Converges Without Any Modification On Update Rules

This paper resolves concerns about Adam's divergence by proving that the optimizer converges when the hyperparameters (β1, β2) are tuned according to problem-specific conditions, particularly batch size, revealing a novel phase transition in the hyperparameter space and offering practical guidelines for improving large language model training.

Yushun Zhang, Bingran Li, Congliang Chen, Zhi-Quan Luo, Ruoyu Sun

Published 2026-03-03

The Great Adam Debate: Why the "Perfect" AI Tuner Sometimes Fails (and How to Fix It)

Imagine you are training a giant, complex robot (an AI model) to learn how to walk. To teach it, you use a very popular, high-tech coach named Adam. Adam is the default coach for almost every modern AI, from chatbots to image generators. He is famous for being fast and efficient.

However, a few years ago, some researchers found a scary flaw in Adam's logic. They built a specific, tricky obstacle course and showed that if Adam tried to run it with his standard settings, he would get dizzy, spin in circles, and eventually run off a cliff (mathematically, the numbers would "diverge" to infinity). This made everyone nervous: If Adam can fail on a simple course, is he safe for our giant robots?

This paper says: "Don't panic. Adam is fine, but you have to tune his settings correctly for the specific course you are running."

Here is the breakdown of their discovery using simple analogies.


1. The Mix-Up: "The Course vs. The Coach"

The confusion comes down to a chicken-and-egg question: do you pick the optimizer's settings first, or the problem first?

  • The Old Study (Reddi et al.): They picked a set of settings for Adam first (let's call them Speed and Memory). Then, they looked around and said, "Hey, I found a weird obstacle course where these specific settings make Adam crash!"
    • The Flaw: In the real world, we don't pick a course after picking our settings. We have a specific problem (like training a language model), and then we tune the settings to fit it.
  • The Real World: We have a fixed problem. We try different settings. Sometimes it works, sometimes it doesn't.

The authors realized the old study was like saying, "If you drive a Ferrari at 200mph, you will crash." That's true, but only if you are driving on a dirt path! If you are on a race track, 200mph is perfect. The problem wasn't the car; it was the mismatch between the car's settings and the road.

2. The Two Settings: Speed (β1) and Memory (β2)

Adam has two main dials you can turn:

  1. β1 (The "Momentum" or "Speed"): How much does the robot remember its immediate past steps? (Like how fast you are currently running).
  2. β2 (The "Memory" or "History"): How much does the robot remember its entire history of steps? (Like how much it remembers the bumps and turns from the last hour).
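Under the hood, the two dials are just the decay rates of two exponential moving averages. Here is a minimal, illustrative sketch of a single Adam step on one scalar parameter (our own simplification for exposition, not the paper's code):

```python
def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update on a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad       # "Speed": short-term average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # "Memory": long-term average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v

# One step from param=1.0 with gradient 0.5:
p, m, v = adam_step(1.0, 0.5, m=0.0, v=0.0, t=1)
```

Notice that the step size is divided by the square root of `v`: when β2 is close to 1, `v` changes slowly, so a single noisy gradient barely moves the step size; when β2 is low, one bumpy gradient can swing it sharply.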

The paper discovered a Phase Transition. Imagine a map where the X-axis is Speed and the Y-axis is Memory.

  • The Danger Zone (Red Region): If your Memory (β2) is too low, Adam gets confused. He forgets the big picture and starts reacting wildly to every tiny bump in the road. He spins out of control and runs off the cliff.
  • The Safe Zone (Blue Region): If you turn up the Memory (β2) high enough, Adam becomes stable. He remembers the long-term trends and ignores the tiny, distracting bumps. He walks steadily toward the goal.

The Big Discovery: There is a specific "tipping point" line on this map. Below the line, Adam crashes. Above the line, Adam converges (succeeds).

3. The Secret Ingredient: Batch Size

Here is the most practical part of the paper. The "tipping point" line isn't fixed; it moves depending on how you train your AI.

  • Small Batches: If you train your AI using small chunks of data at a time (Small Batch Size), the road is very bumpy and noisy. To handle this, you need High Memory (High β2). You need the robot to look back further to smooth out the noise.
  • Large Batches: If you use huge chunks of data, the road is smoother. You can get away with lower Memory settings.

The Analogy:
Imagine walking through a foggy forest.

  • Small Batch (Foggy): You can only see a few steps ahead. If you only remember the last step (Low β2), you might trip over a root. You need to remember the path from 10 minutes ago (High β2) to know where the safe path is.
  • Large Batch (Clear Day): You can see the whole forest. You don't need to remember as far back to stay on track.
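The fog analogy can be checked numerically. The toy experiment below (our own illustration, not an experiment from the paper) feeds a noisy stream of "gradients" into Adam's second-moment average: with a higher β2, the running estimate fluctuates far less.

```python
import random

def second_moment_estimates(grads, beta2):
    """Bias-corrected moving average of squared gradients, as Adam maintains it."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g ** 2
        out.append(v / (1 - beta2 ** t))
    return out

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

random.seed(0)
# Small batches ~ noisy gradients: true gradient 1.0 plus heavy noise.
noisy_grads = [1.0 + random.gauss(0.0, 0.5) for _ in range(2000)]

low_memory  = second_moment_estimates(noisy_grads, beta2=0.9)[500:]    # forgets fast
high_memory = second_moment_estimates(noisy_grads, beta2=0.999)[500:]  # long memory
```

Discarding the first 500 steps as warm-up, the high-β2 estimate is much steadier than the low-β2 one, which is exactly why small (noisy) batches want more Memory.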

4. What Should You Do? (The Practical Advice)

The paper gives a clear recipe for AI engineers, especially those training massive models like LLMs (Large Language Models):

  1. If Adam is failing or unstable: Don't change the algorithm! Just turn up the Memory dial (β2).
  2. The Rule of Thumb: The smaller your batch size, the higher you need to set β2.
  • Example: If you are training with a tiny batch size, try setting β2 to 0.999 (the common framework default) or even 0.9995, rather than a lower value like 0.99.
  3. Keep Speed (β1) in check: Make sure your Speed dial (β1) isn't too high compared to your Memory. A good rule is: Speed should be less than the square root of Memory.
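As a concrete sketch, the recipe above can be encoded in two small helpers. `in_safe_zone` encodes the β1 < √β2 rule of thumb directly; the exact thresholds in `suggest_beta2` are hypothetical numbers of our own, illustrating only the qualitative rule (smaller batch → higher β2), not formulas from the paper:

```python
import math

def in_safe_zone(beta1, beta2):
    """Rule of thumb: keep Speed (beta1) below the square root of Memory (beta2)."""
    return 0.0 <= beta1 < math.sqrt(beta2)

def suggest_beta2(batch_size):
    """Hypothetical mapping from batch size to beta2, for illustration only:
    noisier (smaller) batches get a longer memory."""
    if batch_size < 64:
        return 0.9995
    if batch_size < 1024:
        return 0.999
    return 0.99
```

For example, the widely used default pair (β1, β2) = (0.9, 0.999) sits inside the safe zone, since 0.9 < √0.999 ≈ 0.9995.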

5. Why This Matters

Before this paper, people were scared that Adam was fundamentally broken because of those "divergence" examples. They started inventing complex, modified versions of Adam to fix it.

This paper says: "You don't need a new car. You just need to drive the existing car correctly."

  • Theoretical Win: They proved mathematically that if you pick the right settings for your specific problem, Adam will always converge. It won't run off the cliff.
  • Real World Win: They showed that many successful AI models (like GPT-3 and Llama) were already using settings in the "Safe Zone" (High β2), which is why they worked so well, even though the theory said they should have failed.

Summary

Think of Adam as a high-performance sports car.

  • Old Theory: "This car crashes!" (Because they tested it on a dirt road with racing tires).
  • This Paper: "The car is fine! Just make sure you use the right tires for the road. If the road is bumpy (small batch size), you need high-traction tires (High β2). If you do that, the car will win the race every time."

The authors have given us the map to the "Safe Zone," ensuring that the AI revolution can keep driving forward without crashing.
