Adam Converges Without Any Modification On Update Rules

This paper resolves concerns about Adam's divergence by proving that the optimizer converges when the hyperparameters (β1, β2) are tuned according to problem-specific conditions, particularly batch size, revealing a novel phase transition in the hyperparameter space and offering practical guidelines for improving large language model training.

Yushun Zhang, Bingran Li, Congliang Chen, Zhi-Quan Luo, Ruoyu Sun

Published 2026-03-03

The Great Adam Debate: Why the "Perfect" AI Tuner Sometimes Fails (and How to Fix It)

Imagine you are training a giant, complex robot (an AI model) to learn how to walk. To teach it, you use a very popular, high-tech coach named Adam. Adam is the default coach for almost every modern AI, from chatbots to image generators. He is famous for being fast and efficient.

However, a few years ago, some researchers found a scary flaw in Adam's logic. They built a specific, tricky obstacle course and showed that if Adam tried to run it with his standard settings, he would get dizzy, spin in circles, and eventually run off a cliff (mathematically, the numbers would "diverge" to infinity). This made everyone nervous: If Adam can fail on a simple course, is he safe for our giant robots?

This paper says: "Don't panic. Adam is fine, but you have to tune his settings correctly for the specific course you are running."

Here is the breakdown of their discovery using simple analogies.


1. The Mix-Up: "The Course vs. The Coach"

The confusion comes down to a chicken-and-egg question: do you pick the optimizer's settings first, or the problem first?

  • The Old Study (Reddi et al.): They picked a set of settings for Adam first (let's call them Speed and Memory). Then, they looked around and said, "Hey, I found a weird obstacle course where these specific settings make Adam crash!"
    • The Flaw: In the real world, we don't pick a course after picking our settings. We have a specific problem (like training a language model), and then we tune the settings to fit it.
  • The Real World: We have a fixed problem. We try different settings. Sometimes it works, sometimes it doesn't.

The authors realized the old study was like saying, "If you drive a Ferrari at 200mph, you will crash." That's true, but only if you are driving on a dirt path! If you are on a race track, 200mph is perfect. The problem wasn't the car; it was the mismatch between the car's settings and the road.

2. The Two Settings: Speed (β1) and Memory (β2)

Adam has two main dials you can turn:

  1. β1 (The "Momentum" or "Speed"): How much does the robot remember its immediate past steps? (Like how fast you are currently running).
  2. β2 (The "Memory" or "History"): How much does the robot remember its entire history of steps? (Like how much it remembers the bumps and turns from the last hour).
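Under the hood, the two dials are just the decay rates of two exponential moving averages. Here is a minimal, illustrative sketch of a single Adam step on one scalar parameter (our own simplification for exposition, not the paper's code):

```python
def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """One Adam update on a single scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad       # "Speed": short-term average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # "Memory": long-term average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v

# One step from param=1.0 with gradient 0.5:
p, m, v = adam_step(1.0, 0.5, m=0.0, v=0.0, t=1)
```

Notice that the step size is divided by the square root of `v`: when β2 is close to 1, `v` changes slowly, so a single noisy gradient barely moves the step size; when β2 is low, one bumpy gradient can swing it sharply.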

The paper discovered a Phase Transition. Imagine a map where the X-axis is Speed and the Y-axis is Memory.

  • The Danger Zone (Red Region): If your Memory (β2) is too low, Adam gets confused. He forgets the big picture and starts reacting wildly to every tiny bump in the road. He spins out of control and runs off the cliff.
  • The Safe Zone (Blue Region): If you turn up the Memory (β2) high enough, Adam becomes stable. He remembers the long-term trends and ignores the tiny, distracting bumps. He walks steadily toward the goal.

The Big Discovery: There is a specific "tipping point" line on this map. Below the line, Adam crashes. Above the line, Adam converges (succeeds).

3. The Secret Ingredient: Batch Size

Here is the most practical part of the paper. The "tipping point" line isn't fixed; it moves depending on how you train your AI.

  • Small Batches: If you train your AI using small chunks of data at a time (Small Batch Size), the road is very bumpy and noisy. To handle this, you need High Memory (High β2). You need the robot to look back further to smooth out the noise.
  • Large Batches: If you use huge chunks of data, the road is smoother. You can get away with lower Memory settings.

The Analogy:
Imagine walking through a foggy forest.

  • Small Batch (Foggy): You can only see a few steps ahead. If you only remember the last step (Low β2), you might trip over a root. You need to remember the path from 10 minutes ago (High β2) to know where the safe path is.
  • Large Batch (Clear Day): You can see the whole forest. You don't need to remember as far back to stay on track.
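The fog analogy can be checked numerically. The toy experiment below (our own illustration, not an experiment from the paper) feeds a noisy stream of "gradients" into Adam's second-moment average: with a higher β2, the running estimate fluctuates far less.

```python
import random

def second_moment_estimates(grads, beta2):
    """Bias-corrected moving average of squared gradients, as Adam maintains it."""
    v, out = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g ** 2
        out.append(v / (1 - beta2 ** t))
    return out

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

random.seed(0)
# Small batches ~ noisy gradients: true gradient 1.0 plus heavy noise.
noisy_grads = [1.0 + random.gauss(0.0, 0.5) for _ in range(2000)]

low_memory  = second_moment_estimates(noisy_grads, beta2=0.9)[500:]    # forgets fast
high_memory = second_moment_estimates(noisy_grads, beta2=0.999)[500:]  # long memory
```

Discarding the first 500 steps as warm-up, the high-β2 estimate is much steadier than the low-β2 one, which is exactly why small (noisy) batches want more Memory.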

4. What Should You Do? (The Practical Advice)

The paper gives a clear recipe for AI engineers, especially those training massive models like LLMs (Large Language Models):

  1. If Adam is failing or unstable: Don't change the algorithm! Just turn up the Memory dial (β2).
  2. The Rule of Thumb: The smaller your batch size, the higher you need to set β2.
  • Example: If you are training with a tiny batch size, try setting β2 to 0.999 (the common framework default) or even 0.9995, rather than a lower value like 0.99.
  3. Keep Speed (β1) in check: Make sure your Speed dial (β1) isn't too high compared to your Memory. A good rule is: Speed should be less than the square root of Memory.
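As a concrete sketch, the recipe above can be encoded in two small helpers. `in_safe_zone` encodes the β1 < √β2 rule of thumb directly; the exact thresholds in `suggest_beta2` are hypothetical numbers of our own, illustrating only the qualitative rule (smaller batch → higher β2), not formulas from the paper:

```python
import math

def in_safe_zone(beta1, beta2):
    """Rule of thumb: keep Speed (beta1) below the square root of Memory (beta2)."""
    return 0.0 <= beta1 < math.sqrt(beta2)

def suggest_beta2(batch_size):
    """Hypothetical mapping from batch size to beta2, for illustration only:
    noisier (smaller) batches get a longer memory."""
    if batch_size < 64:
        return 0.9995
    if batch_size < 1024:
        return 0.999
    return 0.99
```

For example, the widely used default pair (β1, β2) = (0.9, 0.999) sits inside the safe zone, since 0.9 < √0.999 ≈ 0.9995.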

5. Why This Matters

Before this paper, people were scared that Adam was fundamentally broken because of those "divergence" examples. They started inventing complex, modified versions of Adam to fix it.

This paper says: "You don't need a new car. You just need to drive the existing car correctly."

  • Theoretical Win: They proved mathematically that if you pick the right settings for your specific problem, Adam will always converge. It won't run off the cliff.
  • Real World Win: They showed that many successful AI models (like GPT-3 and Llama) were already using settings in the "Safe Zone" (High β2), which is why they worked so well, even though the theory said they should have failed.

Summary

Think of Adam as a high-performance sports car.

  • Old Theory: "This car crashes!" (Because they tested it on a dirt road with racing tires).
  • This Paper: "The car is fine! Just make sure you use the right tires for the road. If the road is bumpy (small batch size), you need high-traction tires (High β2). If you do that, the car will win the race every time."

The authors have given us the map to the "Safe Zone," ensuring that the AI revolution can keep driving forward without crashing.
