A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization

This paper presents the first theoretical framework for analyzing the convergence of adaptive optimizers such as Adam and Muon under floating-point quantization. It shows that both maintain near-full-precision convergence rates as long as the mantissa length grows only logarithmically with the number of training steps, while revealing that Adam is specifically sensitive to quantization errors in the weights and second moment, where Muon is more robust.

Xuan Tang, Jichu Li, Difan Zou

Published 2026-03-03

The Big Picture: Why Are We Doing This?

Imagine you are trying to train a giant robot brain (a Large Language Model, or LLM) to write poetry, code, or answer questions. To make this brain smart, you have to show it billions of examples.

However, the brain is so huge that it doesn't fit in your computer's memory. To solve this, engineers use a trick called Low-Precision Training.

  • The Analogy: Think of the robot's brain as a library of books.
    • Full Precision (FP32): Every book is written in high-definition, with perfect spelling, grammar, and tiny details. It's heavy and takes up a lot of shelf space.
    • Low Precision (FP8/BF16): To save space, you rewrite the books with fewer details. You round off numbers and drop tiny decimal points. It's like writing a summary instead of a novel. It's much lighter and faster to read, but you risk losing some nuance.
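To make "fewer details" concrete, here is a small Python sketch (our illustration, not code from the paper) that keeps only a chosen number of mantissa bits of an FP32 number. The mantissa widths in the comments are the real formats' widths; the helper itself is a toy.

```python
import struct

def quantize_mantissa(x: float, mantissa_bits: int) -> float:
    """Round an FP32 value to `mantissa_bits` bits of mantissa (toy model).

    FP32 keeps 23 mantissa bits, BF16 keeps 7, FP8 (E4M3) keeps 3.
    """
    if x == 0.0:
        return 0.0
    drop = 23 - mantissa_bits
    if drop <= 0:
        return x
    # Reinterpret the float as its raw 32 bits, round the mantissa to
    # nearest, and zero out the dropped low-order bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits = (bits + (1 << (drop - 1))) & ~((1 << drop) - 1)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pi = 3.14159265
print(quantize_mantissa(pi, 23))  # full FP32 detail
print(quantize_mantissa(pi, 7))   # BF16-like: 3.140625
print(quantize_mantissa(pi, 3))   # FP8-like:  3.25
```

The fewer mantissa bits you keep, the coarser the grid of representable numbers — exactly the "summary instead of a novel" trade-off.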

The big question the paper answers is: "If we throw away so much detail (quantization), why does the robot brain still learn effectively? Does it eventually stop learning because the numbers are too 'fuzzy'?"

The Problem: The "Fuzzy" Math

In the past, mathematicians proved that these robots would learn well only if the numbers were perfect. But in the real world, we use "fuzzy" numbers to save space.

Previous theories tried to explain this by assuming the errors were random and canceled each other out (like flipping a coin). But in reality, the errors in low-precision training are systematic. They happen because of how the computer hardware stores numbers, not just random noise.
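A quick way to see the difference between random and systematic error: truncation always rounds in the same direction, so the error accumulates as drift instead of cancelling. A toy Python sketch (ours, with a made-up grid size, not the paper's model):

```python
def truncate(x: float, step: float = 1 / 256) -> float:
    """Keep only multiples of `step` (rounds toward zero: a *biased* error)."""
    return int(x / step) * step

total_exact, total_fuzzy = 0.0, 0.0
for _ in range(1000):
    total_exact += 0.013
    total_fuzzy = truncate(total_fuzzy + 0.013)

print(total_exact)  # ~13.0
print(total_fuzzy)  # 11.71875: the biased error drifts low, it never averages out
```

A coin-flip error model would predict the two sums stay close; the systematic one loses a fixed sliver every single step.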

This paper says: "Stop guessing. Let's build a theory that actually matches how modern computers work."

The Two Main Characters: Adam and Muon

The paper focuses on two specific "teachers" (optimizers) that guide the robot brain during training:

  1. Adam: The veteran teacher. It's been around for a decade and is used in almost every AI model today. It's smart but can be a bit sensitive.
  2. Muon: The new, shiny teacher. It's a newer method that uses a special mathematical trick (orthogonalizing the update, based on the SVD) to look at the data from a different angle.

The researchers wanted to know: If we use "fuzzy" numbers, which teacher is better at keeping the robot on track?

The Core Discovery: The "Magnifying Glass" Effect

The researchers built a new mathematical framework to track how errors spread through the training process. They found something fascinating about Adam:

  • The Analogy: Imagine Adam is holding a magnifying glass over the robot's mistakes.
    • Adam uses a setting called β₂ (beta-two), which is usually set very close to 1 (like 0.999). This setting tells Adam to remember past mistakes for a very long time.
    • The problem is that when you combine "remembering everything" with "fuzzy numbers," the magnifying glass gets stuck. Small rounding errors in the past get amplified over and over again.
    • The Result: Adam is very sensitive to "fuzziness" in its memory (the second moment). If the numbers aren't precise enough, the magnifying glass makes the errors huge, and the robot gets confused.
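The "magnifying glass" can be seen in a few lines of Python. This is a toy model of ours, not the paper's analysis: we inject a tiny systematic relative error into Adam's second-moment average v = β₂·v + (1−β₂)·g² every step and compare a long-memory β₂ with a short-memory one.

```python
def second_moment(grads, beta2, err=0.0):
    """Adam's second-moment EMA, with an optional systematic per-step
    relative error `err` modeling biased low-precision rounding (toy)."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        v *= 1.0 + err  # the "fuzziness": same direction every step
    return v

grads = [1.0] * 10_000
long_mem  = second_moment(grads, 0.999, err=1e-4) / second_moment(grads, 0.999)
short_mem = second_moment(grads, 0.9,   err=1e-4) / second_moment(grads, 0.9)
print(long_mem)   # ~1.11: a 0.01% per-step error inflates v by about 11%
print(short_mem)  # ~1.001: with short memory the same error stays near 0.1%
```

With β₂ = 0.999 the average remembers roughly the last 1/(1−β₂) = 1000 steps, so a per-step bias compounds about a thousandfold before it decays — the magnifying glass in action.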

What about Muon?

  • The Analogy: Muon is like a teacher who doesn't rely on a magnifying glass. Instead, it uses a "sieve" (a mathematical filter) to look at the data.
  • Because Muon doesn't rely on that specific "remember everything" mechanism, it doesn't amplify the small rounding errors as much.
  • The Result: Muon is much more robust. It can handle "fuzzier" numbers (lower precision) without getting confused.
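Why does the "sieve" help? A minimal sketch (our toy, reduced to the diagonal special case — real Muon orthogonalizes full matrices, e.g. via Newton–Schulz iterations): Muon's update keeps only the orthogonal "direction" part of the gradient matrix, so rounding the magnitudes can leave the update completely unchanged.

```python
import math

def msign_diagonal(diag):
    """Orthogonal (polar) factor of a diagonal matrix: the elementwise sign.
    (Toy special case of the U·Vᵀ factor from the SVD G = U·S·Vᵀ.)"""
    return [math.copysign(1.0, d) for d in diag]

g       = [0.3141592653589793, -0.0027182818284590]  # "exact" gradient entries
g_fuzzy = [0.3125,             -0.00268554687500]    # same entries, coarsely rounded

print(msign_diagonal(g))        # [1.0, -1.0]
print(msign_diagonal(g_fuzzy))  # [1.0, -1.0]: identical update despite rounding
```

Because the magnitudes are discarded rather than remembered, there is no long-lived state in which mantissa errors can pile up.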

The "Goldilocks" Rule for Precision

The paper proves a very specific rule for how precise the numbers need to be:

  • The Rule: You don't need perfect precision (infinite decimal places). You just need the "mantissa" (the part of the number that holds the significant digits) to be long enough.
  • The Catch: The required length only needs to grow logarithmically with the number of steps.
  • The Metaphor: Imagine you are walking a long path (training the model).
    • If you take 10 steps, you need a map with 1-inch details.
    • If you take 10,000 steps, you don't need a map with 1-millimeter details. You just need a map that is slightly more detailed than the 10-step one.
    • Conclusion: Standard low-precision formats (like BF16 or FP8) are actually perfectly fine for training massive models, provided you don't cut the precision too aggressively.
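The logarithmic rule is easy to tabulate. In the sketch below the constant c is made up purely for illustration (the paper derives the actual dependence): if the required mantissa length scales like c·log₂(T) for T training steps, then multiplying the number of steps by 1000 adds only about 10·c bits.

```python
import math

def required_mantissa_bits(steps: int, c: float = 1.0) -> int:
    """Illustrative log-scaling rule: bits ~ c * log2(steps).
    The constant c is a placeholder, not the paper's constant."""
    return math.ceil(c * math.log2(steps))

for t in (10, 10_000, 10_000_000):
    print(t, required_mantissa_bits(t))
# Each 1000x increase in steps adds only ~10 bits (at c = 1): precision
# requirements grow far more slowly than the training run itself.
```

That slow growth is exactly why a fixed format can serve runs of wildly different lengths.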

The Verdict: Why This Matters

  1. It Explains the Magic: For years, engineers have been using low-precision training because it works, but they didn't have a solid math proof for why. This paper provides that proof. It says, "It works because the errors don't pile up as badly as we thought, as long as you keep a few bits of precision."
  2. Adam vs. Muon: The theory explains why recent experiments show Muon is often more stable than Adam when using very low precision (like 4-bit or 8-bit). Adam gets jittery because of its "magnifying glass" (the β₂ parameter), while Muon stays calm.
  3. Future Hardware: This gives hardware engineers confidence. They can build chips that are even faster and smaller (using even lower precision) because they now know the mathematical limits of how low they can go before the AI stops learning.

Summary in One Sentence

This paper proves that while "fuzzy" math (low precision) introduces errors, modern AI optimizers like Adam and Muon can still learn effectively because the errors don't explode—though Adam is a bit more sensitive to the fuzziness than the newer, more robust Muon.
