A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization

This paper presents the first theoretical framework for analyzing the convergence of adaptive optimizers such as Adam and Muon under floating-point quantization. It shows that both maintain near-full-precision convergence rates as long as the mantissa length grows only logarithmically with the number of training steps, while revealing that Adam is specifically sensitive to quantization errors in the weights and second moment, where Muon is more robust.

Xuan Tang, Jichu Li, Difan Zou

Published 2026-03-03

The Big Picture: Why Are We Doing This?

Imagine you are trying to train a giant robot brain (a Large Language Model, or LLM) to write poetry, code, or answer questions. To make this brain smart, you have to show it billions of examples.

However, the brain is so huge that it doesn't fit in your computer's memory. To solve this, engineers use a trick called Low-Precision Training.

  • The Analogy: Think of the robot's brain as a library of books.
    • Full Precision (FP32): Every book is written in high-definition, with perfect spelling, grammar, and tiny details. It's heavy and takes up a lot of shelf space.
    • Low Precision (FP8/BF16): To save space, you rewrite the books with fewer details. You round off numbers and drop tiny decimal points. It's like writing a summary instead of a novel. It's much lighter and faster to read, but you risk losing some nuance.
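To make "fewer details" concrete, here is a small Python sketch (our illustration, not code from the paper) that keeps only a chosen number of mantissa bits of an FP32 number. The mantissa widths in the comments are the real formats' widths; the helper itself is a toy.

```python
import struct

def quantize_mantissa(x: float, mantissa_bits: int) -> float:
    """Round an FP32 value to `mantissa_bits` bits of mantissa (toy model).

    FP32 keeps 23 mantissa bits, BF16 keeps 7, FP8 (E4M3) keeps 3.
    """
    if x == 0.0:
        return 0.0
    drop = 23 - mantissa_bits
    if drop <= 0:
        return x
    # Reinterpret the float as its raw 32 bits, round the mantissa to
    # nearest, and zero out the dropped low-order bits.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits = (bits + (1 << (drop - 1))) & ~((1 << drop) - 1)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pi = 3.14159265
print(quantize_mantissa(pi, 23))  # full FP32 detail
print(quantize_mantissa(pi, 7))   # BF16-like: 3.140625
print(quantize_mantissa(pi, 3))   # FP8-like:  3.25
```

The fewer mantissa bits you keep, the coarser the grid of representable numbers — exactly the "summary instead of a novel" trade-off.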

The big question the paper answers is: "If we throw away so much detail (quantization), why does the robot brain still learn effectively? Does it eventually stop learning because the numbers are too 'fuzzy'?"

The Problem: The "Fuzzy" Math

In the past, mathematicians proved that these robots would learn well only if the numbers were perfect. But in the real world, we use "fuzzy" numbers to save space.

Previous theories tried to explain this by assuming the errors were random and canceled each other out (like flipping a coin). But in reality, the errors in low-precision training are systematic. They happen because of how the computer hardware stores numbers, not just random noise.
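A quick way to see the difference between random and systematic error: truncation always rounds in the same direction, so the error accumulates as drift instead of cancelling. A toy Python sketch (ours, with a made-up grid size, not the paper's model):

```python
def truncate(x: float, step: float = 1 / 256) -> float:
    """Keep only multiples of `step` (rounds toward zero: a *biased* error)."""
    return int(x / step) * step

total_exact, total_fuzzy = 0.0, 0.0
for _ in range(1000):
    total_exact += 0.013
    total_fuzzy = truncate(total_fuzzy + 0.013)

print(total_exact)  # ~13.0
print(total_fuzzy)  # 11.71875: the biased error drifts low, it never averages out
```

A coin-flip error model would predict the two sums stay close; the systematic one loses a fixed sliver every single step.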

This paper says: "Stop guessing. Let's build a theory that actually matches how modern computers work."

The Two Main Characters: Adam and Muon

The paper focuses on two specific "teachers" (optimizers) that guide the robot brain during training:

  1. Adam: The veteran teacher. It's been around for a decade and is used in almost every AI model today. It's smart but can be a bit sensitive.
  2. Muon: The new, shiny teacher. It's a newer method that uses a special mathematical trick (orthogonalizing the update, based on the SVD) to look at the data from a different angle.

The researchers wanted to know: If we use "fuzzy" numbers, which teacher is better at keeping the robot on track?

The Core Discovery: The "Magnifying Glass" Effect

The researchers built a new mathematical framework to track how errors spread through the training process. They found something fascinating about Adam:

  • The Analogy: Imagine Adam is holding a magnifying glass over the robot's mistakes.
    • Adam uses a setting called β₂ (beta-two), which is usually set very close to 1 (like 0.999). This setting tells Adam to remember past mistakes for a very long time.
    • The problem is that when you combine "remembering everything" with "fuzzy numbers," the magnifying glass gets stuck. Small rounding errors in the past get amplified over and over again.
    • The Result: Adam is very sensitive to "fuzziness" in its memory (the second moment). If the numbers aren't precise enough, the magnifying glass makes the errors huge, and the robot gets confused.
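The "magnifying glass" can be seen in a few lines of Python. This is a toy model of ours, not the paper's analysis: we inject a tiny systematic relative error into Adam's second-moment average v = β₂·v + (1−β₂)·g² every step and compare a long-memory β₂ with a short-memory one.

```python
def second_moment(grads, beta2, err=0.0):
    """Adam's second-moment EMA, with an optional systematic per-step
    relative error `err` modeling biased low-precision rounding (toy)."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        v *= 1.0 + err  # the "fuzziness": same direction every step
    return v

grads = [1.0] * 10_000
long_mem  = second_moment(grads, 0.999, err=1e-4) / second_moment(grads, 0.999)
short_mem = second_moment(grads, 0.9,   err=1e-4) / second_moment(grads, 0.9)
print(long_mem)   # ~1.11: a 0.01% per-step error inflates v by about 11%
print(short_mem)  # ~1.001: with short memory the same error stays near 0.1%
```

With β₂ = 0.999 the average remembers roughly the last 1/(1−β₂) = 1000 steps, so a per-step bias compounds about a thousandfold before it decays — the magnifying glass in action.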

What about Muon?

  • The Analogy: Muon is like a teacher who doesn't rely on a magnifying glass. Instead, it uses a "sieve" (a mathematical filter) to look at the data.
  • Because Muon doesn't rely on that specific "remember everything" mechanism, it doesn't amplify the small rounding errors as much.
  • The Result: Muon is much more robust. It can handle "fuzzier" numbers (lower precision) without getting confused.
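Why does the "sieve" help? A minimal sketch (our toy, reduced to the diagonal special case — real Muon orthogonalizes full matrices, e.g. via Newton–Schulz iterations): Muon's update keeps only the orthogonal "direction" part of the gradient matrix, so rounding the magnitudes can leave the update completely unchanged.

```python
import math

def msign_diagonal(diag):
    """Orthogonal (polar) factor of a diagonal matrix: the elementwise sign.
    (Toy special case of the U·Vᵀ factor from the SVD G = U·S·Vᵀ.)"""
    return [math.copysign(1.0, d) for d in diag]

g       = [0.3141592653589793, -0.0027182818284590]  # "exact" gradient entries
g_fuzzy = [0.3125,             -0.00268554687500]    # same entries, coarsely rounded

print(msign_diagonal(g))        # [1.0, -1.0]
print(msign_diagonal(g_fuzzy))  # [1.0, -1.0]: identical update despite rounding
```

Because the magnitudes are discarded rather than remembered, there is no long-lived state in which mantissa errors can pile up.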

The "Goldilocks" Rule for Precision

The paper proves a very specific rule for how precise the numbers need to be:

  • The Rule: You don't need perfect precision (infinite decimal places). You just need the "mantissa" (the part of the number that holds the significant digits) to be long enough.
  • The Catch: The required length only needs to grow logarithmically with the number of steps.
  • The Metaphor: Imagine you are walking a long path (training the model).
    • If you take 10 steps, you need a map with 1-inch details.
    • If you take 10,000 steps, you don't need a map with 1-millimeter details. You just need a map that is slightly more detailed than the 10-step one.
    • Conclusion: Standard low-precision formats (like BF16 or FP8) are actually perfectly fine for training massive models, provided you don't cut the precision too aggressively.
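The logarithmic rule is easy to tabulate. In the sketch below the constant c is made up purely for illustration (the paper derives the actual dependence): if the required mantissa length scales like c·log₂(T) for T training steps, then multiplying the number of steps by 1000 adds only about 10·c bits.

```python
import math

def required_mantissa_bits(steps: int, c: float = 1.0) -> int:
    """Illustrative log-scaling rule: bits ~ c * log2(steps).
    The constant c is a placeholder, not the paper's constant."""
    return math.ceil(c * math.log2(steps))

for t in (10, 10_000, 10_000_000):
    print(t, required_mantissa_bits(t))
# Each 1000x increase in steps adds only ~10 bits (at c = 1): precision
# requirements grow far more slowly than the training run itself.
```

That slow growth is exactly why a fixed format can serve runs of wildly different lengths.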

The Verdict: Why This Matters

  1. It Explains the Magic: For years, engineers have been using low-precision training because it works, but they didn't have a solid math proof for why. This paper provides that proof. It says, "It works because the errors don't pile up as badly as we thought, as long as you keep a few bits of precision."
  2. Adam vs. Muon: The theory explains why recent experiments show Muon is often more stable than Adam when using very low precision (like 4-bit or 8-bit). Adam gets jittery because of its "magnifying glass" (the β₂ parameter), while Muon stays calm.
  3. Future Hardware: This gives hardware engineers confidence. They can build chips that are even faster and smaller (using even lower precision) because they now know the mathematical limits of how low they can go before the AI stops learning.

Summary in One Sentence

This paper proves that while "fuzzy" math (low precision) introduces errors, modern AI optimizers like Adam and Muon can still learn effectively because the errors don't explode—though Adam is a bit more sensitive to the fuzziness than the newer, more robust Muon.
