Improved Convergence Rates of Muon Optimizer for Nonconvex Optimization

This paper establishes sharper convergence guarantees for the Muon optimizer by providing a direct, simplified analysis that achieves faster convergence rates under broader problem settings than existing restrictive theoretical frameworks.

Shuntaro Nagashima, Hideaki Iiduka

Published 2026-03-06

Imagine you are trying to navigate a massive, foggy mountain range to find the lowest valley (the perfect solution for an AI model). You have a team of hikers (the optimizer) trying to get there as fast as possible.

For a long time, the most popular guide was Adam (a very smart hiker who adjusts their step size based on how slippery the ground is under each foot). But recently, a new guide called Muon has arrived. Muon is special because it doesn't just look at the slope; it forces the hikers to walk in a perfectly straight, organized line, preventing them from tripping over each other or wandering in circles. This "orthogonal" walking style has been shown to work incredibly well in practice, especially for huge AI models.
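To make the "orthogonal walking style" concrete, here is a minimal sketch of a Muon-style update. The real Muon optimizer orthogonalizes the momentum matrix with a fast Newton–Schulz iteration; the SVD used below produces the same result and is clearer for illustration. The function names and hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def orthogonalize(m: np.ndarray) -> np.ndarray:
    """Replace a matrix by the nearest semi-orthogonal matrix.

    Muon does this with a Newton-Schulz iteration for speed; an SVD
    gives the same answer and is easier to read.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative Muon-style step (a sketch, not the exact code).

    1. Fold the raw gradient into the momentum buffer.
    2. Orthogonalize the momentum matrix.
    3. Move the weights in that orthogonalized direction.
    """
    momentum = beta * momentum + grad
    update = orthogonalize(momentum)
    return weight - lr * update, momentum
```

The key difference from Adam is visible in step 2: instead of rescaling each coordinate independently, the whole update matrix is snapped to an orthogonal direction before the step is taken.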

However, there was a problem: Nobody knew exactly why Muon worked so well, or how fast it would actually get to the bottom. The existing math explaining Muon was either too vague or relied on unrealistic assumptions (like assuming the mountain was perfectly smooth and had no hidden cliffs).

This paper is like a team of mathematicians who decided to go out into the fog, measure Muon's steps with a ruler, and write a new, clearer map. Here is what they found, explained simply:

1. The Old Maps vs. The New Map

Previous studies tried to explain Muon, but their maps had flaws:

  • Some said Muon was fast, but only if the mountain had a special shape (the "PL condition"), which rarely holds for the loss landscapes of real AI models.
  • Others said Muon was slow, or their math got stuck on a variable representing the size of the mountain, making the answer incomplete.
  • Basically, the old theories were like saying, "Muon works great, but only if you believe in magic," or "Muon is okay, but it might take forever."

The New Discovery: The authors of this paper created a new, simpler proof. They didn't need magic or special mountain shapes. They showed that Muon is mathematically guaranteed to find the bottom, and they proved it does so faster than previously thought.

2. The Secret Sauce: How to Walk Faster

The paper discovered that Muon's speed depends on two main things: how big your steps are (Learning Rate) and how many hikers are in your group (Batch Size).

They found three "Golden Rules" for the fastest journey:

  • Rule A: The "Big Group" Strategy.
    If you keep your step size steady but make your group of hikers (the batch size) grow larger and larger as you go, Muon gets incredibly fast. It's like realizing that a larger team can clear the path faster than a small one. If you double the group size every step, the speed improves dramatically.
  • Rule B: The "Shrinking Step" Strategy.
    If you start with big steps and slowly make them smaller (a "diminishing learning rate"), Muon still works well, but it needs a specific trick to be the fastest.
  • Rule C: The "Super Combo."
    The absolute fastest way to reach the valley is to combine shrinking steps with a rapidly growing group size. This combination allows Muon to converge (finish the job) at a rate of $1/T$ (where $T$ is the number of steps).
    • Analogy: Imagine driving a car. Old optimizers were like driving at a constant speed. Muon with this new strategy is like a car that starts fast, then as it gets closer to the destination, it slows down just enough to turn perfectly, while simultaneously adding more engines to the car to keep momentum.
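The three Golden Rules can be sketched as simple schedules. This is an illustrative sketch only: the base batch size `b0`, growth factor, base rate `lr0`, and the $1/\sqrt{t}$ decay form are assumed placeholders, not the paper's exact constants.

```python
import math

def batch_size(t: int, b0: int = 16, growth: float = 2.0) -> int:
    """Rule A / C: exponentially growing batch size, b_t = b0 * growth**t."""
    return int(b0 * growth ** t)

def learning_rate(t: int, lr0: float = 0.02) -> float:
    """Rule B / C: diminishing learning rate, lr_t = lr0 / sqrt(t + 1).

    The 1/sqrt(t) form is a common diminishing schedule; the paper's
    exact schedule may differ.
    """
    return lr0 / math.sqrt(t + 1)

# Rule C ("Super Combo"): use both schedules together at each step t.
for t in range(4):
    print(t, batch_size(t), round(learning_rate(t), 4))
```

In practice the batch size cannot double forever, so implementations cap it at whatever the hardware allows; the theory describes the idealized schedule.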

3. Why This Matters

Before this paper, if you wanted to use Muon, you had to guess the settings. You might have been using a "good enough" setting, but not the best one.

This paper gives you the instruction manual. It tells engineers:

"If you want Muon to be the fastest optimizer possible, don't just pick random numbers. Make your batch size grow exponentially (double it every step) and shrink your learning rate carefully. If you do this, you will achieve the fastest convergence rate the theory currently guarantees."

Summary in a Nutshell

  • The Problem: Muon is a great new optimizer, but we didn't have a solid math proof for why it's fast or how to tune it perfectly.
  • The Solution: The authors wrote a new, simpler proof that holds under standard, realistic assumptions (no special mountain shapes required).
  • The Result: They proved Muon can be faster than previously thought.
  • The Takeaway: To get the best performance, pair a shrinking learning rate with a rapidly growing batch size. It's the "secret recipe" for making AI training faster and more stable.

In short, this paper took a mysterious, high-performing tool and gave us the blueprints to use it at its absolute peak potential.