Improved Convergence Rates of Muon Optimizer for Nonconvex Optimization

This paper establishes sharper convergence guarantees for the Muon optimizer by providing a direct, simplified analysis that achieves faster convergence rates under broader problem settings than existing restrictive theoretical frameworks.

Shuntaro Nagashima, Hideaki Iiduka

Published 2026-03-06

Imagine you are trying to navigate a massive, foggy mountain range to find the lowest valley (the perfect solution for an AI model). You have a team of hikers (the optimizer) trying to get there as fast as possible.

For a long time, the most popular guide was Adam (a very smart hiker who adjusts their step size based on how slippery the ground is under each foot). But recently, a new guide called Muon has arrived. Muon is special because it doesn't just look at the slope; it forces the hikers to walk in a perfectly straight, organized line, preventing them from tripping over each other or wandering in circles. This "orthogonal" walking style has been shown to work incredibly well in practice, especially for huge AI models.
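To make the "orthogonal walking style" concrete, here is a minimal sketch of a Muon-style update. The real Muon optimizer orthogonalizes the momentum matrix with a fast Newton–Schulz iteration; the SVD used below produces the same result and is clearer for illustration. The function names and hyperparameter values are illustrative, not taken from the paper.

```python
import numpy as np

def orthogonalize(m: np.ndarray) -> np.ndarray:
    """Replace a matrix by the nearest semi-orthogonal matrix.

    Muon does this with a Newton-Schulz iteration for speed; an SVD
    gives the same answer and is easier to read.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One illustrative Muon-style step (a sketch, not the exact code).

    1. Fold the raw gradient into the momentum buffer.
    2. Orthogonalize the momentum matrix.
    3. Move the weights in that orthogonalized direction.
    """
    momentum = beta * momentum + grad
    update = orthogonalize(momentum)
    return weight - lr * update, momentum
```

The key difference from Adam is visible in step 2: instead of rescaling each coordinate independently, the whole update matrix is snapped to an orthogonal direction before the step is taken.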

However, there was a problem: Nobody knew exactly why Muon worked so well, or how fast it would actually get to the bottom. The existing math explaining Muon was either too vague or relied on unrealistic assumptions (like assuming the mountain was perfectly smooth and had no hidden cliffs).

This paper is like a team of mathematicians who decided to go out into the fog, measure Muon's steps with a ruler, and write a new, clearer map. Here is what they found, explained simply:

1. The Old Maps vs. The New Map

Previous studies tried to explain Muon, but their maps had flaws:

  • Some said Muon was fast, but only if the mountain had a special shape (the "PL condition"), which rarely holds for the loss landscapes of real AI models.
  • Others said Muon was slow, or their math got stuck on a variable representing the size of the mountain, making the answer incomplete.
  • Basically, the old theories were like saying, "Muon works great, but only if you believe in magic," or "Muon is okay, but it might take forever."

The New Discovery: The authors of this paper created a new, simpler proof. They didn't need magic or special mountain shapes. They showed that Muon is mathematically guaranteed to find the bottom, and they proved it does so faster than previously thought.

2. The Secret Sauce: How to Walk Faster

The paper discovered that Muon's speed depends on two main things: how big your steps are (Learning Rate) and how many hikers are in your group (Batch Size).

They found three "Golden Rules" for the fastest journey:

  • Rule A: The "Big Group" Strategy.
    If you keep your step size steady but make your group of hikers (the batch size) grow larger and larger as you go, Muon gets incredibly fast. It's like realizing that a larger team can clear the path faster than a small one. If you double the group size every step, the speed improves dramatically.
  • Rule B: The "Shrinking Step" Strategy.
    If you start with big steps and slowly make them smaller (a "diminishing learning rate"), Muon still works well, but it needs a specific trick to be the fastest.
  • Rule C: The "Super Combo."
    The absolute fastest way to reach the valley is to combine shrinking steps with a rapidly growing group size. This combination allows Muon to converge (finish the job) at a rate of $1/T$ (where $T$ is the number of steps).
    • Analogy: Imagine driving a car. Old optimizers were like driving at a constant speed. Muon with this new strategy is like a car that starts fast, then as it gets closer to the destination, it slows down just enough to turn perfectly, while simultaneously adding more engines to the car to keep momentum.
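The three Golden Rules can be sketched as simple schedules. This is an illustrative sketch only: the base batch size `b0`, growth factor, base rate `lr0`, and the $1/\sqrt{t}$ decay form are assumed placeholders, not the paper's exact constants.

```python
import math

def batch_size(t: int, b0: int = 16, growth: float = 2.0) -> int:
    """Rule A / C: exponentially growing batch size, b_t = b0 * growth**t."""
    return int(b0 * growth ** t)

def learning_rate(t: int, lr0: float = 0.02) -> float:
    """Rule B / C: diminishing learning rate, lr_t = lr0 / sqrt(t + 1).

    The 1/sqrt(t) form is a common diminishing schedule; the paper's
    exact schedule may differ.
    """
    return lr0 / math.sqrt(t + 1)

# Rule C ("Super Combo"): use both schedules together at each step t.
for t in range(4):
    print(t, batch_size(t), round(learning_rate(t), 4))
```

In practice the batch size cannot double forever, so implementations cap it at whatever the hardware allows; the theory describes the idealized schedule.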

3. Why This Matters

Before this paper, if you wanted to use Muon, you had to guess the settings. You might have been using a "good enough" setting, but not the best one.

This paper gives you the instruction manual. It tells engineers:

"If you want Muon to be the fastest optimizer possible, don't just pick random numbers. Make your batch size grow exponentially (double it every step) and shrink your learning rate carefully. If you do this, you will achieve the fastest convergence rate the theory currently guarantees."

Summary in a Nutshell

  • The Problem: Muon is a great new optimizer, but we didn't have a solid math proof for why it's fast or how to tune it perfectly.
  • The Solution: The authors wrote a new, simpler proof that holds under standard, realistic assumptions (no special mountain shapes required).
  • The Result: They proved Muon can be faster than previously thought.
  • The Takeaway: To get the best performance, pair a shrinking learning rate with a rapidly growing batch size. It's the "secret recipe" for making AI training faster and more stable.

In short, this paper took a mysterious, high-performing tool and gave us the blueprints to use it at its absolute peak potential.