Muon+: Towards Better Muon via One Additional Normalization Step

This paper introduces Muon+, a simple yet effective enhancement to the Muon optimizer that adds a normalization step after orthogonalization, demonstrating consistent improvements in training and validation perplexity across various model scales and architectures in compute-optimal training regimes.

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang

Published 2026-02-27

Imagine you are trying to teach a giant, super-smart robot (a Large Language Model) how to speak human language. To do this, you show it billions of sentences and let it learn by trial and error. This process is called pre-training.

The robot learns by adjusting its internal "knobs" (mathematical weights) whenever it makes a mistake. The tool it uses to decide how to turn those knobs is called an optimizer. For a long time, the industry standard tool has been called Adam or AdamW.

Recently, a new tool called Muon arrived on the scene and started doing a better job. It works by "straightening out" the robot's learning path so it doesn't get stuck in loops or go off-track. Think of Muon as a very strict coach who tells the robot: "Don't just move randomly; move in a perfectly straight, organized line."

The Problem: The Robot is Still a Bit Wobbly

Even with Muon's strict coaching, the robot's movements can still be a little unbalanced. Sometimes it pushes too hard in one direction and too little in another. It's like a dancer who is moving in a straight line but is leaning heavily to the left, making the dance look awkward and inefficient.

The Solution: Muon+ (The "Posture Check")

The authors of Muon+ asked a simple question: "What if, after the coach tells the robot to move in a straight line, we also give it a quick 'posture check' to make sure it's standing perfectly upright?"

They added one tiny extra step to the Muon process: Normalization.

Here is the analogy:

  1. The Old Way (Muon): The coach says, "Okay, take a step forward, but make sure your steps are at right angles to each other." The robot does this, but it might still be leaning forward or backward.
  2. The New Way (Muon+): The coach says, "Take a step forward at right angles. Now, pause and check your balance. If you're leaning, adjust your weight so you are perfectly centered before you take the next step."

That "pause and check" is the additional normalization step. It's simple, but it makes a huge difference.
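In optimizer terms, the "posture check" means normalizing the orthogonalized update before applying it. Here is a minimal NumPy sketch: the `newton_schulz` routine follows the standard orthogonalization iteration used in Muon, while the RMS-style normalization in `muon_plus_update` is an illustrative assumption, not necessarily the exact normalization the paper uses:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize the update matrix G (Muon's core
    step) via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:                       # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_plus_update(G, steps=5):
    """Muon+ sketch: orthogonalize, then add one normalization step.
    RMS normalization is a hypothetical choice for illustration."""
    O = newton_schulz(G, steps)
    rms = np.sqrt(np.mean(O ** 2)) + 1e-7  # "posture check"
    return O / rms
```

After `newton_schulz`, the update's singular values are all close to 1 (the "straight, organized line"); the extra division rescales the whole matrix to a fixed magnitude before the weight update, so no single step pushes harder than another.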

What Happened When They Tried It?

The researchers tested this new method on robots of all sizes, from small ones (130 million "brain cells," i.e., parameters) to massive ones (1 billion parameters). They also tested different robot architectures (GPT-style and LLaMA-style).

The results were amazing:

  • Better Grades: The robots trained with Muon+ learned faster and made fewer mistakes (lower "perplexity," which is just a fancy way of saying "confusion").
  • Stability: The robots didn't wobble as much. They could handle larger learning rates (learning faster) without crashing.
  • Long Haul: Even when they trained the robots for a very long time (using 200 times more data than usual), Muon+ kept performing better than the old method.
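A quick aside on "perplexity": it is literally the exponential of the model's average per-token confusion (its cross-entropy loss). A tiny stdlib illustration, not taken from the paper:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-likelihood) of the
    probabilities the model assigned to the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that gives probability 0.25 to every correct token is,
# on average, as confused as a 4-way guess:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

So "lower perplexity" means the model behaves as if it is choosing among fewer plausible options at each step.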

Why Does This Matter?

Training these giant AI models costs millions of dollars in electricity and computer power. Every tiny improvement in efficiency saves a lot of money and time.

The paper shows that you don't always need to invent a complex, new mathematical theory to get better results. Sometimes, you just need to add a simple "posture check" (normalization) to an already good system.

In a nutshell:

  • Muon is a great coach that organizes the robot's learning path.
  • Muon+ is that same coach, but it also makes sure the robot stands up straight before taking the next step.
  • The Result: The robot learns faster, makes fewer mistakes, and stays stable, saving time and money for everyone building AI.

It's a small tweak with a massive impact, proving that sometimes the simplest adjustments yield the best performance.
