Muon+: Towards Better Muon via One Additional Normalization Step

This paper introduces Muon+, a simple yet effective enhancement to the Muon optimizer that adds a normalization step after orthogonalization, demonstrating consistent improvements in training and validation perplexity across various model scales and architectures in compute-optimal training regimes.

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Zheng Zhang

Published 2026-02-27

Imagine you are trying to teach a giant, super-smart robot (a Large Language Model) how to speak human language. To do this, you show it billions of sentences and let it learn by trial and error. This process is called pre-training.

The robot learns by adjusting its internal "knobs" (mathematical weights) whenever it makes a mistake. The tool it uses to decide how to turn those knobs is called an optimizer. For a long time, the industry standard tool has been called Adam or AdamW.

Recently, a new tool called Muon arrived on the scene and started doing a better job. It works by "straightening out" the robot's learning path so it doesn't get stuck in loops or go off-track. Think of Muon as a very strict coach who tells the robot: "Don't just move randomly; move in a perfectly straight, organized line."

The Problem: The Robot is Still a Bit Wobbly

Even with Muon's strict coaching, the robot's movements can still be a little unbalanced. Sometimes it pushes too hard in one direction and too little in another. It's like a dancer who is moving in a straight line but is leaning heavily to the left, making the dance look awkward and inefficient.

The Solution: Muon+ (The "Posture Check")

The authors of Muon+ asked a simple question: "What if, after the coach tells the robot to move in a straight line, we also give it a quick 'posture check' to make sure it's standing perfectly upright?"

They added one tiny extra step to the Muon process: Normalization.

Here is the analogy:

  1. The Old Way (Muon): The coach says, "Okay, take a step forward, but make sure your steps are at right angles to each other." The robot does this, but it might still be leaning forward or backward.
  2. The New Way (Muon+): The coach says, "Take a step forward at right angles. Now, pause and check your balance. If you're leaning, adjust your weight so you are perfectly centered before you take the next step."

That "pause and check" is the additional normalization step. It's simple, but it makes a huge difference.
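In optimizer terms, the "posture check" means normalizing the orthogonalized update before applying it. Here is a minimal NumPy sketch: the `newton_schulz` routine follows the standard orthogonalization iteration used in Muon, while the RMS-style normalization in `muon_plus_update` is an illustrative assumption, not necessarily the exact normalization the paper uses:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize the update matrix G (Muon's core
    step) via the quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # scale so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:                       # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_plus_update(G, steps=5):
    """Muon+ sketch: orthogonalize, then add one normalization step.
    RMS normalization is a hypothetical choice for illustration."""
    O = newton_schulz(G, steps)
    rms = np.sqrt(np.mean(O ** 2)) + 1e-7  # "posture check"
    return O / rms
```

After `newton_schulz`, the update's singular values are all close to 1 (the "straight, organized line"); the extra division rescales the whole matrix to a fixed magnitude before the weight update, so no single step pushes harder than another.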

What Happened When They Tried It?

The researchers tested this new method on robots of all sizes, from small ones (130 million "brain cells," i.e., parameters) to massive ones (1 billion parameters). They also tested different robot architectures (GPT-style and LLaMA-style).

The results were amazing:

  • Better Grades: The robots trained with Muon+ learned faster and made fewer mistakes (lower "perplexity," which is just a fancy way of saying "confusion").
  • Stability: The robots didn't wobble as much. They could handle larger learning rates (learning faster) without crashing.
  • Long Haul: Even when they trained the robots for a very long time (using 200 times more data than usual), Muon+ kept performing better than the old method.
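A quick aside on "perplexity": it is literally the exponential of the model's average per-token confusion (its cross-entropy loss). A tiny stdlib illustration, not taken from the paper:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-likelihood) of the
    probabilities the model assigned to the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that gives probability 0.25 to every correct token is,
# on average, as confused as a 4-way guess:
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # ≈ 4.0
```

So "lower perplexity" means the model behaves as if it is choosing among fewer plausible options at each step.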

Why Does This Matter?

Training these giant AI models costs millions of dollars in electricity and computer power. Every tiny improvement in efficiency saves a lot of money and time.

The paper shows that you don't always need to invent a complex, new mathematical theory to get better results. Sometimes, you just need to add a simple "posture check" (normalization) to an already good system.

In a nutshell:

  • Muon is a great coach that organizes the robot's learning path.
  • Muon+ is that same coach, but it also makes sure the robot stands up straight before taking the next step.
  • The Result: The robot learns faster, makes fewer mistakes, and stays stable, saving time and money for everyone building AI.

It's a small tweak with a massive impact, proving that sometimes the simplest adjustments yield the best performance.
