HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

This paper introduces HTMuon, a heavy-tailed spectral correction method that improves upon the Muon optimizer. By preserving parameter interdependencies while inducing heavier-tailed weight spectra, HTMuon delivers consistent gains in LLM pretraining and image classification, and comes with theoretical convergence guarantees.

Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang

Published 2026-03-12

Imagine you are trying to teach a giant, complex robot (a Large Language Model) how to speak human language. To do this, you give it a massive library of books (data) and a teacher (an optimizer) who corrects its mistakes.

For a long time, the best teacher was a method called Adam. But recently, a new, smarter teacher named Muon arrived. Muon is special because it doesn't just look at mistakes one by one; it looks at how different parts of the robot's brain work together. It's like a coach who understands that if your left arm moves, your right leg might need to adjust, rather than treating every muscle in isolation.

However, the authors of this paper found a flaw in Muon's teaching style. They discovered that Muon was being too rigid and too fair in a way that actually hurt the robot's learning.

Here is the story of HTMuon, the new and improved teacher, explained simply.

1. The Problem: Muon's "One-Size-Fits-All" Mistake

Imagine Muon is a coach correcting a student's posture.

  • The Issue: Muon looks at every direction the student could move and says, "Okay, we will push you with exactly the same force in every direction."
  • Why this is bad: In reality, some directions are full of useful information (like "stand up straight"), while others are just noise (like "wiggle your toe randomly").
  • The Result: By pushing equally in all directions, Muon accidentally pushes just as hard on the "wiggle your toe" noise as it does on the "stand up straight" signal. This drowns out the important lessons with static.

Furthermore, the authors noticed that the best-trained robots in the world have a specific "fingerprint" in their brain structure called a Heavy-Tailed Spectrum.

  • The Analogy: Think of a city's population. A "light-tailed" city has everyone living in identical, small houses. A "heavy-tailed" city has a few massive skyscrapers (the most important connections) and many small cottages (the less important ones).
  • The Discovery: Muon forces the robot to build a city of identical small houses (light-tailed). But the best robots naturally build cities with skyscrapers (heavy-tailed). Muon was accidentally preventing the robot from building the best possible brain structure.
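The "identical houses vs. skyscrapers" picture corresponds to the singular-value spectrum of a weight matrix. Here is a minimal sketch of the contrast; the power-law exponent is purely illustrative, not a value from the paper:

```python
import numpy as np

# "Light-tailed" spectrum: every direction has the same weight
# (a city of identical small houses).
light = np.ones(100)

# "Heavy-tailed" spectrum: a power-law decay, as observed in well-trained
# networks (a few skyscrapers, many cottages). The exponent is illustrative.
k = np.arange(1, 101)
heavy = k ** -1.0

for name, s in [("light", light), ("heavy", heavy)]:
    s = s / s.sum()
    # Fraction of total "mass" held by the top 5 directions:
    # near-uniform for light-tailed, dominant for heavy-tailed.
    print(name, round(s[:5].sum(), 3))
```

For the light-tailed spectrum, the top 5 of 100 directions hold exactly 5% of the mass; for the heavy-tailed one, they hold most of it. That concentration is the "skyscraper" fingerprint.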

2. The Solution: HTMuon (The "Smart Filter")

The authors created HTMuon (Heavy-Tailed Muon). Think of HTMuon as Muon with a smart filter added to its brain.

  • How it works: Instead of pushing with equal force everywhere, HTMuon looks at the "noise" directions and says, "This is just static; let's turn the volume down." It pushes much harder on the important, signal-rich directions and much softer on the noisy ones.
  • The Magic Power: By turning down the noise, it allows the robot's brain to naturally develop those "skyscrapers" (the heavy-tailed spectrum) that are associated with high intelligence.

3. The Results: A Smarter, Faster Robot

The paper tested this new teacher on various tasks, from teaching robots to write stories (LLMs) to recognizing cats and dogs (Image Classification).

  • Better Grades: On the C4 dataset (a huge library of text), HTMuon helped the robot learn so well that its confusion (called "perplexity") dropped significantly compared to the old Muon. It's like a student going from a B+ to an A+ just by changing how they study.
  • Plug-and-Play: The best part? HTMuon isn't a whole new teacher you have to hire; it's a plugin for the one you already have. You can take any existing Muon setup and swap in HTMuon for an instant upgrade.
  • Speed: Usually, being smarter takes more time. HTMuon is slightly slower because it does extra math to filter the noise. However, the authors created "turbo modes" (accelerated versions) that make it almost as fast as the original Muon while keeping the smarts.

4. The Theory: Why It Works

The authors didn't just guess; they proved mathematically why this works.

  • They showed that HTMuon is essentially "steepest descent" (the fastest way down a hill) when the size of each step is measured with a specific, flexible yardstick (called the Schatten-q norm).
  • In simple terms: Muon was only allowed to take square steps. HTMuon is allowed to take rectangular steps, which fits the shape of the problem much better.
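For readers who want the actual object behind the "step shape" analogy, here is the standard definition of the Schatten-q norm (only the textbook definition is shown; the specific choice of q is in the paper):

```
\|W\|_{S_q} \;=\; \Big(\sum_{i=1}^{r} \sigma_i^{\,q}\Big)^{1/q},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ \text{the singular values of } W.
```

Two familiar special cases: q = 2 gives the Frobenius norm, and q → ∞ gives the spectral norm (just the largest singular value), which is the rigid "square step" regime associated with Muon. A general q interpolates between these, which is the "rectangular step" flexibility the authors exploit.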

Summary

Muon was a great teacher that understood how different parts of a brain connect, but it was too rigid and treated noise and signal the same.

HTMuon fixes this by:

  1. Listening to the signal: It amplifies the important directions.
  2. Ignoring the noise: It dampens the random static.
  3. Building a better brain: It allows the model to develop the "heavy-tailed" structure that nature seems to prefer for high intelligence.

The result is a training method that makes AI models smarter, more stable, and better at generalizing to new tasks, all while being compatible with the tools developers already use.