HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

This paper introduces HTMuon, a heavy-tailed spectral correction method that improves upon the Muon optimizer. By preserving parameter interdependencies while inducing heavier-tailed weight spectra, HTMuon delivers consistent gains in LLM pretraining and image classification, and comes with theoretical convergence guarantees.

Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang

Published 2026-03-12

Imagine you are trying to teach a giant, complex robot (a Large Language Model) how to speak human language. To do this, you give it a massive library of books (data) and a teacher (an optimizer) who corrects its mistakes.

For a long time, the best teacher was a method called Adam. But recently, a new, smarter teacher named Muon arrived. Muon is special because it doesn't just look at mistakes one by one; it looks at how different parts of the robot's brain work together. It's like a coach who understands that if your left arm moves, your right leg might need to adjust, rather than treating every muscle in isolation.

However, the authors of this paper found a flaw in Muon's teaching style. They discovered that Muon was being too rigid and too fair in a way that actually hurt the robot's learning.

Here is the story of HTMuon, the new and improved teacher, explained simply.

1. The Problem: Muon's "One-Size-Fits-All" Mistake

Imagine Muon is a coach correcting a student's posture.

  • The Issue: Muon looks at every direction the student could move and says, "Okay, we will push you with exactly the same force in every direction."
  • Why this is bad: In reality, some directions are full of useful information (like "stand up straight"), while others are just noise (like "wiggle your toe randomly").
  • The Result: By pushing equally in all directions, Muon accidentally pushes just as hard on the "wiggle your toe" noise as it does on the "stand up straight" signal. This drowns out the important lessons with static.

Furthermore, the authors noticed that the best-trained robots in the world have a specific "fingerprint" in their brain structure called a Heavy-Tailed Spectrum.

  • The Analogy: Think of a city's population. A "light-tailed" city has everyone living in identical, small houses. A "heavy-tailed" city has a few massive skyscrapers (the most important connections) and many small cottages (the less important ones).
  • The Discovery: Muon forces the robot to build a city of identical small houses (light-tailed). But the best robots naturally build cities with skyscrapers (heavy-tailed). Muon was accidentally preventing the robot from building the best possible brain structure.
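The "identical houses vs. skyscrapers" picture corresponds to the singular-value spectrum of a weight matrix. Here is a minimal sketch of the contrast; the power-law exponent is purely illustrative, not a value from the paper:

```python
import numpy as np

# "Light-tailed" spectrum: every direction has the same weight
# (a city of identical small houses).
light = np.ones(100)

# "Heavy-tailed" spectrum: a power-law decay, as observed in well-trained
# networks (a few skyscrapers, many cottages). The exponent is illustrative.
k = np.arange(1, 101)
heavy = k ** -1.0

for name, s in [("light", light), ("heavy", heavy)]:
    s = s / s.sum()
    # Fraction of total "mass" held by the top 5 directions:
    # near-uniform for light-tailed, dominant for heavy-tailed.
    print(name, round(s[:5].sum(), 3))
```

For the light-tailed spectrum, the top 5 of 100 directions hold exactly 5% of the mass; for the heavy-tailed one, they hold most of it. That concentration is the "skyscraper" fingerprint.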

2. The Solution: HTMuon (The "Smart Filter")

The authors created HTMuon (Heavy-Tailed Muon). Think of HTMuon as Muon with a smart filter added to its brain.

  • How it works: Instead of pushing with equal force everywhere, HTMuon looks at the "noise" directions and says, "This is just static; let's turn the volume down." It pushes much harder on the important, signal-rich directions and much softer on the noisy ones.
  • The Magic Power: By turning down the noise, it allows the robot's brain to naturally develop those "skyscrapers" (the heavy-tailed spectrum) that are associated with high intelligence.

3. The Results: A Smarter, Faster Robot

The paper tested this new teacher on various tasks, from teaching robots to write stories (LLMs) to recognizing cats and dogs (Image Classification).

  • Better Grades: On the C4 dataset (a huge library of text), HTMuon helped the robot learn so well that its confusion (called "perplexity") dropped significantly compared to the old Muon. It's like a student going from a B+ to an A+ just by changing how they study.
  • Plug-and-Play: The best part? HTMuon isn't a whole new teacher you have to hire; it's a plugin for the one you already have. You can take any existing Muon setup and swap in HTMuon for an instant upgrade.
  • Speed: Usually, being smarter takes more time. HTMuon is slightly slower because it does extra math to filter the noise. However, the authors created "turbo modes" (accelerated versions) that make it almost as fast as the original Muon while keeping the smarts.

4. The Theory: Why It Works

The authors didn't just guess; they proved mathematically why this works.

  • They showed that HTMuon is essentially "steepest descent" (the fastest way down a hill) when the size of each step is measured with a specific, flexible yardstick (called the Schatten-q norm).
  • In simple terms: Muon was only allowed to take square steps. HTMuon is allowed to take rectangular steps, which fits the shape of the problem much better.
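For readers who want the actual object behind the "step shape" analogy, here is the standard definition of the Schatten-q norm (only the textbook definition is shown; the specific choice of q is in the paper):

```
\|W\|_{S_q} \;=\; \Big(\sum_{i=1}^{r} \sigma_i^{\,q}\Big)^{1/q},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ \text{the singular values of } W.
```

Two familiar special cases: q = 2 gives the Frobenius norm, and q → ∞ gives the spectral norm (just the largest singular value), which is the rigid "square step" regime associated with Muon. A general q interpolates between these, which is the "rectangular step" flexibility the authors exploit.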

Summary

Muon was a great teacher that understood how different parts of a brain connect, but it was too rigid and treated noise and signal the same.

HTMuon fixes this by:

  1. Listening to the signal: It amplifies the important directions.
  2. Ignoring the noise: It dampens the random static.
  3. Building a better brain: It allows the model to develop the "heavy-tailed" structure that nature seems to prefer for high intelligence.

The result is a training method that makes AI models smarter, more stable, and better at generalizing to new tasks, all while being compatible with the tools developers already use.