To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters

Imagine you are teaching a student to solve a complex puzzle. You have two different teachers (optimizers) to choose from: Teacher SGD (the old-school, reliable veteran) and Teacher Muon (the flashy, high-speed new hire).

For a long time, everyone used Teacher SGD. Then, Teacher Muon arrived, and everyone was amazed because Muon could teach the student to solve the puzzle much faster. In fact, Muon was so fast that many people started using it as the default teacher for almost everything.

But this paper asks a scary question: "Just because Muon is faster, does that mean the student actually understands the puzzle better?"

The authors argue that Muon's speed comes with a hidden cost: it removes a natural "simplicity bias" that helps students learn the true rules of the game, rather than just memorizing the answers.

Here is the breakdown using simple analogies:

1. The Two Teaching Styles

Teacher SGD (The Slow, Steady Climber)
Imagine a hiker descending a mountain to find the lowest valley (the best solution).

How they move: They take it step-by-step. They explore one path, get stuck on a small hill (a "saddle point"), figure it out, and then move to the next.
The "Simplicity Bias": Because they move slowly, they naturally learn the biggest, most obvious features of the mountain first. They learn the main trail before they worry about the tiny pebbles.
The Result: They might take longer to reach the bottom, but they build a solid, robust understanding of the terrain. They don't get distracted by small details until they've mastered the big picture.

Teacher Muon (The Speed Demon)
Imagine a hiker with a jetpack who can fly over the small hills.

How they move: They don't get stuck on the small hills. They zoom through the landscape, learning everything at once. They learn the big trails and the tiny pebbles simultaneously.
The "Speed": This is why Muon is so fast. It skips the "saddle points" that slow down the old teacher.
The Problem: Because it learns everything at the same time, it loses the "simplicity bias." It doesn't prioritize the most important rules first. It treats a tiny, irrelevant detail the same as a fundamental law of physics.

2. The Cost of Speed: Memorization vs. Understanding

The paper shows two scenarios where Teacher Muon's speed actually hurts the student:

Scenario A: The "Shared Secret" (Learning the Underlying Structure)

Imagine you are teaching a student to recognize animals.

The Setup: You show them a picture of a cat in a red box, then a cat in a blue box. You want them to learn that "Cat = Cat," regardless of the box color.
Teacher SGD: Learns the shape of the cat first. Once they understand the cat, they realize the box color doesn't matter. They can correctly identify a cat in a green box (which they've never seen) because they learned the shared secret (the shape).
Teacher Muon: Because it learns everything at once, it memorizes "Red Box + Cat" and "Blue Box + Cat" as two separate, unrelated facts. When you show them a "Green Box + Cat," they get confused. They didn't learn the rule; they just memorized the specific examples they saw.
The Takeaway: Muon is great at memorizing, but bad at finding the common rules that apply to new situations.

Scenario B: The "Fake Clue" (Spurious Correlations)

Imagine a test where the main question points to the answer "Red," but some of the papers also carry a tiny speck of dust that happens to line up with the answer.

Teacher SGD: First, they look at the main question (the "Red" clue). They ignore the dust until they are sure about the main answer. If the dust is a fake clue, SGD is less likely to be tricked by it early on.
Teacher Muon: Because it learns everything at the same speed, it latches onto the "dust" clue just as quickly as the "Red" clue. If the dust happens to be a fake pattern in the training data, Muon might decide that "Dust = Answer" is a real rule.
The Takeaway: Muon is more likely to fall for "cheating" or "fake clues" in the data because it doesn't wait to see which clues are actually important.

3. The Big Picture: Why This Matters

For a long time, the AI world has been obsessed with speed. "Who can train the model the fastest?" is the main question.

This paper is a wake-up call. It says: "Stop just looking at the stopwatch."

The Trade-off: Muon trades understanding for speed.
The Risk: If you use Muon for critical tasks (like medical diagnosis or self-driving cars), you might get a model that is fast but fragile. It might memorize the training data perfectly but fail when faced with a slightly different real-world situation.
The Lesson: When choosing an optimizer (a tool to train AI), you shouldn't just ask, "Is it fast?" You should ask, "What kind of habits does this tool teach my model?"

Summary Analogy

Teacher SGD is like a craftsman who builds a chair slowly, ensuring the legs are solid and the joints are tight. It takes time, but the chair won't wobble.
Teacher Muon is like a 3D printer that spits out a chair in seconds. It's amazing! But if the design has a flaw, the printer just prints the flaw perfectly. It doesn't "think" about whether the chair is stable; it just follows the instructions instantly.

The Conclusion: Speed is great, but we need to make sure our AI isn't just memorizing the test answers. Sometimes, the "slow" way is the only way to learn the right way.

1. Problem Statement

While the Muon optimizer (MomentUm Orthogonalized by Newton-Schulz) has recently gained popularity for its superior training speed compared to established optimizers like SGD and Adam, its underlying inductive biases remain poorly understood.

The Gap: Current literature focuses primarily on Muon's speed benefits (faster wall-clock convergence) but offers little theoretical insight into the trajectory it takes through the loss landscape or the functional properties of the solutions it converges to.
The Core Question: Does the mechanism driving Muon's speedup introduce detrimental biases? Specifically, does Muon sacrifice the "simplicity bias" inherent in Stochastic Gradient Descent (SGD), potentially leading to models that memorize data rather than learning generalizable underlying structures?

2. Methodology

The authors employ a combination of theoretical analysis on simplified models and empirical validation on complex tasks.

A. Theoretical Framework: Deep Linear Networks

To isolate the effects of the optimizer, the authors analyze Deep Linear Networks (2-layer networks without non-linearities).

Spectral Gradient Descent (Spectral GD): They introduce a tractable theoretical proxy for Muon called Spectral GD. This variant assumes exact Singular Value Decomposition (SVD) and removes momentum to isolate the effect of orthogonalization (setting all non-zero singular values of the update step to 1).
Comparison: They contrast the learning dynamics of Spectral GD against standard Gradient Descent (GD).
- GD Dynamics: Proven to exhibit "saddle-to-saddle" dynamics. It learns singular components sequentially, starting from the largest singular values. This gradual increase in solution rank acts as an implicit regularizer (simplicity bias).
- Spectral GD Dynamics: Proven to learn all singular components simultaneously at the same rate. It bypasses the sequential rank-increasing phase, effectively removing the simplicity bias.
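
The contrast between the two update rules can be sketched in a few lines. Writing the gradient's SVD as $G = U \Sigma V^\top$ (notation assumed here, not taken from the paper), GD steps along $G$ while Spectral GD steps along $U V^\top$. The sketch below, assuming NumPy, shows the exact-SVD orthogonalization and, for comparison, a classic cubic Newton-Schulz iteration that approximates it without an SVD. The function names, tolerance, and iteration count are illustrative; actual Muon implementations may use a differently tuned polynomial and add momentum.

```python
import numpy as np

def orthogonalize_svd(grad, tol=1e-12):
    """Exact version: set every non-zero singular value of `grad` to 1."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    keep = s > tol  # drop numerically zero directions
    return u[:, keep] @ vt[keep, :]

def orthogonalize_newton_schulz(grad, iters=30):
    """Approximate version: classic cubic Newton-Schulz iteration.

    Assumes `grad` has full rank. Scaling by the Frobenius norm puts all
    singular values in (0, 1], where x -> 1.5*x - 0.5*x**3 pushes them to 1.
    """
    x = grad / np.linalg.norm(grad)
    for _ in range(iters):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def gd_step(w, grad, lr):
    """Plain GD: large gradient directions move the weights further."""
    return w - lr * grad

def spectral_gd_step(w, grad, lr):
    """Spectral GD (no momentum): every gradient direction moves equally."""
    return w - lr * orthogonalize_svd(grad)

# Hypothetical gradient with known, very unequal singular values (3, 1, 0.5).
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.normal(size=(4, 3)))
v, _ = np.linalg.qr(rng.normal(size=(3, 3)))
grad = u @ np.diag([3.0, 1.0, 0.5]) @ v.T

exact = orthogonalize_svd(grad)
approx = orthogonalize_newton_schulz(grad)
update_singular_values = np.linalg.svd(exact, compute_uv=False)
print(update_singular_values)
```

The singular values of the orthogonalized update should all equal 1 up to rounding, which is the sense in which the update discards how large each gradient direction was.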

B. Empirical Experiments

The authors validate their theoretical findings using two specific experimental setups:

The "Routing" Task (Shared Representations): A multi-modal setup where different input domains map to a shared underlying task (mapping numbers $\{1,2,3,4\}$ to vectors). The model must learn a low-rank shared representation to generalize to unseen input-output pairs.
Spurious Correlations (MNIST): A classification task where a specific pixel intensity is correlated with the class label (spurious feature) alongside the actual digit shape. The goal is to determine which optimizer relies more on the true feature (digit shape) versus the spurious one.
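
One way such a spurious feature could be constructed is sketched below, assuming NumPy. The pixel location, the intensity encoding (`label / 9`), the correlation strength, and the random arrays standing in for MNIST are all hypothetical; the summary does not specify how the paper builds its dataset.

```python
import numpy as np

def add_spurious_pixel(images, labels, correlation, rng,
                       num_classes=10, pixel=(0, 0)):
    """Write a class-dependent intensity into one pixel of each image.

    With probability `correlation` the pixel encodes the true label;
    otherwise it encodes a uniformly random class (which may coincide
    with the true label by chance).
    """
    images = images.copy()
    follow = rng.random(len(labels)) < correlation
    random_labels = rng.integers(0, num_classes, size=len(labels))
    cue = np.where(follow, labels, random_labels)
    images[:, pixel[0], pixel[1]] = cue / (num_classes - 1)
    return images, follow

# Stand-in data with MNIST-like shapes (not real MNIST).
rng = np.random.default_rng(0)
images = rng.random((1000, 28, 28))
labels = rng.integers(0, 10, size=1000)

spurious_images, follow = add_spurious_pixel(images, labels, 0.9, rng)
print(follow.mean())  # fraction of images whose pixel was tied to the label
```

A model trained on data like this can reach high training accuracy from the single pixel alone, so evaluating on images where the pixel is decorrelated from the label shows how much it relied on the digit shape.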

3. Key Contributions & Results

A. Theoretical Insight: Loss of Simplicity Bias

GD Trajectory: Standard GD follows a path where the rank of the solution matrix increases gradually. It fully learns the dominant modes (largest singular values) before moving to smaller ones. This acts as an implicit curriculum, prioritizing simple, high-variance structures.
Muon/Spectral GD Trajectory: Spectral GD, the theoretical proxy for Muon, learns all modes simultaneously. This explains the speed (it does not stall at saddle points waiting for the next mode to activate), but it also eliminates the sequential learning process.
Consequence: The loss of simplicity bias means Muon does not naturally prioritize "simple" underlying structures over complex, high-frequency noise.
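
A small simulation can make the two trajectories concrete. The sketch below, assuming NumPy, trains a 2-layer linear network $W_2 W_1$ to fit a diagonal target with modes 8, 2, and 0.5, once with GD and once with exact-SVD orthogonalized updates. The target, learning rate, scaled-identity initialization, and 90% threshold are illustrative choices, not the paper's setup; the aligned initialization in particular keeps the modes decoupled so the effect is easy to read off.

```python
import numpy as np

def orthogonalize(grad, tol=1e-12):
    """Set every non-zero singular value of `grad` to 1 (exact SVD)."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    keep = s > tol
    return u[:, keep] @ vt[keep, :]

def train(target, use_spectral, lr=0.01, init_scale=0.01, steps=400):
    """Fit w2 @ w1 to `target`; return each mode's learned fraction per step."""
    d = target.shape[0]
    w1 = init_scale * np.eye(d)
    w2 = init_scale * np.eye(d)
    history = []
    for _ in range(steps):
        residual = w2 @ w1 - target      # gradient of 0.5 * ||w2 w1 - target||^2
        g1 = w2.T @ residual             # ... with respect to w1
        g2 = residual @ w1.T             # ... with respect to w2
        if use_spectral:
            g1, g2 = orthogonalize(g1), orthogonalize(g2)
        w1 = w1 - lr * g1
        w2 = w2 - lr * g2
        history.append(np.diag(w2 @ w1) / np.diag(target))
    return np.array(history)

def others_when_top_mode_learned(history, threshold=0.9):
    """Step at which the largest mode first passes `threshold`, plus the
    learned fractions of the remaining modes at that step."""
    step = int(np.argmax(history[:, 0] >= threshold))
    return step, history[step, 1:]

target = np.diag([8.0, 2.0, 0.5])
gd_history = train(target, use_spectral=False)
sp_history = train(target, use_spectral=True)
gd_step, gd_others = others_when_top_mode_learned(gd_history)
sp_step, sp_others = others_when_top_mode_learned(sp_history)
print("GD:      ", gd_step, gd_others)
print("Spectral:", sp_step, sp_others)
```

In this toy setting, by the time GD has learned 90% of the largest mode the other two are still close to zero, so the intermediate solution is effectively rank 1. Under the orthogonalized update the smaller modes are already near their targets when the largest mode gets there, so no low-rank intermediate stage appears. Step counts are not comparable between the two runs because the same learning rate means different things for the two update rules.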

B. Experimental Result 1: Failure to Learn Shared Representations

Setup: In the routing task, the model is trained on a subset of input-output pairs but must generalize to unseen pairs by discovering the shared 4-dimensional structure.
Outcome:
- SGD: Successfully learns the shared low-rank representation (rank $\approx$ 4), generalizing perfectly to unseen pairs.
- Muon (Spectral GD): Fits the training pairs perfectly but fails to generalize. It memorizes the specific training pairs, resulting in a solution with a significantly higher effective rank (heavy-tailed spectrum).
Implication: Muon struggles to uncover common underlying structures across tasks, preferring to memorize specific data points.
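
To make "effective rank" concrete, here is a minimal sketch assuming NumPy and one common definition: the exponential of the entropy of the normalized singular values. The summary does not say which definition the paper uses, so treat this choice, and the two example spectra, as assumptions.

```python
import numpy as np

def effective_rank(matrix, eps=1e-12):
    """Entropy-based effective rank of a non-zero matrix.

    Equals k when exactly k singular values are non-zero and equal, and
    grows smoothly as mass spreads over more singular values.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    p = s / s.sum()          # normalize the spectrum to a distribution
    p = p[p > eps]           # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(p * np.log(p))))

# Hypothetical spectra: a clean rank-4 solution vs. a heavy-tailed one.
low_rank = np.diag([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
heavy_tail = np.diag(1.0 / np.arange(1, 9))

print(effective_rank(low_rank))
print(effective_rank(heavy_tail))
```

The first matrix has effective rank 4, matching the shared 4-dimensional structure, while the heavy-tailed spectrum scores higher because its small singular values still carry a noticeable share of the total.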

C. Experimental Result 2: Sensitivity to Spurious Features

Setup: Training on MNIST with a spurious pixel feature.
Outcome:
- SGD: Initially learns the dominant feature (the actual digit shape) before learning the spurious pixel. This delay allows for early stopping to prevent over-reliance on the spurious feature.
- Muon: Learns both the digit shape and the spurious feature simultaneously.
Implication: Muon converges faster to a solution that relies on spurious correlations. If the spurious feature is strong, Muon may lock onto it immediately, whereas SGD's sequential learning offers a window where the model relies on the true signal.

4. Significance and Conclusion

Re-evaluating Optimizer Choice: The paper argues that "faster is not always better." The speedup of Muon comes at the cost of losing the implicit regularization (simplicity bias) provided by SGD.
Inductive Bias Matters: Different optimizers induce different inductive biases. While Muon is beneficial in settings requiring balanced learning of imbalanced modalities (as shown in prior work), it is detrimental in settings requiring the discovery of shared, low-rank structures or robustness against spurious correlations.
Future Directions: The authors suggest that future optimizer design should not just focus on convergence speed but must explicitly consider the inductive biases introduced. Ideal optimizers might need to traverse the loss landscape similarly to GD (to preserve simplicity bias) but find ways to break saddle points more efficiently without losing the sequential learning structure.

Summary: The paper serves as a critical reminder that the choice of optimizer fundamentally alters the functional behavior of a model. Muon's orthogonalization mechanism accelerates training by removing the sequential "simplicity bias" of SGD, which can lead to solutions that are faster to train but less generalizable and more prone to overfitting spurious features.
