NuMuon: Nuclear-Norm-Constrained Muon for Compressible LLM Training

The Big Problem: Giant Brains, Tiny Backpacks

Imagine you have built a super-intelligent robot brain (a Large Language Model or LLM) that can write code, solve math problems, and tell jokes. It's amazing, but it's also huge. It's like trying to carry a library of encyclopedias in a tiny backpack.

Because these models are so big, they are expensive to run and hard to put on phones or small computers. To fix this, engineers try to "compress" them: they shrink the backpack without losing the important books inside.

The Current Situation: The "Muon" Optimizer

To train these robot brains, we use a tool called an optimizer. Think of the optimizer as a coach that tells the brain how to learn from its mistakes.

Recently, a new coach named Muon became very popular.

How it works: Unlike older coaches who micromanage every single muscle (parameter), Muon looks at the whole body and makes big, smooth, full-body movements. It's very efficient and makes the brain learn faster.
The Surprise: The researchers in this paper discovered something weird. Even though Muon is a "full-body" coach that doesn't try to limit the brain's size, the brains it trains naturally end up having a lot of redundancy.
- Analogy: Imagine a chef who is told to use every single spice in the kitchen. Surprisingly, the chef ends up using only 20% of the spices for 90% of the flavor. The brain learns to be "low-rank," meaning it relies on a few key patterns rather than millions of random ones.
The Catch: While these Muon-trained brains are somewhat compressible, if you try to shrink them too much (like trying to fit that library into a shoebox), the brain starts to forget things and performs poorly. The compression is "brittle."

The Solution: Enter "NuMuon"

The researchers asked: "What if we could teach the brain to be naturally small and efficient from the very beginning, while still keeping Muon's speed?"

They created a new coach called NuMuon.

The Analogy: The "Budget" Coach

Imagine you are training an artist to paint a masterpiece.

Old Coach (AdamW): Tells the artist to use every brushstroke possible, but eventually, the artist realizes they only need a few colors.
Muon Coach: Tells the artist to use big, sweeping strokes. The artist naturally finds a few dominant colors, but if you force them to use only two colors later, the painting looks muddy.
NuMuon Coach: This coach says, "Use big sweeping strokes, BUT you have a strict budget. You can only use the top 3 colors for your main strokes."

How NuMuon does this:

The Nuclear Norm Budget: This is a fancy math term that, in this setting, basically means "limit the number of important colors you use in each stroke." NuMuon forces the brain to focus its learning energy on the most important patterns (the top singular vectors of each update) and ignore the noise.
The Result: The brain learns a structure that is designed to be compressed. It's like building a house with a blueprint that already has foldable walls.

Why This Matters: The "Foldable House"

The paper shows that NuMuon-trained models are like foldable houses.

Training: They learn about as fast and as well as Muon-trained models.
Compression: When you try to shrink them (compress them), they fold up far more gracefully. They keep much more of their "brainpower" than Muon-trained models, even when you cut them down to 20% of their original size.

Real-world impact:

Faster Speed: A compressed NuMuon model runs much faster on a phone or a server.
Better Quality: Even when shrunk, it answers questions better than a standard model that was shrunk.
Cost: It saves money on electricity and hardware because you don't need a supercomputer to run it.

Summary in One Sentence

The researchers found that a new training method (Muon) accidentally makes AI brains somewhat easy to shrink, so they tweaked it (NuMuon) to teach the brains to be small and efficient from day one, producing AI that keeps far more of its smarts when shrunk to fit in your pocket.

1. Problem Statement

The rapid scaling of Large Language Models (LLMs) has led to significant challenges in memory usage and deployment costs. While model compression techniques (such as low-rank factorization) are essential for practical deployment, their effectiveness depends heavily on the intrinsic structure of the trained weight matrices.
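To make the storage arithmetic behind low-rank factorization concrete, here is a minimal sketch. It assumes the common scheme in which an m x n weight matrix is replaced by two factors of shapes m x r and r x n, and it reads a "compression rate" as the fraction of parameters removed. The function names are illustrative and are not taken from the paper.

```python
def factorized_params(m: int, n: int, r: int) -> int:
    """Parameters stored by a rank-r factorization W ~ A @ B (A: m x r, B: r x n)."""
    return r * (m + n)


def max_rank_for_rate(m: int, n: int, rate: float) -> int:
    """Largest rank whose two factors keep at most (1 - rate) of the m*n parameters.

    Assumption: `rate` is the fraction of parameters removed (0.8 means keep 20%).
    """
    budget = (1.0 - rate) * m * n
    return int(budget // (m + n))


# Illustrative layer size; not a shape reported in the paper.
m, n = 4096, 4096
for rate in (0.4, 0.6, 0.8):
    r = max_rank_for_rate(m, n, rate)
    print(f"rate={rate:.1f}  rank={r}  params={factorized_params(m, n, r)} of {m * n}")
```

The higher the compression rate, the smaller the rank that fits the budget, which is why the quality of the compressed model depends on how much of each weight matrix lives in its top few singular directions.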

The Gap: Most state-of-the-art optimizers (like AdamW) exhibit an implicit low-rank bias, making models naturally compressible. However, the weight-space structure induced by the recently proposed Muon optimizer, which uses full-rank orthogonalized updates to improve convergence and training stability, had not been characterized.
The Challenge: It was unclear whether Muon-trained models could be compressed effectively. The low-rank structure they do exhibit turns out to be "brittle": performance degrades rapidly at aggressive compression rates.
Goal: The authors aim to understand Muon's implicit bias and develop a variant that explicitly enforces a low-rank structure during training to improve compressibility without sacrificing the optimization benefits of Muon.

2. Methodology: NuMuon

The paper introduces NuMuon, an optimizer that augments Muon with a nuclear-norm constraint on the update direction.

A. Empirical Observation

The authors first demonstrated that despite Muon using full-rank, orthogonalized updates, the resulting weight matrices naturally exhibit a pronounced low-rank structure (low stable rank) throughout training. However, this emergent structure is not robust enough for high-rate compression.
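The stable rank mentioned above has a standard definition, the squared Frobenius norm divided by the squared spectral norm, and can be computed from the singular values. The sketch below uses NumPy on small synthetic matrices; the matrices are illustrative stand-ins, not real model weights.

```python
import numpy as np


def stable_rank(W: np.ndarray) -> float:
    """Stable rank ||W||_F^2 / ||W||_2^2, a smooth proxy for the rank of W."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values in descending order
    return float(np.sum(s ** 2) / (s[0] ** 2))


rng = np.random.default_rng(0)
# Synthetic examples only: one flat spectrum and one rapidly decaying spectrum.
flat = np.linalg.qr(rng.standard_normal((64, 64)))[0]    # all singular values equal 1
spiky = flat @ np.diag(0.5 ** np.arange(64)) @ flat.T    # singular values 1, 1/2, 1/4, ...
print(stable_rank(flat))   # close to 64
print(stable_rank(spiky))  # close to 4/3
```

A matrix can be full rank in the strict sense and still have a small stable rank, which is the situation the authors report for Muon-trained weights.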

B. Theoretical Formulation

NuMuon reframes the Muon update step through the lens of Linear Minimization Oracles (LMOs) over norm balls.

Muon's View: Muon minimizes a linear objective over a spectral-norm ball ($\|\Delta W\|_2 \leq \rho$), resulting in a full-rank orthogonal update ($\Delta W = -\rho UV^\top$), where $U$ and $V$ contain the left and right singular vectors of the momentum matrix.
NuMuon's View: NuMuon adds a nuclear-norm budget ($\|\Delta W\|_* \leq \tau$) to the constraint set. The new feasible set is the intersection of a spectral-norm ball and a nuclear-norm ball:
$\mathcal{W}^* := \{ \Delta W \mid \|\Delta W\|_2 \leq \rho, \|\Delta W\|_* \leq \tau \}$
Closed-Form Solution: The authors prove that the LMO for this constrained set reduces to a linear program over singular values. The optimal solution is a top-$k$ singular vector update:
$\Delta W^* = -\rho \sum_{i=1}^k u_i v_i^\top$
where $u_i$ and $v_i$ are the $i$-th left and right singular vectors of the momentum matrix and $k = \lfloor \tau / \rho \rfloor$. This effectively truncates the momentum update to its top-$k$ singular directions, forcing the weights to evolve along a lower-dimensional subspace.
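The two update directions can be compared side by side in a few lines of NumPy. This is a sketch under stated assumptions, not the authors' implementation: it uses an exact SVD for clarity (Muon in practice approximates $UV^\top$ with Newton-Schulz iterations), the function names are hypothetical, and a random matrix stands in for the momentum. It follows the top-$k$ formula quoted above; when $\tau / \rho$ is not an integer, the exact optimum of the underlying linear program would also place the leftover budget $\tau - k\rho$ on direction $k+1$, which this sketch omits.

```python
import numpy as np


def muon_direction(M: np.ndarray, rho: float) -> np.ndarray:
    """LMO over the spectral-norm ball: -rho * U V^T over all singular directions of M."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return -rho * (U @ Vt)


def numuon_direction(M: np.ndarray, rho: float, tau: float) -> np.ndarray:
    """Top-k update for the intersection of the spectral (rho) and nuclear (tau) balls."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    k = min(int(np.floor(tau / rho)), Vt.shape[0])  # k = floor(tau / rho), capped at rank
    return -rho * (U[:, :k] @ Vt[:k])


rng = np.random.default_rng(0)
M = rng.standard_normal((6, 5))                   # stand-in for a momentum matrix
D_muon = muon_direction(M, rho=0.5)
D_nu = numuon_direction(M, rho=0.5, tau=1.5)      # k = floor(1.5 / 0.5) = 3
print("Muon singular values:  ", np.linalg.svd(D_muon, compute_uv=False).round(3))
print("NuMuon singular values:", np.linalg.svd(D_nu, compute_uv=False).round(3))
```

Every nonzero singular value of either update equals $\rho$; the only difference is how many directions receive it.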

C. Practical Implementation

To make NuMuon scalable for billion-parameter models:

Efficient SVD: Instead of computing a full SVD (which is expensive), NuMuon uses a Randomized Block Krylov method to approximate the top-$k$ singular vectors efficiently.
Rank Scheduling: Since early training often requires higher rank to explore the loss landscape, NuMuon employs a rank scheduler (e.g., cosine decay) that starts with a higher rank and anneals to a lower rank ($k$) as training progresses.
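Both ingredients above can be sketched briefly. The code below is an illustration under assumptions rather than the paper's implementation: `block_krylov_topk` follows the textbook form of randomized block Krylov iteration, `cosine_rank` is one plausible cosine-decay schedule, and the parameter names (`iters`, `k_start`, `k_end`, `total_steps`) and the test matrix are hypothetical.

```python
import math

import numpy as np


def block_krylov_topk(A: np.ndarray, k: int, iters: int = 4, seed: int = 0):
    """Approximate the top-k singular triplets of A with a block Krylov subspace."""
    rng = np.random.default_rng(seed)
    block = A @ rng.standard_normal((A.shape[1], k))   # random starting block
    blocks = []
    for _ in range(iters + 1):
        block, _ = np.linalg.qr(block)                 # re-orthonormalize for stability
        blocks.append(block)
        block = A @ (A.T @ block)                      # one more power of A A^T
    Q, _ = np.linalg.qr(np.hstack(blocks))             # basis of the Krylov subspace
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)  # small SVD in that basis
    return Q @ Ub[:, :k], s[:k], Vt[:k]


def cosine_rank(step: int, total_steps: int, k_start: int, k_end: int) -> int:
    """Cosine decay from k_start at step 0 to k_end at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    k = k_end + 0.5 * (k_start - k_end) * (1.0 + math.cos(math.pi * progress))
    return int(round(k))


# Synthetic 40 x 30 matrix with three dominant singular values (illustrative only).
rng = np.random.default_rng(1)
U0, _ = np.linalg.qr(rng.standard_normal((40, 30)))
V0, _ = np.linalg.qr(rng.standard_normal((30, 30)))
spectrum = np.concatenate([[10.0, 8.0, 6.0], np.full(27, 0.1)])
A = (U0 * spectrum) @ V0.T
U, s, Vt = block_krylov_topk(A, k=3)
print("approximate top singular values:", s.round(6))
print("rank schedule:", [cosine_rank(t, 100, 64, 8) for t in (0, 25, 50, 75, 100)])
```

The Krylov routine only multiplies by `A` and `A.T` and factorizes thin matrices, which is what makes it cheaper than a full SVD when `k` is much smaller than the matrix dimensions.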

D. Convergence Analysis

The paper provides theoretical convergence guarantees for NuMuon in non-convex settings. It establishes a nuclear-norm stationarity bound, showing that convergence depends on the "tail energy" of the gradient (the energy outside the top-$k$ components). Empirical results indicate that this tail energy is small in practice, which supports the theoretical assumptions.
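One way to make "tail energy" tangible is to measure how much of a matrix's singular-value mass lies beyond its top-$k$ directions. The sketch below assumes a nuclear-norm version of that quantity, normalized by the total; the paper's exact definition may differ, and the function name and synthetic gradient are hypothetical.

```python
import numpy as np


def relative_tail_energy(G: np.ndarray, k: int) -> float:
    """Share of the singular-value mass of G outside its top-k directions.

    Assumption: tail energy = sum of singular values beyond the k-th,
    divided by the full nuclear norm. The paper may define it differently.
    """
    s = np.linalg.svd(G, compute_uv=False)  # descending order
    total = s.sum()
    if total == 0.0:
        return 0.0
    return float(s[k:].sum() / total)


rng = np.random.default_rng(0)
# Synthetic "gradient": a rank-2 signal plus small dense noise (illustrative only).
signal = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 8))
G = signal + 1e-3 * rng.standard_normal((10, 8))
for k in (0, 1, 2, 4):
    print(k, relative_tail_energy(G, k))
```

When the tail share at the chosen `k` is small, truncating the update to its top-`k` directions discards little of the gradient signal, which is the regime the bound relies on.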

3. Key Contributions

Discovery of Implicit Bias: The authors reveal that Muon-trained models inherently possess low-rank structure despite full-rank updates, making them compressible but brittle under aggressive compression.
NuMuon Optimizer: Proposal of a novel optimizer that explicitly constrains the update direction via a nuclear-norm budget, reducing the update to a top-$k$ singular vector operation.
Theoretical Guarantees: Extension of Muon's convergence analysis to the nuclear-norm constrained setting, providing bounds on stationarity.
Empirical Validation: Demonstration that NuMuon-trained models achieve significantly better compression-quality trade-offs compared to both AdamW and standard Muon.

4. Experimental Results

The authors evaluated NuMuon on models ranging from 0.6B to 1.8B parameters (Qwen3, Olmo2, Llama3) trained on the FineWeb-EDU dataset.

Training Dynamics: NuMuon tracks Muon's convergence closely, achieving comparable training and validation perplexity.
Compressibility: Under State-of-the-Art (SoTA) compression pipelines (ASVD, SVD-LLM, Dobi-SVD) at compression rates of 40% to 80%:
- NuMuon models maintained significantly lower perplexity than Muon and AdamW baselines.
- At 80% compression, NuMuon improved downstream task performance by up to 55.9% compared to Muon-trained models.
- Validation perplexity was reduced by up to 99.8% relative to Muon baselines at high compression rates.
Efficiency: NuMuon models achieved faster inference throughput for a fixed perplexity target, particularly in moderate-to-extreme compression regimes.
Subspace Alignment: Analysis showed that NuMuon updates remain more aligned with the dominant spectral subspace of the weights (lower Grassmann distance) compared to Muon, explaining the improved robustness to SVD-based compression.
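The subspace-alignment measurement can be illustrated with principal angles. The sketch below assumes the geodesic form of the Grassmann distance (the Euclidean norm of the principal angles between two subspaces of equal dimension); the paper may use a different variant, and the helper names and toy matrix are hypothetical.

```python
import numpy as np


def grassmann_distance(A: np.ndarray, B: np.ndarray) -> float:
    """Geodesic distance between the column spaces of A and B (same number of columns)."""
    Qa, _ = np.linalg.qr(A)                       # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)                       # orthonormal basis for span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))   # principal angles
    return float(np.linalg.norm(angles))


def topk_left_subspace(W: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis of the dominant k-dimensional left singular subspace of W."""
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    return U[:, :k]


# Toy example: a diagonal "weight" matrix whose top-2 subspace is span(e1, e2).
W = np.diag([5.0, 4.0, 3.0, 2.0, 1.0])
top2 = topk_left_subspace(W, 2)
aligned = np.eye(5)[:, :2]       # same subspace as top2
orthogonal = np.eye(5)[:, 3:5]   # orthogonal to top2
print(grassmann_distance(top2, aligned))     # near 0
print(grassmann_distance(top2, orthogonal))  # sqrt(2) * pi / 2
```

A smaller distance between the dominant subspace of the weights and that of the updates means training keeps reinforcing the same directions that an SVD-based compressor later retains.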

5. Significance

This work bridges the gap between optimization efficiency and deployment efficiency.

For Training: It offers a principled way to control the effective dimensionality of LLM weights during training without sacrificing the convergence speed of advanced optimizers like Muon.
For Deployment: It enables the training of models that are "compression-native," meaning they are inherently designed to withstand aggressive low-rank factorization. This is crucial for deploying LLMs on resource-constrained hardware (e.g., edge devices) where memory and bandwidth are strict bottlenecks.
Broader Impact: The findings suggest that the "low-rank phenomenon" in LLMs is not just a property of specific optimizers like Adam but can be actively shaped and enhanced through nuclear-norm constraints, opening new avenues for designing compression-aware training pipelines.