On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer

This paper introduces a family of mean-normalized matrix operator norms and uses them to derive width-independent smoothness bounds for deep neural networks. These bounds lead to MOGA, a row/column-normalized optimizer that enables stable hyperparameter transfer across model widths and trains faster than Muon while maintaining competitive performance.

Ruihan Xu, Jiajin Li, Yiping Lu

Published Wed, 11 Ma

Imagine you are an architect designing skyscrapers. You have a blueprint that works perfectly for a 10-story building. But when you try to use that exact same blueprint for a 100-story tower, the building collapses. Why? Because the forces acting on the structure change as it gets bigger.

In the world of Artificial Intelligence (AI), we are currently building "skyscrapers" of code called Neural Networks. As we make these networks wider (adding more neurons, like adding more floors), the "learning rate"—the speed at which the AI learns—often breaks. A speed that works for a small network causes a huge network to crash or learn incredibly slowly.

This paper, titled "On the Width Scaling of Neural Optimizers," solves this problem by giving us a new set of blueprints that work for any size building.

Here is the breakdown of their discovery, using simple analogies.

1. The Problem: The "Speed Limit" Changes with Size

Think of training an AI like driving a car.

  • Small Network (City Driving): You can drive fast (high learning rate) because the roads are simple and short.
  • Large Network (Highway Driving): As the network gets wider, the "roads" get more complex and the car gets heavier. If you keep driving at the city speed limit, you might crash. If you slow down too much, you never get there.

Currently, when engineers make a network wider, they have to stop and guess a new speed limit. It's like having to re-calibrate your car's engine every time you add a new floor to the skyscraper. This is expensive and inefficient.
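Here is a toy numpy sketch (ours, not code from the paper) of why the speed limit shrinks with width: after one plain SGD step on a dense layer, the change in the layer's output scales roughly like learning rate × width, so a rate that is gentle at width 64 is violent at width 4096.

```python
import numpy as np

rng = np.random.default_rng(0)
lr = 0.1  # the same "speed limit" at every width

results = {}
for width in [64, 256, 1024, 4096]:
    x = rng.normal(size=width)     # input activations, entries of size O(1)
    gy = rng.normal(size=width)    # gradient flowing back into the layer
    dW = lr * np.outer(gy, x)      # one plain SGD update (a rank-1 matrix)
    change = dW @ x                # how much the layer's output moves
    results[width] = np.abs(change).mean()  # grows roughly like lr * width
    print(f"width={width:5d}  mean |change in output| = {results[width]:.1f}")
```

The culprit is the `dW @ x` term: it contains a factor of `||x||^2`, which grows linearly with width, so the same `lr` produces an ever-larger jolt as the layer widens.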

2. The Old Solution: "Muon" and the "Spectral Norm"

One popular method, called Muon, tries to fix this by looking at the "shape" of the weight updates. Imagine you are trying to flatten a crumpled piece of paper. Muon tries to smooth it out perfectly.

  • The Good News: It works well for medium-sized buildings.
  • The Bad News: The paper found that as the building gets very tall (very wide), the road starts to get bumpy. The math shows that the bumps grow with the square root of the width ($\sqrt{\text{width}}$). Eventually, the road becomes so bumpy that the car (the AI) can't drive smoothly anymore.
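A related toy fact (our illustration, not the paper's proof) makes the square-root factor visible: for a random width × width matrix whose entries have typical size 1, the spectral norm is roughly 2√width times larger than the per-entry average, so anything measured in the spectral norm silently picks up a √width factor as the network widens.

```python
import numpy as np

rng = np.random.default_rng(1)
ratios = {}
for n in [64, 256, 1024]:
    M = rng.normal(size=(n, n))        # matrix with entries of typical size 1
    spec = np.linalg.norm(M, ord=2)    # spectral norm = largest singular value
    rms = np.sqrt((M**2).mean())       # per-entry "average" size, ~1 at every n
    ratios[n] = spec / rms             # grows like 2 * sqrt(n)
    print(f"n={n:4d}  spectral/per-entry = {ratios[n]:6.1f}   2*sqrt(n) = {2*np.sqrt(n):.1f}")
```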

3. The New Solution: "MOGA" (The Universal Blueprint)

The authors propose a new optimizer called MOGA (Matrix Operator Geometry Aware). They realized that the problem wasn't just the car; it was the geometry of the road itself.

They introduced a concept called "Mean-Normalized Geometry."

  • The Analogy: Imagine you are measuring the height of people in a room.
    • Old Way: You measure everyone in inches. If you add 1,000 people, the total height of the room seems to explode. This makes the math messy.
    • MOGA Way: You measure the average height per person. No matter how many people you add, the average stays stable.

By switching to this "average" way of measuring the math, they found a geometry where the "road" stays smooth and flat, no matter how wide the network gets.
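The height analogy maps directly onto vector norms. A minimal numpy sketch (ours): the raw Euclidean norm of a random vector grows like √width, but dividing by √width — the "average height per person" — stays pinned near 1 at every size.

```python
import numpy as np

rng = np.random.default_rng(2)
avg = {}
for width in [64, 256, 1024, 4096]:
    v = rng.normal(size=width)            # "heights" of width people
    total = np.linalg.norm(v)             # total measurement: grows like sqrt(width)
    avg[width] = total / np.sqrt(width)   # mean-normalized: ~1 at every width
    print(f"width={width:5d}  total = {total:6.1f}  per-person = {avg[width]:.3f}")
```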

4. The Two Types of "Roads" (Row vs. Column)

The paper tests two specific ways to drive on this new road:

  • Column Normalization: This is like checking the weight of every column of the building. It keeps the math stable, but it forces the building materials (the weights) to get very thin and fragile as the building grows. It's stable, but maybe too restrictive.
  • Row Normalization (The Winner): This is like checking the weight of every row (or floor). The authors found that this method keeps the road smooth AND allows the building materials to stay strong.
    • The Result: This "Row Normalization" version of MOGA is the star of the show. It allows the AI to learn at the same speed whether it's a small network or a massive one.
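As a purely hypothetical sketch of the row-vs-column idea (the function names and the exact formula below are ours; the paper works with operator-norm geometry, and MOGA's actual update may differ), one can rescale each row or each column of the gradient to unit average size before stepping, so that no single row or column of the update can blow up with width:

```python
import numpy as np

def row_normalized_step(W, grad, lr, eps=1e-8):
    # rescale every ROW of the gradient to unit RMS, then take the step
    row_rms = np.sqrt((grad**2).mean(axis=1, keepdims=True)) + eps
    return W - lr * grad / row_rms

def col_normalized_step(W, grad, lr, eps=1e-8):
    # the same idea applied per COLUMN instead
    col_rms = np.sqrt((grad**2).mean(axis=0, keepdims=True)) + eps
    return W - lr * grad / col_rms

rng = np.random.default_rng(3)
W = rng.normal(size=(512, 512)) / np.sqrt(512)
# make the gradient's rows wildly different in scale on purpose
grad = rng.normal(size=(512, 512)) * rng.uniform(0.1, 10, size=(512, 1))

W_new = row_normalized_step(W, grad, lr=0.01)
update = (W - W_new) / 0.01
print(np.sqrt((update**2).mean(axis=1))[:3])  # each row of the update has RMS ~ 1
```

The design point: after normalization, the step taken along every row is the same average size regardless of how skewed the raw gradient was.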

5. The Real-World Test: GPT and LLaMA

The authors didn't just do math on a whiteboard; they built it. They tested their new optimizer on famous AI models (GPT-2 and LLaMA).

  • The Magic: They tuned the learning speed on a tiny model. Then, they took that exact same speed and applied it to a massive model.
  • The Outcome: The massive model learned perfectly without needing any re-tuning.
  • The Bonus: In the later stages of training (when the AI is trying to get very smart and the loss is very low), MOGA was actually faster and more stable than the current industry leaders (AdamW and Muon).
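The "tune small, copy big" protocol can be sketched in a few lines of Python (the helper names and the toy loss are ours, purely illustrative; real usage would train actual models). The point of a width-transferable optimizer is exactly that the best learning rate found at the narrow width remains the best one at every larger width:

```python
def find_best_lr(train, width, lr_grid):
    # grid-search the learning rate at a cheap, narrow width
    return min(lr_grid, key=lambda lr: train(width=width, lr=lr))

def scale_up(train, widths, lr):
    # reuse the SAME learning rate at every larger width: no re-tuning
    return {w: train(width=w, lr=lr) for w in widths}

def toy_train(width, lr):
    # stand-in for "final loss after training"; built so the optimal lr
    # is width-independent, which is the property transfer relies on
    return (lr - 0.01) ** 2 + 1.0 / width

best = find_best_lr(toy_train, width=64, lr_grid=[0.001, 0.003, 0.01, 0.03, 0.1])
losses = scale_up(toy_train, widths=[256, 1024, 4096], lr=best)
print(best, losses)
```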

Summary: Why This Matters

Think of this paper as inventing a universal transmission for cars.

  • Before: You had to manually change gears every time you switched from a bicycle to a truck.
  • Now: With MOGA, you have a transmission that automatically adjusts to the size of the vehicle. You can build a tiny AI or a giant AI, and you can use the exact same "learning speed" settings.

This saves massive amounts of time and money for AI researchers. Instead of spending weeks guessing the right settings for a new, bigger model, they can just copy the settings from the small one and get straight to work. It makes scaling up AI safer, faster, and more predictable.