On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer

This paper introduces a family of mean-normalized matrix operator norms and uses them to derive width-independent smoothness bounds for deep neural networks. These bounds lead to MOGA, a row/column-normalized optimizer that enables stable hyperparameter transfer across model widths and trains faster than Muon while maintaining competitive performance.

Ruihan Xu, Jiajin Li, Yiping Lu

Published Wed, 11 Ma

Imagine you are an architect designing skyscrapers. You have a blueprint that works perfectly for a 10-story building. But when you try to use that exact same blueprint for a 100-story tower, the building collapses. Why? Because the forces acting on the structure change as it gets bigger.

In the world of Artificial Intelligence (AI), we are currently building "skyscrapers" of code called Neural Networks. As we make these networks wider (adding more neurons, like adding more floors), the "learning rate"—the speed at which the AI learns—often breaks. A speed that works for a small network causes a huge network to crash or learn incredibly slowly.

This paper, titled "On the Width Scaling of Neural Optimizers," solves this problem by giving us a new set of blueprints that work for any size building.

Here is the breakdown of their discovery, using simple analogies.

1. The Problem: The "Speed Limit" Changes with Size

Think of training an AI like driving a car.

  • Small Network (City Driving): You can drive fast (high learning rate) because the roads are simple and short.
  • Large Network (Highway Driving): As the network gets wider, the "roads" get more complex and the car gets heavier. If you keep driving at the city speed limit, you might crash. If you slow down too much, you never get there.

Currently, when engineers make a network wider, they have to stop and guess a new speed limit. It's like having to re-calibrate your car's engine every time you add a new floor to the skyscraper. This is expensive and inefficient.
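Here is a toy numpy sketch (ours, not code from the paper) of why the speed limit shrinks with width: after one plain SGD step on a dense layer, the change in the layer's output scales roughly like learning rate × width, so a rate that is gentle at width 64 is violent at width 4096.

```python
import numpy as np

rng = np.random.default_rng(0)
lr = 0.1  # the same "speed limit" at every width

results = {}
for width in [64, 256, 1024, 4096]:
    x = rng.normal(size=width)     # input activations, entries of size O(1)
    gy = rng.normal(size=width)    # gradient flowing back into the layer
    dW = lr * np.outer(gy, x)      # one plain SGD update (a rank-1 matrix)
    change = dW @ x                # how much the layer's output moves
    results[width] = np.abs(change).mean()  # grows roughly like lr * width
    print(f"width={width:5d}  mean |change in output| = {results[width]:.1f}")
```

The culprit is the `dW @ x` term: it contains a factor of `||x||^2`, which grows linearly with width, so the same `lr` produces an ever-larger jolt as the layer widens.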

2. The Old Solution: "Muon" and the "Spectral Norm"

One popular method, called Muon, tries to fix this by looking at the "shape" of the weight updates. Imagine you are trying to flatten a crumpled piece of paper. Muon tries to smooth it out perfectly.

  • The Good News: It works well for medium-sized buildings.
  • The Bad News: The paper found that as the building gets very tall (very wide), the road starts to get bumpy. The math shows that the bumps grow with the square root of the width ($\sqrt{\text{width}}$). Eventually, the road becomes so bumpy that the car (the AI) can't drive smoothly anymore.
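A related toy fact (our illustration, not the paper's proof) makes the square-root factor visible: for a random width × width matrix whose entries have typical size 1, the spectral norm is roughly 2√width times larger than the per-entry average, so anything measured in the spectral norm silently picks up a √width factor as the network widens.

```python
import numpy as np

rng = np.random.default_rng(1)
ratios = {}
for n in [64, 256, 1024]:
    M = rng.normal(size=(n, n))        # matrix with entries of typical size 1
    spec = np.linalg.norm(M, ord=2)    # spectral norm = largest singular value
    rms = np.sqrt((M**2).mean())       # per-entry "average" size, ~1 at every n
    ratios[n] = spec / rms             # grows like 2 * sqrt(n)
    print(f"n={n:4d}  spectral/per-entry = {ratios[n]:6.1f}   2*sqrt(n) = {2*np.sqrt(n):.1f}")
```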

3. The New Solution: "MOGA" (The Universal Blueprint)

The authors propose a new optimizer called MOGA (Matrix Operator Geometry Aware). They realized that the problem wasn't just the car; it was the geometry of the road itself.

They introduced a concept called "Mean-Normalized Geometry."

  • The Analogy: Imagine you are measuring the height of people in a room.
    • Old Way: You measure everyone in inches. If you add 1,000 people, the total height of the room seems to explode. This makes the math messy.
    • MOGA Way: You measure the average height per person. No matter how many people you add, the average stays stable.

By switching to this "average" way of measuring the math, they found a geometry where the "road" stays smooth and flat, no matter how wide the network gets.
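The height analogy maps directly onto vector norms. A minimal numpy sketch (ours): the raw Euclidean norm of a random vector grows like √width, but dividing by √width — the "average height per person" — stays pinned near 1 at every size.

```python
import numpy as np

rng = np.random.default_rng(2)
avg = {}
for width in [64, 256, 1024, 4096]:
    v = rng.normal(size=width)            # "heights" of width people
    total = np.linalg.norm(v)             # total measurement: grows like sqrt(width)
    avg[width] = total / np.sqrt(width)   # mean-normalized: ~1 at every width
    print(f"width={width:5d}  total = {total:6.1f}  per-person = {avg[width]:.3f}")
```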

4. The Two Types of "Roads" (Row vs. Column)

The paper tests two specific ways to drive on this new road:

  • Column Normalization: This is like checking the weight of every column of the building. It keeps the math stable, but it forces the building materials (the weights) to get very thin and fragile as the building grows. It's stable, but maybe too restrictive.
  • Row Normalization (The Winner): This is like checking the weight of every row (or floor). The authors found that this method keeps the road smooth AND allows the building materials to stay strong.
    • The Result: This "Row Normalization" version of MOGA is the star of the show. It allows the AI to learn at the same speed whether it's a small network or a massive one.
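As a purely hypothetical sketch of the row-vs-column idea (the function names and the exact formula below are ours; the paper works with operator-norm geometry, and MOGA's actual update may differ), one can rescale each row or each column of the gradient to unit average size before stepping, so that no single row or column of the update can blow up with width:

```python
import numpy as np

def row_normalized_step(W, grad, lr, eps=1e-8):
    # rescale every ROW of the gradient to unit RMS, then take the step
    row_rms = np.sqrt((grad**2).mean(axis=1, keepdims=True)) + eps
    return W - lr * grad / row_rms

def col_normalized_step(W, grad, lr, eps=1e-8):
    # the same idea applied per COLUMN instead
    col_rms = np.sqrt((grad**2).mean(axis=0, keepdims=True)) + eps
    return W - lr * grad / col_rms

rng = np.random.default_rng(3)
W = rng.normal(size=(512, 512)) / np.sqrt(512)
# make the gradient's rows wildly different in scale on purpose
grad = rng.normal(size=(512, 512)) * rng.uniform(0.1, 10, size=(512, 1))

W_new = row_normalized_step(W, grad, lr=0.01)
update = (W - W_new) / 0.01
print(np.sqrt((update**2).mean(axis=1))[:3])  # each row of the update has RMS ~ 1
```

The design point: after normalization, the step taken along every row is the same average size regardless of how skewed the raw gradient was.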

5. The Real-World Test: GPT and LLaMA

The authors didn't just do math on a whiteboard; they built it. They tested their new optimizer on famous AI models (GPT-2 and LLaMA).

  • The Magic: They tuned the learning speed on a tiny model. Then, they took that exact same speed and applied it to a massive model.
  • The Outcome: The massive model learned perfectly without needing any re-tuning.
  • The Bonus: In the later stages of training (when the AI is trying to get very smart and the loss is very low), MOGA was actually faster and more stable than the current industry leaders (AdamW and Muon).
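The "tune small, copy big" protocol can be sketched in a few lines of Python (the helper names and the toy loss are ours, purely illustrative; real usage would train actual models). The point of a width-transferable optimizer is exactly that the best learning rate found at the narrow width remains the best one at every larger width:

```python
def find_best_lr(train, width, lr_grid):
    # grid-search the learning rate at a cheap, narrow width
    return min(lr_grid, key=lambda lr: train(width=width, lr=lr))

def scale_up(train, widths, lr):
    # reuse the SAME learning rate at every larger width: no re-tuning
    return {w: train(width=w, lr=lr) for w in widths}

def toy_train(width, lr):
    # stand-in for "final loss after training"; built so the optimal lr
    # is width-independent, which is the property transfer relies on
    return (lr - 0.01) ** 2 + 1.0 / width

best = find_best_lr(toy_train, width=64, lr_grid=[0.001, 0.003, 0.01, 0.03, 0.1])
losses = scale_up(toy_train, widths=[256, 1024, 4096], lr=best)
print(best, losses)
```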

Summary: Why This Matters

Think of this paper as inventing a universal transmission for cars.

  • Before: You had to manually change gears every time you switched from a bicycle to a truck.
  • Now: With MOGA, you have a transmission that automatically adjusts to the size of the vehicle. You can build a tiny AI or a giant AI, and you can use the exact same "learning speed" settings.

This saves massive amounts of time and money for AI researchers. Instead of spending weeks guessing the right settings for a new, bigger model, they can just copy the settings from the small one and get straight to work. It makes scaling up AI safer, faster, and more predictable.