Imagine you are building a skyscraper. In the world of Artificial Intelligence (AI), these skyscrapers are called neural networks. To make them smarter, engineers usually do two things:
- Widen them: Add more rooms (neurons) on each floor.
- Add more floors: Make the building taller (deeper).
For a long time, building these AI skyscrapers was like trying to construct a tower of Jenga blocks while blindfolded. If you made the building wider, you had to guess new settings for the construction crew. If you made it taller, the whole thing often wobbled and collapsed, or the crew got confused and stopped learning.
This paper introduces a universal blueprint (called µP) that solves this problem. It tells engineers exactly how to adjust their tools and settings so that whether they build a 10-story house or a 10,000-story tower, the construction crew learns at the same steady pace, and the settings they used for the small house work perfectly for the giant tower.
Here is the breakdown using simple analogies:
1. The Problem: The "Jenga" Effect
When AI models get very deep (many layers), information has to travel from the bottom to the top.
- The Old Way (Standard Parameterization): Imagine passing a message down a line of people. If the line is short, the message arrives loud and clear. If the line is 1,000 people long, the message either gets whispered so quietly it disappears (vanishing) or everyone starts shouting so loud it becomes noise (exploding).
- The Result: The AI stops learning, or the engineers have to spend months re-tuning the settings for every new size of model. It's expensive and inefficient.
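The whisper/shout game has a direct numerical analogue. Here is a tiny NumPy sketch (a toy stack of random linear layers, not the paper's architecture) showing how a per-layer gain just slightly above or below 1 makes the signal explode or vanish by the time it reaches the top floor:

```python
import numpy as np

rng = np.random.default_rng(0)

def signal_norm_after(depth, width=64, gain=1.0):
    """Push a unit-length signal through `depth` random linear layers
    and return how "loud" it is at the top of the tower."""
    x = rng.standard_normal(width)
    x /= np.linalg.norm(x)
    for _ in range(depth):
        # Variance-preserving random init, nudged up or down by `gain`.
        W = gain * rng.standard_normal((width, width)) / np.sqrt(width)
        x = W @ x
    return np.linalg.norm(x)

# A per-layer gain of 1.1 "shouts" (explodes with depth);
# a gain of 0.9 "whispers" (vanishes with depth).
print(signal_norm_after(100, gain=1.1))  # very large
print(signal_norm_after(100, gain=0.9))  # near zero
```

At 100 layers the two runs already differ by many orders of magnitude, which is exactly why hand-tuned settings for a short tower stop working on a tall one.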
2. The Solution: The "Spectral µP" Blueprint
The authors of this paper developed a new set of rules called Spectral µP. Think of this as a "magic scale" for the construction crew.
Instead of just guessing how big the bricks should be, this blueprint says:
"As you add more floors, you must shrink the size of the bricks and the speed of the workers in a very specific mathematical way."
They call this a "Spectral Condition." In plain English, it's a rule about the "volume" of the signals traveling through the network, where "volume" is measured by the spectral norm: the largest factor by which a layer can stretch a signal passing through it.
- The Rule: If you double the depth of the network, you must shrink the "volume" of the weight updates by a specific factor (like turning down a radio dial) so the signal doesn't get distorted as it travels up the tower.
3. The "Residual" Elevator
Modern AI buildings use "Residual Connections." Imagine an elevator that skips floors. Instead of walking up every single step, you can jump from Floor 1 to Floor 100 directly.
- The Challenge: Previous blueprints worked well for wide buildings but failed when the building got very tall because the "elevator" would either shoot you to the sky or drop you in the basement.
- The Fix: This paper's blueprint calculates exactly how strong the elevator cables need to be. It ensures that whether you have 4 floors or 256 floors, the elevator moves smoothly without breaking the building.
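A toy version of the "elevator cable" fix is to damp each residual branch by a depth-dependent factor. The sketch below uses alpha = 1/sqrt(depth), a common depth-µP-style choice picked here for illustration; the paper works out the precise scaling:

```python
import numpy as np

def residual_tower(depth, width=64, scaled=True, seed=0):
    """Ride the residual 'elevator' through `depth` blocks and report
    how large the signal is at the top floor."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    x /= np.linalg.norm(x)
    # Cable strength: shrink each branch's contribution as the
    # tower gets taller (illustrative exponent, not the paper's).
    alpha = 1.0 / np.sqrt(depth) if scaled else 1.0
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        x = x + alpha * np.tanh(W @ x)
    return np.linalg.norm(x)

for depth in (4, 64, 256):
    print(depth, residual_tower(depth, scaled=True),
                 residual_tower(depth, scaled=False))
```

With the cable fixed at full strength (alpha = 1), the top-floor signal keeps growing as floors are added; with the depth-scaled cable it stays in the same range whether the tower has 4 blocks or 256.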
4. The "One-Size-Fits-All" Toolbelt
One of the coolest parts of this paper is that it works for any construction tool (optimizer).
- Whether the crew uses a hammer (SGD), a power drill (AdamW), or a laser cutter (Muon-Kimi), this blueprint tells you exactly how to adjust the power settings.
- The Benefit: You can tune your settings on a small, cheap model (a 4-story house). Once you find the perfect settings, you can copy-paste them to a massive model (a 10,000-story skyscraper), and it will work perfectly immediately. No more guessing!
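The copy-paste workflow above can be sketched as pure bookkeeping: tune on the cheap proxy once, then derive the big model's settings from the same base numbers. The two rules below (hidden-layer learning rate shrinking with width, as in µP with an Adam-style optimizer, and residual strength shrinking with sqrt of depth) are common µP-style choices used for illustration, not the paper's exact table:

```python
def transfer_hparams(base, width, depth):
    """Map hyperparameters tuned on a small proxy model to a larger one.

    `base` holds the proxy's size and tuned settings; the returned dict
    holds the settings to copy-paste into the big model.
    Illustrative muP-style rules, not the paper's exact prescription.
    """
    return {
        # Hidden-matrix learning rate shrinks as the model widens.
        "lr_hidden": base["lr"] * base["width"] / width,
        # Residual-branch strength shrinks as the model deepens.
        "residual_alpha": base["alpha"] * (base["depth"] / depth) ** 0.5,
    }

# Tuned once on a cheap 4-layer, 256-wide proxy...
proxy = {"width": 256, "depth": 4, "lr": 0.01, "alpha": 1.0}
# ...then copied to a much bigger model with no re-tuning.
print(transfer_hparams(proxy, width=4096, depth=64))
```

The key property is that the big model's settings are a deterministic function of the proxy's, so the expensive sweep happens only once, at the small scale.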
5. Real-World Proof
The authors tested this on a language model (like a mini-GPT).
- Without the blueprint: As they made the model wider and deeper, the training became unstable, and the "best" settings changed every time.
- With the blueprint: The model stayed stable. The "best" settings for a small model worked perfectly for the giant model. The AI learned consistently, regardless of how big it got.
Summary Analogy
Imagine you are teaching a dog to fetch.
- Old Way: You teach a Chihuahua. You use a small ball and a short throw. When you try to teach a Great Dane, you have to guess: "Should I throw a bigger ball? A smaller one? Should I stand further away?" Guess wrong, and the training falls apart.
- New Way (µP): You discover a rule: "No matter the dog's size, the ball should always be 1% of the dog's weight, and the throw distance should be 1% of the dog's height."
- Result: You teach the Chihuahua once. You write down the rule. Now you can teach a Great Dane, a Wolf, or a Lion, and they will all learn to fetch perfectly using the exact same rule.
In a nutshell: This paper gives AI engineers a simple, mathematical "rule of thumb" to build massive, deep AI models without the headache of constant re-tuning, ensuring they learn efficiently no matter how big they get.