Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?

This paper analyzes the scaling laws of signSGD in linear regression under a power-law random features model, demonstrating that its unique noise-reshaping effect and the application of warmup-stable-decay schedules can yield steeper compute-optimal risk reduction than SGD, particularly in noise-dominant regimes with fast feature decay.

Jihwan Kim, Dogyoon Song, Chulhee Yun

Published 2026-03-03

Imagine you are trying to teach a giant robot (an AI) to predict the future by showing it millions of examples. To do this, you have two main levers to pull:

  1. The Brain Size (Model Size): How many neurons does the robot have?
  2. The Study Time (Training Steps): How many examples does it see?

You also have a limited amount of Compute Budget (like a fixed amount of electricity or money). The big question in AI research is: How should I split my budget between building a bigger brain and letting it study longer to get the best results?

This paper investigates a specific way of teaching the robot called signSGD (pronounced "sign-S-G-D") and compares it to the standard method, SGD.

The Two Teachers: SGD vs. signSGD

  • SGD (The Detailed Teacher): This teacher looks at every single example and calculates the exact direction to move the robot's brain. It says, "You were off by 0.05, so move left by 0.05." It's precise but can be slow and gets confused by noisy data (like a student trying to study in a loud cafeteria).
  • signSGD (The "Yes/No" Teacher): This teacher is much simpler. It doesn't care about how much you were wrong, only which direction you were wrong. It says, "You were off, so move left!" It ignores the magnitude. This is actually how modern AI giants (like the ones powering chatbots) are often trained, because it's faster and uses less memory.
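
The difference between the two teachers fits in one line of code. Here is a toy sketch on a tiny linear-regression problem; the learning rates, dimensions, and noise level are illustrative placeholders, not the paper's power-law random features setup:

```python
import numpy as np

def sgd_step(w, x, y, lr):
    # Gradient of the squared error for one example: (prediction - target) * x.
    grad = (x @ w - y) * x
    return w - lr * grad  # step size scales with how wrong we are

def signsgd_step(w, x, y, lr):
    grad = (x @ w - y) * x
    return w - lr * np.sign(grad)  # keep only the direction of each coordinate

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])   # the "treasure" both learners walk toward
w_sgd = np.zeros(3)
w_sign = np.zeros(3)
for _ in range(2000):
    x = rng.normal(size=3)
    y = x @ w_true + 0.1 * rng.normal()  # noisy label
    w_sgd = sgd_step(w_sgd, x, y, lr=0.05)
    w_sign = signsgd_step(w_sign, x, y, lr=0.01)
print(w_sgd, w_sign)  # both end up near w_true
```

The only change is wrapping the gradient in `np.sign(...)`: signSGD keeps the direction of each coordinate and throws away the magnitude, which is also why it needs so little memory per parameter.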

The Big Discovery: When the "Yes/No" Teacher Wins

The authors worked out the math of how these two teachers perform as you scale up the robot's size and study time. They found two magical effects that make signSGD special:

1. The "Self-Adjusting Compass" (Drift-Normalization)

Imagine you are walking toward a treasure.

  • SGD takes steps proportional to how wrong it is. If you are far away, you take big steps; if you are close, you take small steps. But if the terrain is bumpy (noisy), you might overshoot the treasure.
  • signSGD has a magic compass. When the robot is far from the answer (high error), the compass makes the steps feel "heavier" and more effective. When the robot is close, it naturally slows down.
  • The Result: This self-adjusting mechanism allows signSGD to learn faster in certain scenarios, especially when the data is messy.
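
The "compass" is just arithmetic: each signSGD step averages sign(true gradient + noise), and that average shrinks on its own as the true gradient fades relative to the noise. A minimal sketch (the gradient and noise values below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
noise_std = 1.0  # how "bumpy" the terrain is

avg_steps = {}
for true_grad in [5.0, 1.0, 0.1]:
    # Stochastic gradient = true gradient + minibatch noise.
    g = true_grad + noise_std * rng.normal(size=100_000)
    # Average direction signSGD actually moves, per unit learning rate:
    avg_steps[true_grad] = np.sign(g).mean()
print(avg_steps)
```

Far from the treasure (large true gradient), nearly every sample agrees on the direction, so the average step is close to the full step size. Near the treasure, the noise flips the sign at random, and the average step shrinks automatically; no one had to tune anything.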

2. The "Noise Filter" (Noise-Shaping)

This is the most surprising part.

  • SGD is like trying to hear a whisper in a storm. The louder the storm (noise), the harder it is to hear. If you turn up the volume (learning rate) to hear better, the storm gets even louder, drowning out the signal.
  • signSGD is like wearing noise-canceling headphones. Because it only listens to the direction (sign) and ignores the volume of the error, the "storm" of noise doesn't get louder just because you turn up the volume.
  • The Result: In situations where the data is very noisy, signSGD can actually get better results by using a larger learning rate, whereas SGD would get confused. This allows signSGD to reach a lower error rate (a "steeper slope") than SGD in specific conditions.
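
The headphones analogy can also be checked numerically. In this illustrative sketch (numbers are placeholders), the random "kick" SGD receives each step grows with the noise level, while signSGD's kick is capped by the learning rate no matter how loud the noise gets:

```python
import numpy as np

rng = np.random.default_rng(2)
lr = 0.1

kicks = {}
for noise_std in [0.1, 1.0, 10.0]:
    noise = noise_std * rng.normal(size=100_000)  # pure gradient noise, no signal
    sgd_kick = np.std(lr * noise)                 # grows in proportion to the noise
    sign_kick = np.std(lr * np.sign(noise))       # stays near lr, however loud the noise
    kicks[noise_std] = (sgd_kick, sign_kick)
print(kicks)
```

For SGD, cranking up the learning rate amplifies the storm along with the signal. For signSGD, the per-step perturbation is bounded by the learning rate itself, which is why a larger learning rate can still pay off in noisy regimes.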

The Secret Sauce: The "Warmup-Stable-Decay" Schedule

The paper also looked at how to change the teacher's instructions over time. They tested a strategy called Warmup-Stable-Decay (WSD), which is like a study plan:

  1. Warmup: Start slow to get the robot comfortable.
  2. Stable: Keep a steady pace while it learns the core concepts.
  3. Decay: Slowly reduce the speed at the end to fine-tune the details.

They found that when the data has a specific structure (features whose importance drops off quickly, the "fast feature decay" regime) and noise dominates, this schedule acts like a noise filter. The stable phase keeps the robot moving forward steadily, while the decay phase prevents the "storm" of noise from messing up the final details. This combination (signSGD + WSD) creates the most efficient path to the best possible performance for a given compute budget.
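
The three-phase study plan is easy to write down as a learning-rate function. This is a generic sketch of a warmup-stable-decay schedule; the peak learning rate and phase fractions here are illustrative placeholders, not the paper's tuned values:

```python
def wsd_lr(step, total_steps, peak_lr=0.01,
           warmup_frac=0.05, decay_frac=0.2):
    """Warmup-stable-decay schedule: ramp up, hold, then anneal to zero."""
    warmup_end = int(warmup_frac * total_steps)
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup_end:
        return peak_lr * (step + 1) / warmup_end          # 1) linear warmup
    if step < decay_start:
        return peak_lr                                     # 2) stable plateau
    remaining = total_steps - step                         # 3) linear decay
    return peak_lr * remaining / (total_steps - decay_start)

# Sample the schedule at a few points over a 100-step run.
print([round(wsd_lr(s, 100), 4) for s in (0, 10, 50, 85, 99)])
```

The long stable phase lets signSGD keep making steady progress at a large learning rate, and the final decay quiets the remaining noise before training ends.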

The Bottom Line

In the world of training massive AI models:

  • SGD is the traditional, precise method.
  • signSGD is the modern, efficient method used by top AI labs.

This paper works out mathematically when signSGD comes out ahead. It shows that by ignoring the "volume" of mistakes and focusing only on the "direction," and by pairing that with a smart study schedule, we can build better AI models with the same amount of computing power, at least in the right conditions: noisy data with fast-decaying features. It's like realizing that sometimes, knowing which way to go is more important than knowing exactly how far you need to walk.
