Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?

This paper analyzes the scaling laws of signSGD in linear regression under a power-law random features model, demonstrating that its unique noise-reshaping effect and the application of warmup-stable-decay schedules can yield steeper compute-optimal risk reduction than SGD, particularly in noise-dominant regimes with fast feature decay.

Jihwan Kim, Dogyoon Song, Chulhee Yun

Published 2026-03-03

Imagine you are trying to teach a giant robot (an AI) to predict the future by showing it millions of examples. To do this, you have two main levers to pull:

  1. The Brain Size (Model Size): How many neurons does the robot have?
  2. The Study Time (Training Steps): How many examples does it see?

You also have a limited amount of Compute Budget (like a fixed amount of electricity or money). The big question in AI research is: How should I split my budget between building a bigger brain and letting it study longer to get the best results?

This paper investigates a specific way of teaching the robot called signSGD (pronounced "sign-S-G-D") and compares it to the standard method, SGD.

The Two Teachers: SGD vs. signSGD

  • SGD (The Detailed Teacher): This teacher looks at every single example and calculates the exact direction to move the robot's brain. It says, "You were off by 0.05, so move left by 0.05." It's precise but can be slow and gets confused by noisy data (like a student trying to study in a loud cafeteria).
  • signSGD (The "Yes/No" Teacher): This teacher is much simpler. It doesn't care about how much you were wrong, only which direction you were wrong. It says, "You were off, so move left!" It ignores the magnitude. This is actually how modern AI giants (like the ones powering chatbots) are often trained, because it's faster and uses less memory.
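
The difference between the two teachers fits in one line of code. Here is a toy sketch on a tiny linear-regression problem; the learning rates, dimensions, and noise level are illustrative placeholders, not the paper's power-law random features setup:

```python
import numpy as np

def sgd_step(w, x, y, lr):
    # Gradient of the squared error for one example: (prediction - target) * x.
    grad = (x @ w - y) * x
    return w - lr * grad  # step size scales with how wrong we are

def signsgd_step(w, x, y, lr):
    grad = (x @ w - y) * x
    return w - lr * np.sign(grad)  # keep only the direction of each coordinate

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])   # the "treasure" both learners walk toward
w_sgd = np.zeros(3)
w_sign = np.zeros(3)
for _ in range(2000):
    x = rng.normal(size=3)
    y = x @ w_true + 0.1 * rng.normal()  # noisy label
    w_sgd = sgd_step(w_sgd, x, y, lr=0.05)
    w_sign = signsgd_step(w_sign, x, y, lr=0.01)
print(w_sgd, w_sign)  # both end up near w_true
```

The only change is wrapping the gradient in `np.sign(...)`: signSGD keeps the direction of each coordinate and throws away the magnitude, which is also why it needs so little memory per parameter.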

The Big Discovery: When the "Yes/No" Teacher Wins

The authors worked out the math of how these two teachers perform as you scale up the robot's size and study time. They found two magical effects that make signSGD special:

1. The "Self-Adjusting Compass" (Drift-Normalization)

Imagine you are walking toward a treasure.

  • SGD takes steps proportional to how wrong it is. If you are far away, you take big steps; if you are close, you take small steps. But if the terrain is bumpy (noisy), you might overshoot the treasure.
  • signSGD has a magic compass. When the robot is far from the answer (high error), the compass makes the steps feel "heavier" and more effective. When the robot is close, it naturally slows down.
  • The Result: This self-adjusting mechanism allows signSGD to learn faster in certain scenarios, especially when the data is messy.
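
The "compass" is just arithmetic: each signSGD step averages sign(true gradient + noise), and that average shrinks on its own as the true gradient fades relative to the noise. A minimal sketch (the gradient and noise values below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
noise_std = 1.0  # how "bumpy" the terrain is

avg_steps = {}
for true_grad in [5.0, 1.0, 0.1]:
    # Stochastic gradient = true gradient + minibatch noise.
    g = true_grad + noise_std * rng.normal(size=100_000)
    # Average direction signSGD actually moves, per unit learning rate:
    avg_steps[true_grad] = np.sign(g).mean()
print(avg_steps)
```

Far from the treasure (large true gradient), nearly every sample agrees on the direction, so the average step is close to the full step size. Near the treasure, the noise flips the sign at random, and the average step shrinks automatically; no one had to tune anything.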

2. The "Noise Filter" (Noise-Shaping)

This is the most surprising part.

  • SGD is like trying to hear a whisper in a storm. The louder the storm (noise), the harder it is to hear. If you turn up the volume (learning rate) to hear better, the storm gets even louder, drowning out the signal.
  • signSGD is like wearing noise-canceling headphones. Because it only listens to the direction (sign) and ignores the volume of the error, the "storm" of noise doesn't get louder just because you turn up the volume.
  • The Result: In situations where the data is very noisy, signSGD can actually get better results by using a larger learning rate, whereas SGD would get confused. This allows signSGD to reach a lower error rate (a "steeper slope") than SGD in specific conditions.
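
The headphones analogy can also be checked numerically. In this illustrative sketch (numbers are placeholders), the random "kick" SGD receives each step grows with the noise level, while signSGD's kick is capped by the learning rate no matter how loud the noise gets:

```python
import numpy as np

rng = np.random.default_rng(2)
lr = 0.1

kicks = {}
for noise_std in [0.1, 1.0, 10.0]:
    noise = noise_std * rng.normal(size=100_000)  # pure gradient noise, no signal
    sgd_kick = np.std(lr * noise)                 # grows in proportion to the noise
    sign_kick = np.std(lr * np.sign(noise))       # stays near lr, however loud the noise
    kicks[noise_std] = (sgd_kick, sign_kick)
print(kicks)
```

For SGD, cranking up the learning rate amplifies the storm along with the signal. For signSGD, the per-step perturbation is bounded by the learning rate itself, which is why a larger learning rate can still pay off in noisy regimes.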

The Secret Sauce: The "Warmup-Stable-Decay" Schedule

The paper also looked at how to change the teacher's instructions over time. They tested a strategy called Warmup-Stable-Decay (WSD), which is like a study plan:

  1. Warmup: Start slow to get the robot comfortable.
  2. Stable: Keep a steady pace while it learns the core concepts.
  3. Decay: Slowly reduce the speed at the end to fine-tune the details.

They found that when the data has a specific structure (features whose importance drops off quickly, the "fast feature decay" regime) and noise dominates, this schedule acts like a noise filter. The stable phase keeps the robot moving forward steadily, while the decay phase prevents the "storm" of noise from messing up the final details. This combination (signSGD + WSD) creates the most efficient path to the best possible performance for a given compute budget.
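
The three-phase study plan is easy to write down as a learning-rate function. This is a generic sketch of a warmup-stable-decay schedule; the peak learning rate and phase fractions here are illustrative placeholders, not the paper's tuned values:

```python
def wsd_lr(step, total_steps, peak_lr=0.01,
           warmup_frac=0.05, decay_frac=0.2):
    """Warmup-stable-decay schedule: ramp up, hold, then anneal to zero."""
    warmup_end = int(warmup_frac * total_steps)
    decay_start = int((1 - decay_frac) * total_steps)
    if step < warmup_end:
        return peak_lr * (step + 1) / warmup_end          # 1) linear warmup
    if step < decay_start:
        return peak_lr                                     # 2) stable plateau
    remaining = total_steps - step                         # 3) linear decay
    return peak_lr * remaining / (total_steps - decay_start)

# Sample the schedule at a few points over a 100-step run.
print([round(wsd_lr(s, 100), 4) for s in (0, 10, 50, 85, 99)])
```

The long stable phase lets signSGD keep making steady progress at a large learning rate, and the final decay quiets the remaining noise before training ends.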

The Bottom Line

In the world of training massive AI models:

  • SGD is the traditional, precise method.
  • signSGD is the modern, efficient method used by top AI labs.

This paper works out mathematically when signSGD comes out ahead. It shows that by ignoring the "volume" of mistakes and focusing only on the "direction," and by pairing that with a smart study schedule, we can build better AI models with the same amount of computing power, at least in the right conditions: noisy data with fast-decaying features. It's like realizing that sometimes, knowing which way to go is more important than knowing exactly how far you need to walk.
