What do near-optimal learning rate schedules look like?

This paper introduces a search procedure to identify near-optimal learning rate schedule shapes across various workloads, revealing that while warmup and decay are robust features, commonly used schedules are suboptimal and the ideal shape is significantly influenced by hyperparameters like weight decay.

Hiroki Naganuma, Atish Agarwala, Priya Kasimbeg, George E. Dahl

Published 2026-03-12

Imagine you are training a very smart, but slightly clumsy, robot to learn a new skill. The robot learns by taking steps, and the size of each step is determined by something called the learning rate.

If the robot takes steps that are too big, it stumbles and falls (the training fails). If the steps are too small, it moves so slowly it never finishes the task (the training is inefficient).

For years, scientists have known that the size of the steps matters most. But they've been arguing about the shape of the step-size plan. Should the robot start with tiny steps, get bigger, and then get smaller again? Should it start big and slowly shrink? Or should it just pick one size and stick with it?

This paper, titled "What do near-optimal learning rate schedules look like?", is like a massive experiment where the researchers built a robot that tries thousands of different "step-size plans" to see which one actually works best.

Here is the breakdown of their journey, using simple analogies:

1. The Problem: We're Guessing the Shape

Most people use a standard "recipe" for these step-size plans. The most popular recipe is called Cosine Decay. It looks like a smooth hill: you start small (warmup), go up to a peak, and then slowly slide down the other side (decay).
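If you like seeing shapes as code, here is a minimal Python sketch of that "smooth hill": a linear warmup followed by cosine decay. The function name and arguments are our own illustration, not the paper's exact parameterization.

```python
import math

def warmup_cosine(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay down to zero.

    Illustrative sketch of the standard 'hill' schedule; names and
    defaults are ours, not taken from the paper.
    """
    if step < warmup_steps:
        # Climb linearly from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Slide down the other side of the hill following a cosine curve.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Plotting `warmup_cosine(step, 1000, 1.0, 100)` over all steps draws exactly the hill described above: a short ramp up, a peak, then a long smooth slide to zero.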

But the researchers asked: "Is this hill the best shape? Or is there a weird, jagged, or wavy shape that would make the robot learn even faster?"

2. The Experiment: The "Shape Hunter"

To find the answer, they didn't just guess. They created a Search Procedure. Think of this as a super-fast robot that runs thousands of simulations on three different "training grounds":

  • Linear Regression: A simple math problem (like predicting house prices based on size).
  • CIFAR-10: A task where a computer learns to recognize pictures of cats, dogs, and cars.
  • WikiText-103: A task where a computer learns to write sentences like a human (language modeling).

They tested many different "families" of shapes:

  • The Constant: A flat line (same step size forever).
  • The Cosine: The classic smooth hill.
  • The Spline: A flexible rubber band that can be bent into almost any shape.
  • The Smooth Non-Monotonic: A wild card that can go up and down however it wants, with no rules.
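As a rough sketch of what two of these families look like in code: the Constant is trivially a flat line, and a Spline can be approximated as a piecewise-linear "rubber band" pulled through a handful of control points. (These are our own toy parameterizations for intuition; the paper's actual spline family is more flexible.)

```python
def constant(step, total_steps, lr):
    """Flat line: the same step size forever."""
    return lr

def spline(step, total_steps, knots):
    """Piecewise-linear 'rubber band' through a few control points.

    `knots` is a list of (fraction_of_training, lr) pairs, e.g.
    [(0.0, 0.0), (0.1, 1.0), (1.0, 0.0)] for a simple warmup + decay.
    Toy illustration only, not the paper's exact spline family.
    """
    t = step / total_steps
    for (t0, lr0), (t1, lr1) in zip(knots, knots[1:]):
        if t0 <= t <= t1:
            # Linearly interpolate between the two surrounding knots.
            return lr0 + (lr1 - lr0) * (t - t0) / (t1 - t0)
    return knots[-1][1]
```

The search procedure's job is essentially to move those knots around (or bend the curve freely, in the non-monotonic family) and see which resulting shape trains best.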

3. The Big Discoveries

A. The "Base Rate" is the King

The most important finding is that the size of the steps matters way more than the shape of the plan.

  • Analogy: Imagine you are driving a car. It doesn't matter if you have a fancy GPS route (the shape); if you are driving at 10 mph when you should be doing 60, you aren't going to get there fast.
  • Takeaway: Before you worry about making a fancy schedule, you must tune the "Base Learning Rate" (the peak speed). If you get that wrong, no amount of fancy shaping will save you.
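In practice, "tune the base rate first" usually means a simple sweep over a logarithmic grid of peak learning rates. Here is a hedged sketch, where `train_fn` is a hypothetical stand-in for a full training run that returns the final validation loss:

```python
import math

def best_base_lr(train_fn, candidates):
    """Sweep the peak learning rate and keep the one with lowest loss.

    `train_fn(lr)` is a hypothetical stand-in for a full training run;
    in real use it would train a model and return validation loss.
    """
    results = {lr: train_fn(lr) for lr in candidates}
    return min(results, key=results.get)

# Toy loss curve with a sweet spot near lr = 0.1, just to show the shape
# of the sweep. A log-spaced grid is the usual choice.
grid = [10 ** k for k in range(-4, 1)]  # 1e-4, 1e-3, ..., 1.0
toy_loss = lambda lr: (math.log10(lr) + 1.0) ** 2
print(best_base_lr(toy_loss, grid))  # picks 0.1 on this toy curve
```

Only after this coarse sweep finds a good peak does it make sense to start reshaping the curve around it.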

B. The "Warmup" and "Cool Down" are Real

Even when they let the "wild card" robot (the Smooth Non-Monotonic family) try to be weird and skip the rules, it still figured out that it needed a Warmup and a Cool Down.

  • The Warmup: Start with tiny steps to get your bearings, then speed up.
  • The Cool Down: As you get close to the finish line, slow down so you don't overshoot the target.
  • Why? It turns out these aren't just habits; they emerge naturally from the optimization dynamics. Even when nobody told the robot to warm up or cool down, the best-performing schedules chose to do both.

C. Simple vs. Complex Worlds

Here is where it gets interesting. The "best" shape depends on what the robot is learning:

  • In the Simple World (Linear Regression): The best plan was actually no warmup at all! It was a flat line of big steps, followed by a sudden, sharp drop to zero at the very end. It's like sprinting the whole race and then slamming on the brakes.
  • In the Complex World (Images and Language): The best plan was a gentle hill. You need the warmup to stabilize the chaotic learning process, and a slow, gentle decay to fine-tune the details.
  • Lesson: What works for simple math doesn't necessarily work for complex AI. Don't copy-paste rules from simple problems to complex ones.
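The "sprint then slam on the brakes" shape from the simple world is easy to write down. Here is a toy sketch (parameter names and the 95% drop point are our own illustration, not numbers from the paper):

```python
def flat_then_drop(step, total_steps, lr, drop_frac=0.95):
    """Hold a high constant rate, then drop sharply to zero at the end.

    Toy illustration of the 'sprint then brake' shape found for linear
    regression; drop_frac=0.95 is an assumed placement, not the paper's.
    """
    t = step / total_steps
    if t < drop_frac:
        return lr  # sprint: full speed for most of training
    # Slam the brakes: rapid linear drop to zero over the final stretch.
    return lr * (1.0 - t) / (1.0 - drop_frac)
```

Compare this with the gentle warmup-and-decay hill from the complex world: two very different "best" shapes, depending entirely on the problem.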

D. The "Weight Decay" Connection

They also found that the learning rate schedule is deeply connected to another setting called Weight Decay (a way to prevent the robot from memorizing the training data too perfectly).

  • Analogy: If you increase the "friction" (weight decay) on the robot's wheels, the best step-size plan changes. It turns out that if you have high friction, you should keep your steps big for longer before slowing down.
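To make the qualitative trend concrete, here is a toy schedule whose "hold the peak" phase stretches out as weight decay grows. The mapping from weight decay to hold length is entirely made up for illustration; only the direction of the effect (higher friction, longer hold) comes from the paper.

```python
def wd_aware_schedule(step, total_steps, lr, weight_decay):
    """Toy illustration: higher weight decay -> hold the peak longer.

    The formula for hold_frac is invented for this sketch; the paper
    reports the qualitative trend, not this specific relationship.
    """
    hold_frac = min(0.9, 0.5 + 5.0 * weight_decay)  # made-up mapping
    t = step / total_steps
    if t < hold_frac:
        return lr  # keep steps big while friction does its work
    return lr * (1.0 - t) / (1.0 - hold_frac)  # then decay to zero
```

The practical takeaway is narrower than the code suggests: if you change weight decay, re-check your schedule, because the two knobs are coupled.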

4. The Conclusion: What Should You Do?

The researchers concluded that while we can find "near-perfect" shapes using their search method, the gains are often small.

  1. Don't obsess over the shape: The difference between a perfect curve and a standard Cosine curve is often tiny.
  2. Do obsess over the Base Rate: Tuning the peak speed matters far more than tweaking the curve.
  3. Stick to the basics: A standard Warmup + Cosine Decay is a very strong, reliable strategy that works almost everywhere.
  4. Beware of the "Simple" trap: Don't assume a strategy that works on a simple math problem will work on a complex language model.

In a nutshell:
The paper is like a master chef testing thousands of recipes. They found that while there are some "secret sauce" shapes that are slightly better, the most important thing is to get the main ingredient (the base learning rate) right. And, surprisingly, even if you let the chef invent a completely new recipe from scratch, they will almost always end up adding a "starter" (warmup) and a "finisher" (decay), because those turn out to be essential to cooking the dish well.