SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning

The paper introduces SPEED-RL, an adaptive online curriculum learning method that selectively samples intermediate-difficulty prompts. Backed by both theory and experiments, it accelerates reinforcement learning training for reasoning models by 2x to 6x without compromising final accuracy.

Ruiqi Zhang, Daman Arora, Song Mei, Andrea Zanette

Published 2026-03-06

Imagine you are trying to teach a brilliant but very slow student how to solve complex math problems. You have a giant stack of practice worksheets, but you're doing it the old-fashioned way: you hand them a problem, they try to solve it, you check the answer, and then you move to the next one.

The problem? You are handing them everything at random.

  • Sometimes you give them a problem so easy (like "2 + 2") that they solve it instantly without learning anything new. It's a waste of time.
  • Sometimes you give them a problem so impossible (like advanced quantum physics) that they get frustrated, guess wildly, and learn nothing. It's also a waste of time.
  • Most importantly, you are wasting massive amounts of computer power (which costs money and time) just shuffling through this random pile of worksheets.

The New Approach: "SPEED-RL"

This paper introduces a smart new method called SPEED-RL. Think of it as a super-intelligent tutor who knows exactly which worksheet to give the student next.

Here is how it works, using a simple analogy:

1. The "Goldilocks" Zone

Instead of picking problems randomly, this tutor looks at the student's current skill level and picks a problem that is "just right" (not too easy, not too hard).

  • Too Easy: The student already knows it. No growth.
  • Too Hard: The student is lost. No growth.
  • Just Right: The student is challenged but can figure it out with a little effort. This is where the real learning happens.

In the paper, they call this the "Intermediate Difficulty" zone. It's like training for a marathon: you don't start by running a 100-mile race (too hard), and you don't just walk around the block (too easy). You run a distance that makes your legs burn just enough to get stronger.
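In code terms, this "just right" filter can be sketched as keeping only prompts whose empirical pass rate falls inside a middle band. The sketch below is a toy illustration, not the paper's actual algorithm: the `solve` function, the rollout count of 8, and the [0.25, 0.75] band are all illustrative assumptions.

```python
import random

def estimate_pass_rate(solve, prompt, n_rollouts=8, rng=random):
    # Roll out a few attempts and score them; the fraction solved
    # is a cheap empirical difficulty estimate for this prompt.
    return sum(solve(prompt, rng) for _ in range(n_rollouts)) / n_rollouts

def select_intermediate(prompts, solve, lo=0.25, hi=0.75, rng=random):
    # Keep only prompts in the "not too easy, not too hard" band.
    kept = []
    for prompt in prompts:
        rate = estimate_pass_rate(solve, prompt, rng=rng)
        if lo <= rate <= hi:
            kept.append(prompt)
    return kept

# Toy setup: each prompt carries a hypothetical true solve probability.
def toy_solve(prompt, rng):
    return rng.random() < prompt["p_true"]

prompts = [{"id": i, "p_true": p}
           for i, p in enumerate([0.0, 0.1, 0.5, 0.6, 1.0])]
rng = random.Random(0)
batch = select_intermediate(prompts, toy_solve, rng=rng)
print([q["id"] for q in batch])
```

Note that prompts the model always fails (pass rate 0) or always solves (pass rate 1) can never make it into the batch, which is exactly the waste the method avoids.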

2. The "Noise" Problem

When the student guesses on a problem that is too hard, the answer is basically random noise. It's like trying to hear a whisper in a hurricane. The computer is trying to learn from that "noise," which confuses the system and slows everything down.

By focusing only on the "just right" problems, the tutor ensures the signal is clear. The student's effort translates directly into a clear lesson, making the learning process 2 to 6 times faster.
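One standard way to see why the middle ground carries the most signal (this framing is a general fact about binary pass/fail rewards, not a formula quoted from the paper): if a prompt's pass rate is p, the per-rollout reward variance is p * (1 - p). It is zero at p = 0 and p = 1, where every rollout gives the same answer and there is nothing to learn from, and it peaks at p = 0.5.

```python
def reward_variance(p):
    # Variance of a Bernoulli pass/fail reward with pass rate p:
    # zero at the extremes, maximal at p = 0.5.
    return p * (1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"pass rate {p:.1f} -> signal {reward_variance(p):.2f}")
```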

3. No Manual Tuning

Usually, if you want to teach a student this way, you need a human expert to constantly watch and say, "Okay, they are ready for harder problems now." That takes forever.

SPEED-RL is like a self-driving car for training. It figures out the difficulty level automatically as it goes. It doesn't need a human to constantly adjust the knobs; it just adapts on the fly, making the whole process seamless.
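The self-adjusting behavior follows naturally from filtering by pass rate: as the model improves, the same filter automatically admits harder prompts, with no knob-turning. The toy below illustrates this with a hypothetical closed-form pass rate (`skill - difficulty + 0.5`, clamped to [0, 1]) that is purely an assumption for demonstration.

```python
def pass_rate(skill, difficulty):
    # Hypothetical model: harder prompts are solved less often,
    # and higher skill raises the pass rate across the board.
    return max(0.0, min(1.0, skill - difficulty + 0.5))

def goldilocks(skill, difficulties, lo=0.25, hi=0.75):
    # The same fixed band selects different prompts as skill changes.
    return [d for d in difficulties if lo <= pass_rate(skill, d) <= hi]

difficulties = [0.1 * i for i in range(11)]  # 0.0 (easy) .. 1.0 (hard)
for skill in (0.3, 0.6, 0.9):
    band = [round(d, 1) for d in goldilocks(skill, difficulties)]
    print(f"skill {skill}: training on difficulties {band}")
```

Running this shows the selected difficulty band sliding upward as skill grows: the curriculum adapts on its own, with the band thresholds never touched.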

The Bottom Line

This paper is about stopping the waste. By stopping the computer from practicing on problems it already knows or problems it can't possibly solve, and instead focusing only on the challenging-but-solvable middle ground, we can train AI models to be smarter much faster and much cheaper, without sacrificing how good they are at the end.

It's the difference between randomly throwing darts at a board and having a coach who tells you exactly where to aim to improve your score the quickest.
