Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

This paper proposes E2H Reasoner, a curriculum reinforcement learning method that schedules tasks from easy to hard to improve the reasoning capabilities of small language models, offering both theoretical convergence guarantees and empirical evidence of superior performance compared to standard RL approaches.

Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji

Published 2026-03-17

Imagine you want to teach a child how to become a master chess player.

The Old Way (Direct Learning):
You sit the child down in front of a grandmaster and say, "Watch this game. Now, you play." The child is overwhelmed. They don't know how the pieces move, let alone how to plan ten moves ahead. They get frustrated, make random moves, and never learn the deep strategies because the "reward" (winning) is so rare and far away. This is what happens when we try to teach AI models to solve hard problems (like advanced math or coding) all at once using standard Reinforcement Learning (RL). The AI gets stuck because the "correct answer" is too hard to find on its own.

The New Way (E2H Reasoner):
This paper introduces a method called E2H Reasoner (Easy-to-Hard Reasoner). It's like a personalized curriculum for AI. Instead of throwing the child into the grandmaster's game, you start them with:

  1. Trivial tasks: "Here is a pawn. Move it one square." (The AI learns the basics).
  2. Easy tasks: "Move your knight to attack a pawn." (The AI learns simple tactics).
  3. Medium tasks: "Plan a three-move sequence to checkmate." (The AI learns to think ahead).
  4. Hard tasks: Finally, "Play a full game against a grandmaster."
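The staged progression above can be pictured as a simple training loop. This is only an illustrative sketch, not the paper's implementation: `update_fn` stands in for whatever RL update you use (e.g. a policy-gradient step), and the stage names and function signature are hypothetical.

```python
# Illustrative sketch of an easy-to-hard curriculum loop.
# `update_fn` is a placeholder for a real RL update step; the stage names
# and structure are hypothetical assumptions, not the paper's API.
def curriculum_train(model, tasks_by_stage, update_fn, steps_per_stage=100):
    """Train on progressively harder task pools, one stage at a time."""
    for stage in ["trivial", "easy", "medium", "hard"]:
        pool = tasks_by_stage[stage]
        for step in range(steps_per_stage):
            # Cycle through the current stage's task pool.
            update_fn(model, pool[step % len(pool)])
    return model
```

In this hard-switch form, each stage fully replaces the last; the paper's key refinement, described next, is to *blend* stages with gradually shifting weights instead.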

The magic isn't just what you teach, but how you schedule it.

The Secret Sauce: The "Fading" Teacher

The paper discovered a crucial insight: You can't just teach easy things forever.

If you keep the AI practicing only on "move a pawn" tasks, it gets really good at moving pawns but forgets how to play the whole game. It starts taking shortcuts (like guessing the answer) just to get a reward, a behavior called "reward hacking."

The E2H method uses a smart scheduler (a digital teacher) that acts like a fading spotlight:

  • At the start: The spotlight shines brightly on the easy tasks. The AI builds confidence and learns the core rules.
  • In the middle: The teacher slowly dims the light on the easy tasks and brightens the light on the harder ones.
  • At the end: The easy tasks are almost gone, and the AI is fully focused on the hard challenges.

The paper tested two types of "teachers" (schedulers):

  1. The Cosine Teacher (E2H-C): Smoothly and gradually shifts focus from easy to hard, like a sunset. This works well for subjects where the AI is already decent (like some math problems).
  2. The Gaussian Teacher (E2H-G): This one is more aggressive. It gives the AI a quick crash course on the basics, then immediately pushes it toward the hard problems to keep it from getting lazy or overconfident on the easy ones. This works better for very difficult planning tasks.
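A rough way to picture the two teachers: each assigns a weight (the "spotlight brightness") to easy tasks at training step `t`. The exact formulas and parameter values below are illustrative assumptions for intuition, not the paper's definitions.

```python
import math

def cosine_weight(t, T):
    # E2H-C-style sketch: easy-task weight decays smoothly from 1 to 0
    # over T steps, like a sunset. (Illustrative, not the paper's formula.)
    return 0.5 * (1 + math.cos(math.pi * t / T))

def gaussian_weight(t, T, peak_frac=0.1, width_frac=0.1):
    # E2H-G-style sketch: a quick "crash course" -- the easy-task weight
    # peaks early, then fades fast, pushing training toward hard tasks
    # sooner. peak_frac and width_frac are made-up defaults.
    peak, width = peak_frac * T, width_frac * T
    return math.exp(-((t - peak) ** 2) / (2 * width ** 2))
```

In a sketch like this, the hard-task weight at each step could simply be `1 - easy_weight`, so the teacher's total attention always sums to one.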

Why This Matters (The Results)

The researchers tested this on small AI models (think smartphones compared to supercomputers).

  • Without this method: The small AI would fail at hard math problems or complex planning games. It would just guess or give up.
  • With E2H: The small AI learned to solve problems it was previously "too dumb" to handle. It didn't just memorize answers; it learned reasoning principles (like how to break a big problem into small steps) that it could apply to new, unseen challenges.

The Bottom Line

Think of this paper as the "Training Wheels to Racing Bike" guide for AI.

Previously, we tried to put AI on a racing bike immediately. It fell over and got hurt. This paper says, "No, let's start with training wheels (easy tasks), then raise them off the ground (medium tasks), and finally take them off entirely (hard tasks)."

By carefully controlling the difficulty and knowing exactly when to stop helping with the easy stuff, we can teach even small, simple AI models to think like geniuses. It's not about making the AI bigger; it's about teaching it better.
