Curriculum Reinforcement Learning from Easy to Hard Tasks Improves LLM Reasoning

This paper proposes E2H Reasoner, a curriculum reinforcement learning method that schedules tasks from easy to hard to improve the reasoning capabilities of small language models, offering both theoretical convergence guarantees and empirical evidence of superior performance compared to standard RL approaches.

Shubham Parashar, Shurui Gui, Xiner Li, Hongyi Ling, Sushil Vemuri, Blake Olson, Eric Li, Yu Zhang, James Caverlee, Dileep Kalathil, Shuiwang Ji

Published 2026-03-17

Imagine you want to teach a child how to become a master chess player.

The Old Way (Direct Learning):
You sit the child down in front of a grandmaster and say, "Watch this game. Now, you play." The child is overwhelmed. They don't know how the pieces move, let alone how to plan ten moves ahead. They get frustrated, make random moves, and never learn the deep strategies because the "reward" (winning) is so rare and far away. This is what happens when we try to teach AI models to solve hard problems (like advanced math or coding) all at once using standard Reinforcement Learning (RL). The AI gets stuck because the "correct answer" is too hard to find on its own.

The New Way (E2H Reasoner):
This paper introduces a method called E2H Reasoner (Easy-to-Hard Reasoner). It's like a personalized curriculum for AI. Instead of throwing the child into the grandmaster's game, you start them with:

  1. Trivial tasks: "Here is a pawn. Move it one square." (The AI learns the basics).
  2. Easy tasks: "Move your knight to attack a pawn." (The AI learns simple tactics).
  3. Medium tasks: "Plan a three-move sequence to checkmate." (The AI learns to think ahead).
  4. Hard tasks: Finally, "Play a full game against a grandmaster."
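The staged progression above can be pictured as a simple training loop. This is only an illustrative sketch, not the paper's implementation: `update_fn` stands in for whatever RL update you use (e.g. a policy-gradient step), and the stage names and function signature are hypothetical.

```python
# Illustrative sketch of an easy-to-hard curriculum loop.
# `update_fn` is a placeholder for a real RL update step; the stage names
# and structure are hypothetical assumptions, not the paper's API.
def curriculum_train(model, tasks_by_stage, update_fn, steps_per_stage=100):
    """Train on progressively harder task pools, one stage at a time."""
    for stage in ["trivial", "easy", "medium", "hard"]:
        pool = tasks_by_stage[stage]
        for step in range(steps_per_stage):
            # Cycle through the current stage's task pool.
            update_fn(model, pool[step % len(pool)])
    return model
```

In this hard-switch form, each stage fully replaces the last; the paper's key refinement, described next, is to *blend* stages with gradually shifting weights instead.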

The magic isn't just what you teach, but how you schedule it.

The Secret Sauce: The "Fading" Teacher

The paper discovered a crucial insight: You can't just teach easy things forever.

If you keep the AI practicing only on "move a pawn" tasks, it gets really good at moving pawns but forgets how to play the whole game. It starts taking shortcuts (like guessing the answer) just to get a reward, a behavior called "reward hacking."

The E2H method uses a smart scheduler (a digital teacher) that acts like a fading spotlight:

  • At the start: The spotlight shines brightly on the easy tasks. The AI builds confidence and learns the core rules.
  • In the middle: The teacher slowly dims the light on the easy tasks and brightens the light on the harder ones.
  • At the end: The easy tasks are almost gone, and the AI is fully focused on the hard challenges.

The paper tested two types of "teachers" (schedulers):

  1. The Cosine Teacher (E2H-C): Smoothly and gradually shifts focus from easy to hard, like a sunset. This works well for subjects where the AI is already decent (like some math problems).
  2. The Gaussian Teacher (E2H-G): This one is more aggressive. It gives the AI a quick crash course on the basics, then immediately pushes it toward the hard problems to keep it from getting lazy or overconfident on the easy ones. This works better for very difficult planning tasks.
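A rough way to picture the two teachers: each assigns a weight (the "spotlight brightness") to easy tasks at training step `t`. The exact formulas and parameter values below are illustrative assumptions for intuition, not the paper's definitions.

```python
import math

def cosine_weight(t, T):
    # E2H-C-style sketch: easy-task weight decays smoothly from 1 to 0
    # over T steps, like a sunset. (Illustrative, not the paper's formula.)
    return 0.5 * (1 + math.cos(math.pi * t / T))

def gaussian_weight(t, T, peak_frac=0.1, width_frac=0.1):
    # E2H-G-style sketch: a quick "crash course" -- the easy-task weight
    # peaks early, then fades fast, pushing training toward hard tasks
    # sooner. peak_frac and width_frac are made-up defaults.
    peak, width = peak_frac * T, width_frac * T
    return math.exp(-((t - peak) ** 2) / (2 * width ** 2))
```

In a sketch like this, the hard-task weight at each step could simply be `1 - easy_weight`, so the teacher's total attention always sums to one.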

Why This Matters (The Results)

The researchers tested this on small AI models (think smartphones compared to supercomputers).

  • Without this method: The small AI would fail at hard math problems or complex planning games. It would just guess or give up.
  • With E2H: The small AI learned to solve problems it was previously "too dumb" to handle. It didn't just memorize answers; it learned reasoning principles (like how to break a big problem into small steps) that it could apply to new, unseen challenges.

The Bottom Line

Think of this paper as the "Training Wheels to Racing Bike" guide for AI.

Previously, we tried to put AI on a racing bike immediately. It fell over and got hurt. This paper says, "No, let's start with training wheels (easy tasks), then raise them off the ground (medium tasks), and finally take them off entirely (hard tasks)."

By carefully controlling the difficulty and knowing exactly when to stop helping with the easy stuff, we can teach even small, simple AI models to think like geniuses. It's not about making the AI bigger; it's about teaching it better.
