DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

Here is an explanation of the paper "DeReason" using simple language and creative analogies.

The Big Idea: Teaching a Student the Right Way

Imagine you are trying to teach a brilliant but inexperienced student (an AI model) how to solve complex problems in science, math, and history. You have two main tools to help them learn:

The Textbook Method (SFT): You give the student a list of problems and their correct answers. They study these, memorize the patterns, and learn the facts. This is fast and efficient for building a strong foundation.
The Trial-and-Error Method (RL): You give the student a problem and a scoreboard. They try to solve it. If they get it right, they get a point. If they get it wrong, they get zero. They have to guess, fail, try again, and eventually figure out the logic on their own. This is great for learning how to think, but it's slow and frustrating if they don't know the basics.

The Problem:
For a long time, researchers thought the "Trial-and-Error" method (Reinforcement Learning) was the magic key to making AI smart at reasoning. They tried to throw the student straight into the deep end with the scoreboard, skipping the textbook.

The Discovery:
The authors of this paper ran an experiment and found something surprising: If you throw a student straight into the deep end without teaching them the basics first, they drown.

In general science and math (not just simple math puzzles), trying to learn purely by guessing and getting points was very inefficient. The student learned the facts much faster by just reading the textbook (SFT) first.

However, the textbook has a limit. It teaches the student what to do, but not necessarily how to think through a brand-new, super-hard problem that no one has solved before.

The Solution: "DeReason" (The Smart Syllabus)

The paper proposes a new strategy called DeReason. Instead of mixing all the problems together randomly, they split the training data into two piles based on difficulty and thinking intensity.

Think of it like a Personalized Gym Plan for the AI:

Phase 1: The Warm-up (SFT on "Easy" Stuff)

The Pile: Questions that require remembering facts or applying simple rules (e.g., "What is the capital of France?" or "Solve this basic algebra equation").
The Method: The AI reads the answers (Supervised Fine-Tuning).
The Analogy: This is like the student reading the textbook and memorizing the vocabulary and grammar rules. It's efficient. You don't need to guess the capital of France; you just need to know it.
Goal: Build a strong foundation of knowledge so the AI doesn't waste time guessing basic facts.

Phase 2: The Heavy Lifting (RL on "Hard" Stuff)

The Pile: Questions that require deep, multi-step reasoning, logic chains, and creative problem-solving (e.g., "Derive a new physics formula" or "Solve a complex logic puzzle").
The Method: The AI tries to solve these on its own, gets feedback, and learns to think (Reinforcement Learning).
The Analogy: Now that the student knows the vocabulary, you put them in a debate club or a chess tournament. They have to use what they know to navigate complex, unpredictable situations.
Goal: Teach the AI how to think, not just what to know.

Why This Works Better Than the Old Way

Before this, people often just threw all the problems (easy and hard) into a big bucket and let the AI learn them in a random order, or they tried to teach everything using only one method.

The "DeReason" approach is like a smart coach:

Don't waste time: Don't make the AI guess the answer to a simple fact question (that's a waste of the "guessing" method). Just teach it the fact.
Don't overwhelm: Don't make the AI try to solve a Nobel Prize-level physics problem before it knows basic algebra.
The Result: By splitting the data, the AI learns the basics quickly (via SFT) and then uses its "thinking muscles" to master the hardest challenges (via RL).

The Evidence

The researchers tested this on various benchmarks (like tough science exams and math competitions).

Pure Guessing (RL only): The AI struggled, especially on general science topics.
Pure Memorization (SFT only): The AI was good at facts but couldn't handle the hardest, most complex reasoning tasks.
DeReason (The Hybrid): The AI got the best of both worlds. It knew the facts and could think through complex problems, beating all the previous methods.

In a Nutshell

DeReason is a training strategy that says: "Teach the student the facts first, then teach them how to think."

It realizes that not all problems are the same. Some problems need a library card (SFT), and some need a thinking cap (RL). By sorting the problems and using the right tool for the right job, we can build smarter, more capable AI models much faster.

Here is a detailed technical summary of the paper "DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning."

1. Problem Statement

While Reinforcement Learning with Verifiable Rewards (RLVR) has proven highly effective for eliciting reasoning capabilities in Large Language Models (LLMs) within narrow domains like mathematics and coding, its application to general STEM (Science, Technology, Engineering, and Mathematics) reasoning remains challenging.

The Gap: Recent trends suggest RL is superior to Supervised Fine-Tuning (SFT) for reasoning. However, in general STEM domains, the interplay between SFT and RL is underexplored.
The Challenge: Applying RL directly to base models in general STEM domains is highly sample-inefficient and consistently underperforms SFT on moderate-quality responses.
The Core Question: Given that SFT and RL play complementary roles, how should training data be optimally allocated between these two stages to maximize performance in general reasoning tasks?

2. Methodology: DeReason

The authors propose DeReason, a difficulty-based data decoupling strategy. Instead of modifying the RL or SFT algorithms, DeReason operates at the data selection level to create a curriculum that matches the strengths of each training stage.

A. Core Hypothesis

SFT excels at efficient knowledge acquisition and distilling broad domain knowledge from teacher models.
RL excels at exploring complex reasoning paths and pushing performance boundaries on difficult problems where teacher demonstrations may be insufficient.

B. The DeReason Pipeline

The method partitions the training dataset $D$ into two subsets based on Reasoning Intensity:

Difficulty Estimation:
- An LLM (specifically an instruct model of the same size as the policy model, e.g., Qwen3-4B-Instruct) scores each problem on a scale of 1 to 5.
- Criteria: Scores consider the number of reasoning steps, prerequisite domain knowledge, and potential for error.
- Scoring Logic:
  - Low Scores (1-3): Problems requiring knowledge recall or straightforward fact application.
  - High Scores (4-5): Problems requiring multi-step derivation and complex reasoning.
Data Partitioning:
- $D_{SFT}$ (Easy/Broad): Problems with difficulty scores $\le \tau$ (e.g., $\le 3$). These are allocated to the SFT stage.
- $D_{RL}$ (Hard/Focused): Problems with difficulty scores $> \tau$ (e.g., $\ge 4$). These are reserved for the RL stage.
Curriculum Training:
- Stage 1 (SFT): Train on $D_{SFT}$ using responses from a moderate-capability teacher model to establish foundational domain knowledge.
- Stage 2 (RL): Initialize the RL training (using GRPO) from the SFT checkpoint and train on $D_{RL}$ to cultivate complex reasoning capabilities.
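
The data partitioning step above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Problem` class, its `difficulty` field, and the `partition_by_difficulty` function are hypothetical names, and the scores are assumed to have already been produced by the LLM scorer.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Problem:
    """A training problem with an LLM-assigned difficulty score (1 to 5)."""
    question: str
    difficulty: int  # hypothetical field holding the scorer's output


def partition_by_difficulty(
    problems: Iterable[Problem], tau: int = 3
) -> Tuple[List[Problem], List[Problem]]:
    """Split problems into (d_sft, d_rl) using the threshold tau.

    Scores <= tau go to the SFT pool; scores > tau go to the RL pool.
    """
    d_sft: List[Problem] = []
    d_rl: List[Problem] = []
    for p in problems:
        if not 1 <= p.difficulty <= 5:
            raise ValueError(f"difficulty out of range: {p.difficulty}")
        (d_sft if p.difficulty <= tau else d_rl).append(p)
    return d_sft, d_rl


# Toy data; the questions and scores are made up for illustration.
problems = [
    Problem("State Newton's second law.", 1),
    Problem("Balance a simple chemical equation.", 3),
    Problem("Derive the period of a physical pendulum.", 4),
    Problem("Prove a bound on a recursive sequence.", 5),
]
d_sft, d_rl = partition_by_difficulty(problems, tau=3)
```

With $\tau = 3$, the first two toy problems land in the SFT pool and the last two in the RL pool. The two pools are disjoint by construction, which is what makes the training "decoupled": no problem is seen by both stages.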

3. Key Contributions

Systematic Analysis of SFT vs. RL: The paper provides controlled experiments demonstrating that for small models in general STEM domains, SFT consistently outperforms pure RL when trained on the same data. Pure RL is shown to be sample-inefficient without prior SFT.
DeReason Curriculum: A novel, decoupled training strategy that partitions data by difficulty. It demonstrates that SFT on easy/broad data followed by RL on selected hard data significantly outperforms:
- Pure SFT.
- Pure RL.
- Randomly split SFT-then-RL baselines.
Behavioral Analysis: The authors provide fine-grained insights into training dynamics, including:
- Policy Entropy: SFT narrows the policy space early; RL from a base model converges to a more deterministic policy than SFT-initialized RL.
- Response Length: RL acts as a compression mechanism for SFT-initialized models, shortening verbose outputs while preserving quality.
- Reward Optimization: SFT checkpoints start with higher initial rewards, whereas base models show rapid initial gains that plateau.
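
To make the policy entropy observation concrete, here is a small sketch of how entropy is commonly computed from next-token probabilities. It uses the standard Shannon entropy formula; the function names and the toy distributions are assumptions for illustration and do not come from the paper.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability tokens contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Average the per-token entropy over all generation steps."""
    return sum(token_entropy(d) for d in step_distributions) / len(step_distributions)


# Toy distributions over a 4-token vocabulary (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain policy
peaked = [0.97, 0.01, 0.01, 0.01]   # nearly deterministic policy

high = token_entropy(uniform)
low = token_entropy(peaked)
```

A policy that "narrows" or becomes "more deterministic" is one whose distributions look more like `peaked` than `uniform`, so its mean entropy falls toward zero.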

4. Experimental Results

Experiments were conducted on Qwen3-4B models using the WebInstruct-Verified and Webscale-RL datasets, and evaluated on benchmarks including MMLU-Pro, GPQA-Diamond, SuperGPQA, BBEH, AIME, and MATH500.

General STEM Performance:
- SFT-only significantly outperformed RL-only across all benchmarks.
- DeReason (SFT Easy + RL Hard) achieved the best overall performance, surpassing all baselines.
- Example (WebInstruct-Verified, 4B Model): DeReason achieved an average score of 43.8, compared to 41.8 for SFT-only and 37.6 for RL-only.
Mathematical Performance:
- Similar trends were observed on math benchmarks (AIME24, AIME25, MATH500). DeReason achieved the highest scores (e.g., 88.1 on MATH500 with Webscale-RL data), outperforming both pure SFT (87.5) and pure RL (81.6).
Benchmark Specifics:
- On easy benchmarks (e.g., MMLU-Pro), the gap between DeReason and SFT-only was small.
- On hard benchmarks requiring complex reasoning (e.g., BBEH, GPQA-Diamond), DeReason showed clear and significant improvements over all baselines, validating the hypothesis that RL is crucial for the "reasoning frontier."
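
As a quick check on the size of the gains, the averages quoted above for WebInstruct-Verified can be compared directly. The snippet below only restates the three numbers from this summary; the helper name `gains_over_baselines` is hypothetical.

```python
from typing import Dict

# Average scores as quoted above (WebInstruct-Verified, 4B model).
scores: Dict[str, float] = {"DeReason": 43.8, "SFT-only": 41.8, "RL-only": 37.6}


def gains_over_baselines(scores: Dict[str, float], method: str = "DeReason") -> Dict[str, float]:
    """Absolute point gain of `method` over every other entry."""
    ref = scores[method]
    # Round to one decimal to match the precision of the reported scores.
    return {name: round(ref - s, 1) for name, s in scores.items() if name != method}


gains = gains_over_baselines(scores)
```

On these figures DeReason sits 2.0 points above SFT-only and 6.2 points above RL-only, so most of the gap to pure RL is already closed by SFT alone.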

5. Significance and Implications

Paradigm Shift: The work challenges the notion that "RL is always better than SFT" for reasoning. It establishes that for general domains, SFT is an indispensable cold-start mechanism for knowledge acquisition.
Data-Centric Optimization: DeReason's results suggest that how data is allocated between stages matters as much as the algorithm itself. By decoupling data based on reasoning intensity, the method leverages the efficiency of SFT for knowledge acquisition and the exploratory power of RL for complex reasoning.
Orthogonality: Since DeReason operates purely at the data selection level, it is orthogonal to algorithmic improvements (like new RL algorithms or loss functions) and can be easily integrated into existing training pipelines (e.g., VeRL, Llama-Factory).
Generalization: The approach offers a scalable, generalized post-training recipe for general STEM reasoning, moving beyond the narrow success of math/code-only RL.

In conclusion, DeReason provides a principled framework for multi-stage LLM training, demonstrating that a difficulty-aware curriculum is essential for unlocking the full reasoning potential of models in complex, general scientific domains.