Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

The Big Idea: The "Tutor" and the "Coach"

Imagine you are trying to teach a very smart student (an AI) how to solve difficult math problems. You have two main tools to help them learn:

The Coach (Reinforcement Learning - RL): This coach watches the student try to solve problems. If the student gets it right, the coach gives a high-five (a reward). If they get it wrong, the coach says "try again." The student learns by practicing over and over, getting better at things they are already somewhat good at.
The Tutor (Supervised Fine-Tuning - SFT): This tutor sits down with the student and shows them the perfect, step-by-step solution to a problem. The student memorizes this specific way of thinking. This is great for learning brand-new concepts, but it requires a lot of time and high-quality examples.

The Problem: The "Echo Chamber"

The paper argues that if you only use the Coach (RL), the student hits a ceiling.

Why? The Coach only reinforces what the student already knows. If the student doesn't know how to solve a specific type of hard problem, they will keep failing, and the Coach just tells them to "try harder" using the same old methods. The student gets stuck in an "echo chamber," repeating the same mistakes or only getting slightly better at easy things.
The Result: The student becomes very fast at easy problems but can't break through to solve the hardest, most complex ones.

If you only use the Tutor (SFT), the student learns the specific answers but might become a "parrot." They memorize the steps but might fail if the problem looks slightly different (they lack generalization).

The Solution: ReLIFT (The Hybrid Strategy)

The authors created a new method called ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). Think of this as a smart training schedule that switches between the Coach and the Tutor depending on what the student needs right now.

Here is how ReLIFT works, step-by-step:

The Practice Session (RL): The student practices solving problems on their own. The Coach watches and gives rewards for correct answers.
The "Stuck" Detector: The system keeps an eye on the student. When the student encounters a problem they cannot solve at all (a "Hardest" question), the system flags it.
The Emergency Tutoring (SFT): Instead of letting the student struggle forever, the system pauses the practice. It grabs a "Hardest" question, finds a perfect solution (from a super-smart AI or a human expert), and gives it to the student as a Tutoring Session. The student learns the new pattern for this specific type of hard problem.
Back to Practice: Once the student has learned that new trick, they go back to practicing (RL) to solidify the skill and try to apply it to other problems.

Why This is a Game-Changer

Efficiency: You don't need to hire a tutor for every problem (which is expensive and slow). You only call the tutor when the student is truly stuck on the hardest stuff.
Breaking Limits: By bringing in new knowledge (the Tutor) exactly when the student hits a wall, the student can learn things that were previously impossible for them.
Better Results: In the paper's tests, ReLIFT reached the highest overall accuracy of the methods it was compared against. The student got better at math, needed less training compute than the other hybrid approach, and gave shorter, more concise answers.

A Real-World Analogy: Learning to Play Guitar

RL (The Coach): You play the guitar every day. You get better at the chords you already know. You get faster. But if you try to play a complex jazz solo you've never heard, you just keep hitting the wrong notes. You can't learn the solo just by practicing your old chords.
SFT (The Tutor): A master musician sits down and teaches you the exact notes for that jazz solo. You memorize it. Now you can play that one song perfectly.
ReLIFT: You practice your guitar (RL). Every time you hit a wall on a difficult jazz solo, you stop, get a master to show you the specific trick for that solo (SFT), and then you go back to practicing. Eventually, you can play the jazz solo and improvise on your own because you've learned the new patterns and practiced them.

The Bottom Line

The paper argues, with experimental support, that Reinforcement Learning is great for polishing what you already know, while Supervised Fine-Tuning is needed to learn something completely new. ReLIFT switches between "polishing" and "teaching new tricks" whenever the model hits questions it cannot solve, and the result is a more capable AI.

1. Problem Statement

Recent advancements in Large Language Model (LLM) reasoning, particularly through Reinforcement Learning with Verifiable Rewards (RLVR), have demonstrated significant improvements in complex reasoning tasks (e.g., math, code). However, the authors identify a critical limitation in current RL paradigms:

The "Echo Chamber" Effect: RL primarily optimizes based on the model's existing knowledge and generated trajectories. It reinforces behaviors the model already knows are likely to succeed but struggles to instill novel reasoning patterns or knowledge that lie beyond the base model's current capabilities.
Performance Plateau: RL excels at refining performance on questions within the model's current competence but often fails to solve "hardest" questions where the model has zero accuracy. Conversely, Supervised Fine-Tuning (SFT) can introduce new knowledge but often degrades performance on easier questions, increases response length unnecessarily, and suffers from poor Out-of-Distribution (OOD) generalization.
Data Inefficiency: Pure SFT requires massive amounts of high-quality demonstration data, which is expensive to curate.

The core research question is: How can RL and SFT be effectively combined to overcome the limitations of each, enabling models to learn new reasoning patterns for difficult problems while maintaining efficiency and generalization?

2. Methodology: ReLIFT

The authors propose ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning), a novel training framework that dynamically alternates between RL and targeted SFT based on the difficulty of the questions encountered during training.

Key Components:

Training Dynamics Analysis:
- The authors first analyzed the training dynamics of RL and SFT separately across four difficulty levels (Easy, Medium, Hard, Hardest).
- Findings: RL improves accuracy on Easy/Medium questions but fails to make progress on "Hardest" questions (where initial accuracy is 0). SFT is highly effective on "Hardest" questions but can degrade performance on easier ones and causes response lengths to balloon.
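One way to see why RL stalls on the "Hardest" bucket is to look at a group-normalized advantage of the kind GRPO uses. The sketch below is illustrative only: the function name, the group size of 8, and the `eps` constant are assumptions, not details taken from the paper. It shows that when every rollout for a question fails, every advantage is zero, so the policy gradient carries no signal for that question.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style).

    `eps` is a hypothetical stabilizer so that a zero-variance group
    does not divide by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A question the model sometimes solves: correct rollouts get a positive
# advantage and incorrect ones a negative advantage, so there is a signal.
medium = group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 1])

# A "Hardest" question: all rollouts fail, so all advantages are zero.
hardest = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 0])
```

Under these assumptions, more RL steps on a zero-accuracy question change nothing until at least one rollout succeeds, which matches the plateau described above.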
The ReLIFT Framework:
- Primary Loop (RL): The model undergoes standard RL training (using GRPO) on a batch of questions.
- Online Hard Question Identification: During the rollout phase, the system identifies "Hardest" questions where the model's accuracy is 0 (i.e., it fails to solve them).
- Online Data Collection: For these identified hard questions, high-quality Chain-of-Thought (CoT) solutions are obtained. These can be sourced from a stronger model (e.g., DeepSeek-R1) or human annotators. Incorrect CoTs are filtered out to ensure only correct $(q, s)$ pairs are kept.
- Interleaved Fine-Tuning (FT): These curated hard examples are stored in a buffer. Once the buffer reaches a predefined threshold ($M$), the training process pauses the RL step to perform a single step of Supervised Fine-Tuning on these specific hard examples.
- Entropy Regularization: To prevent the SFT step from overly constraining the model's exploratory behavior (a common issue in SFT), an entropy regularization term is added to the SFT loss function:
  $L_{FT}(\theta) = L_{CE}(\theta) + \alpha L_{Entropy}(\theta)$
  This encourages the model to maintain diversity in its outputs even while learning from demonstrations.
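The loss above can be sketched in plain Python for a single demonstration. This is a minimal illustration, not the authors' implementation: it assumes $L_{Entropy}$ is the negative mean token entropy (so that minimizing the total loss pushes entropy up, consistent with the stated goal of keeping outputs diverse), and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ft_loss(token_logits, target_ids, alpha=1e-4):
    """Return (L_FT, L_CE, L_Entropy), averaged over the tokens of one example."""
    ce_total, entropy_total = 0.0, 0.0
    for logits, target in zip(token_logits, target_ids):
        probs = softmax(logits)
        ce_total += -math.log(probs[target])
        entropy_total += -sum(p * math.log(p) for p in probs if p > 0)
    n = len(target_ids)
    l_ce = ce_total / n
    # Assumption: L_Entropy is the negative entropy, so lower loss = more diverse.
    l_entropy = -entropy_total / n
    return l_ce + alpha * l_entropy, l_ce, l_entropy

# Two tokens over a three-word vocabulary (toy values).
loss, l_ce, l_entropy = ft_loss([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 1])
```

With a small coefficient such as $10^{-4}$, the cross-entropy term dominates and the entropy term acts only as a mild counterweight against the distribution collapsing onto the demonstration.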
Adaptive Scheduling:
- The frequency of interleaving is adaptive. Early in training, when the model struggles with many hard problems, SFT is applied more frequently to establish baseline reasoning patterns. As the model improves, the focus shifts back to RL to refine existing skills.
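The control flow of the framework described above can be sketched as a small, deterministic simulation. Everything here is a simplification under stated assumptions: the function names are hypothetical, the RL and SFT updates are replaced by log entries, and the demonstration lookup is a placeholder string rather than a call to a stronger model. The sketch only shows how zero-accuracy questions fill a buffer and how reaching the threshold $M$ triggers one fine-tuning step.

```python
def is_hardest(rollout_rewards):
    """A question counts as 'Hardest' when no rollout earned a reward."""
    return not any(rollout_rewards)

def relift_schedule(batches, buffer_threshold_m):
    """Return the ordered list of training steps for a sequence of batches.

    Each batch maps a question id to the 0/1 rewards of its rollouts.
    """
    buffer, log = [], []
    for batch in batches:
        log.append("RL")  # placeholder for one RL update on this batch
        for question, rewards in batch.items():
            if is_hardest(rewards):
                # Placeholder: a verified (question, solution) pair would be
                # collected here from a stronger model or an annotator.
                buffer.append((question, f"<verified solution for {question}>"))
        if len(buffer) >= buffer_threshold_m:
            log.append(f"SFT on {len(buffer)} examples")
            buffer.clear()
    return log

# Early batches contain two zero-accuracy questions; later batches only one.
early = {"q1": [0, 0, 0, 0], "q2": [0, 0, 0, 0], "q3": [1, 0, 1, 1]}
late = {"q1": [0, 1, 0, 0], "q2": [0, 0, 0, 0], "q3": [1, 1, 1, 1]}
schedule = relift_schedule([early, early, late, late], buffer_threshold_m=4)
# schedule == ["RL", "RL", "SFT on 4 examples", "RL", "RL"]
```

In this toy run the adaptive behavior falls out of the buffer alone: the buffer fills quickly while many questions are unsolved, and fine-tuning steps become rarer as fewer questions have zero accuracy.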

3. Key Contributions

Systematic Analysis of RL vs. SFT: The paper provides empirical evidence that RL and SFT have complementary roles: RL refines existing skills on solvable problems, while SFT is essential for acquiring new knowledge to tackle problems beyond the model's current capabilities.
ReLIFT Framework: A novel, source-agnostic training strategy that interleaves RL with online fine-tuning on the most challenging examples. It dynamically addresses model weaknesses as they emerge without requiring a pre-collected massive dataset of demonstrations.
State-of-the-Art Performance: ReLIFT achieves new SOTA results on mathematical reasoning benchmarks while using significantly less demonstration data and computational resources compared to hybrid baselines.
Generalizability: The method is validated across different model scales (1.5B to 7B) and architectures (Qwen, LLaMA), demonstrating robust performance improvements.

4. Experimental Results

The authors evaluated ReLIFT using the Qwen2.5-Math-7B base model against various baselines (Pure SFT, Pure RL, RL w/ SFT loss, SFT then RL, LUFFY) on five math benchmarks (AIME 2024, AIME 2025, AMC, OlympiadBench, MATH500) and one OOD benchmark (MMLU-Pro).

Performance: ReLIFT achieved an overall accuracy of 52.6%, surpassing all baselines. It consistently ranked first or second on individual benchmarks.
Efficiency:
- Data: ReLIFT required only 8,640 demonstration samples (vs. 46,000 for SFT/RL baselines).
- Compute: It required 52 GPU hours (8x8 GPUs), significantly less than the 113.5 GPU hours required for "RL w/ SFT loss" and comparable to pure RL, despite the added SFT steps.
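To put the efficiency figures in proportion, the ratios can be computed directly from the numbers quoted above. The variable names are illustrative; the four values are the ones stated in this summary.

```python
relift_demos, baseline_demos = 8_640, 46_000
relift_gpu_hours, rl_sft_loss_gpu_hours = 52.0, 113.5

demo_fraction = relift_demos / baseline_demos
compute_fraction = relift_gpu_hours / rl_sft_loss_gpu_hours

print(f"Demonstrations: {demo_fraction:.1%} of the baseline count")  # 18.8%
print(f"GPU hours: {compute_fraction:.1%} of RL w/ SFT loss")        # 45.8%
```

So ReLIFT used a little under a fifth of the demonstration data and a little under half of the GPU hours reported for "RL w/ SFT loss".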
Response Quality: ReLIFT generated significantly more concise solutions (average length ~3,500 tokens) compared to SFT (~5,500 tokens) and SFT-then-RL approaches, indicating better efficiency in reasoning.
Ablation Studies:
- Removing the "hardest question" selection (using random or uniform scheduling) led to performance drops, confirming the importance of targeting specific weaknesses.
- The entropy coefficient ($\alpha$) was crucial; an optimal value of $10^{-4}$ balanced learning new patterns with maintaining exploration.
Generalization: ReLIFT showed superior performance on OOD benchmarks (ARC-Challenge, GPQA, HumanEval), outperforming pure RL and hybrid methods, which suggests it improves generalization rather than just memorization.

5. Significance

ReLIFT moves the training of reasoning models away from static, pre-defined hybrid schedules and toward dynamic, online adaptation.

Overcoming RL Limitations: It directly addresses the "knowledge ceiling" of RL by injecting external knowledge exactly when the model fails, effectively breaking the "echo chamber" effect.
Resource Efficiency: By only collecting demonstrations for the specific questions the model cannot solve, it drastically reduces the cost of data curation and training time.
Scalability: The framework is applicable to various model sizes and architectures, suggesting a scalable path toward developing more capable and general reasoning agents without the prohibitive costs of massive SFT datasets.

In conclusion, ReLIFT demonstrates that the most effective path to advanced reasoning is not choosing between RL or SFT, but intelligently interleaving them to leverage the strengths of both: RL for exploration and refinement, and SFT for targeted knowledge injection on the hardest problems.
