Learn Hard Problems During RL with Reference Guided Fine-tuning

This paper introduces Reference-Guided Fine-Tuning (ReGFT), a method that synthesizes model-aligned positive trajectories using partial human reference solutions to overcome reward sparsity and significantly enhance the performance and training efficiency of reinforcement learning for mathematical reasoning.

Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai

Published 2026-03-06

Imagine you are trying to teach a brilliant but inexperienced student how to solve the world's most difficult math puzzles. You want them to learn by trial and error (Reinforcement Learning), where they get a gold star only if they get the answer right.

Here is the problem: If the puzzle is too hard, the student might try 1,000 times and get zero gold stars. They get frustrated, stop trying, and learn nothing. This is called the "Reward Sparsity" problem.

The paper introduces a clever new method called ReGFT (Reference-Guided Fine-Tuning) to fix this. Here is how it works, broken down into simple analogies.

1. The Old Way: The "Copy-Paste" Trap

Usually, when students are stuck, teachers give them the full solution.

  • The Mistake: If you just give the student the full answer key and say, "Memorize this," they might be able to recite it back. But if you give them a new puzzle that looks slightly different, they fail. Why? Because they didn't learn how to think; they just memorized the text. They are "out of their depth."

2. The Previous "Smart" Way: ReFT (Reinforced Fine-Tuning)

Researchers tried a better method: "Only study the solutions you figured out yourself."

  • How it works: The student tries to solve the puzzle. If they get it right, they study that solution. If they get it wrong, they ignore it.
  • The Problem: For the really hard puzzles, the student gets zero correct answers. They have nothing to study. They are still stuck in the "zero gold star" zone.
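The "only study what you solved yourself" loop above is essentially rejection sampling, and the code makes the failure mode obvious. This is a minimal sketch, not the paper's implementation; `sample_solution` and `is_correct` are hypothetical stand-ins for an LLM sampling call and an answer checker.

```python
def reft_collect(sample_solution, is_correct, problems, samples=8):
    """Sketch of a ReFT-style loop: sample solutions, keep only correct ones.

    `sample_solution` and `is_correct` are hypothetical stand-ins for a
    real LLM sampling call and an answer verifier.
    """
    kept = []
    for prob in problems:
        for _ in range(samples):
            sol = sample_solution(prob)
            if is_correct(prob, sol):      # gold star: keep this solution
                kept.append((prob, sol))
        # On hard problems every sample is wrong, so nothing is kept for
        # that problem: zero training data, zero learning signal.
    return kept
```

Note that the loop silently produces no data at all for problems the model never solves, which is exactly the "zero gold star" zone.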

3. The New Way: ReGFT (The "Halfway House" Guide)

This is the paper's big idea. Instead of giving the full answer or asking the student to solve it completely alone, the teacher gives a hint.

The Analogy: The Hiking Guide
Imagine the student is trying to hike up a steep, foggy mountain (the hard math problem).

  • The Full Solution: The teacher draws a map of the entire mountain and says, "Here is the path." The student just traces the line. They don't learn how to navigate the fog.
  • The "Do It Yourself" Approach: The teacher says, "Go find the top." The student wanders in circles, gets lost, and never reaches the top.
  • ReGFT (The New Method): The teacher says, "I will walk with you for the first 80% of the hike and show you the path that far. But then, you must take over and figure out the rest of the way to the summit on your own."

How ReGFT Works in Practice

  1. The Setup: The computer (the student) is given a hard math problem.
  2. The Hint: The computer is shown the first 80% of the human-written solution. This acts as a "guidepost."
  3. The Challenge: The computer is told: "You know the start. Now, use your own brain to figure out the final 20% and the answer."
  4. The Result: Because the computer had a head start, it is much more likely to solve the problem correctly.
  5. The Learning: The computer now has a correct solution that it actually generated itself (based on the hint). It studies this solution.
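The five steps above can be sketched as a small data-collection loop. This is a sketch under assumptions, not the paper's code: `model_complete` and `check_answer` are hypothetical stand-ins for an LLM sampling call and an answer verifier, and the 80% cut is applied naively by character count.

```python
def regft_collect(model_complete, check_answer, problems,
                  prefix_frac=0.8, samples=8):
    """Sketch of ReGFT-style data collection.

    Show the model the first `prefix_frac` of a human reference solution,
    let it generate the rest itself, and keep only completions that reach
    the correct answer. The kept solutions are model-generated (given the
    hint), so they can be used for fine-tuning in the model's own style.
    """
    sft_data = []
    for prob in problems:
        # Step 2 (the hint): truncate the reference to its first 80%.
        ref = prob["reference_solution"]
        prefix = ref[:int(len(ref) * prefix_frac)]
        for _ in range(samples):
            # Step 3 (the challenge): the model writes the final 20% itself.
            completion = model_complete(prob["question"], prefix)
            # Steps 4-5: keep only verified-correct, self-finished solutions.
            if check_answer(prob["question"], prefix + completion):
                sft_data.append({"question": prob["question"],
                                 "solution": prefix + completion})
    return sft_data
```

Compared with the ReFT loop, the only change is the reference prefix in the prompt, which is what turns "zero correct samples" on hard problems into a usable training set.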

Why This is a Game-Changer

  • It fills the "Zero Reward" gap: Before, the computer would get stuck on hard problems and get no feedback. Now, with the hint, it can solve them and get a "gold star."
  • It keeps the student's style: Because the computer had to finish the work itself, the solution sounds like its own thinking, not like a robot copying a human. This makes it easier for the computer to generalize to new problems later.
  • It supercharges the next step: Once the computer has learned this way, it is much better at the "Trial and Error" phase (Reinforcement Learning). It starts with a higher skill level, learns faster, and eventually becomes a math genius.

The Bottom Line

The paper shows that if you want an AI to learn hard things, you can't just let it flail in the dark (pure trial and error), and you can't just force it to memorize answers (pure copying).

You have to be a smart coach: Give the AI a nudge in the right direction, let it do the heavy lifting, and then let it learn from its own success. This simple trick turns "unsolvable" problems into "solvable" ones, unlocking the AI's true potential.