On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

This paper proposes Dynamic Fine-Tuning (DFT), a theoretically motivated method that rectifies the problematic reward structure inherent in standard Supervised Fine-Tuning by dynamically rescaling token probabilities. The result is significantly improved generalization across diverse tasks, offering a streamlined alternative to reinforcement learning.

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang

Published 2026-03-02

The Big Picture: The "Copycat" vs. The "Coach"

Imagine you are teaching a student (the AI) how to solve complex math problems.

Standard SFT (Supervised Fine-Tuning) is like giving the student a stack of answer keys and saying, "Memorize these exact solutions."

  • The Problem: The student becomes a master at copying the specific answers in the book. But if you give them a slightly different problem or a harder one they haven't seen, they freeze. They have "memorized" the data but haven't truly "learned" the logic. In the paper's terms, they overfit.

RL (Reinforcement Learning) is like giving the student a coach who says, "Try solving this. If you get it right, you get a gold star. If you get it wrong, try again."

  • The Benefit: The student learns the strategy to solve problems, not just the answers. They generalize better.
  • The Downside: This takes a huge amount of time, energy, and a very smart coach (a reward system) to run. It's expensive and slow.

The Paper's Goal: The authors wanted to find a way to make the "Copycat" method (SFT) work as well as the "Coach" method (RL), without needing the expensive coach. They found a tiny, one-line fix that changes how the student learns.


The Hidden Flaw: The "Shouting" Teacher

The authors realized there is a mathematical glitch in how standard SFT works.

Imagine the teacher is grading the student.

  • If the student is confident and gets the answer right, the teacher gives a gentle "Good job."
  • But if the student is unsure (low probability) and still gets the answer right (because it's the correct answer in the book), the teacher starts screaming.

Why? In the math of SFT, the "reward" for a correct answer is inversely proportional to how likely the student thought it was.

  • High Confidence + Correct: Small reward (gentle nudge).
  • Low Confidence + Correct: Massive reward (screaming nudge).
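The inverse relationship above falls directly out of the cross-entropy loss: the gradient of the per-token loss -log(p) scales like 1/p, so the implicit "reward" weight explodes as confidence drops. A minimal Python sketch of that weighting (the function name and probability values are illustrative, not from the paper):

```python
def sft_grad_scale(p):
    """Magnitude of d/dp[-log(p)], the per-token SFT loss.

    The update on a correct token scales like 1/p: the less
    likely the model thought the token was, the harder it is
    pushed -- the "screaming" teacher.
    """
    return 1.0 / p

print(sft_grad_scale(0.9))   # confident token: gentle nudge (~1.11)
print(sft_grad_scale(0.01))  # unsure token: ~100x larger update
```

Note how a token the model assigned 1% probability gets pushed roughly 90 times harder than one it assigned 90%, which is the source of the instability.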

The Analogy: Imagine a student who is terrified of a specific math concept. They guess the right answer by accident. The teacher screams, "YES! THAT'S RIGHT!" so loudly that the student's brain shakes. The student learns to fear that specific concept and overreacts to it, rather than understanding it calmly. This causes the training to become unstable and the student to memorize the "screaming" moments rather than the logic.

The Solution: "Dynamic Fine-Tuning" (DFT)

The authors proposed a simple fix called Dynamic Fine-Tuning (DFT).

The Fix: They told the teacher: "Stop screaming at the unsure students. Just give everyone a calm, steady 'Good job' regardless of how confident they were."

In technical terms, they multiplied the learning signal by the student's own confidence level.

  • If the student was unsure (low probability), the "scream" is dampened.
  • If the student was confident, the signal stays normal.

The Result: The "screaming" stops. The student learns to focus on the logic of the solution rather than reacting to the intensity of the teacher's reaction. This makes the learning process stable and helps the student handle new, difficult problems they haven't seen before.
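In code, the fix amounts to a one-line rescaling of the token loss by the model's own probability, treated as a constant (detached) during backpropagation. A minimal Python sketch contrasting the two losses and their gradient weights (function names are illustrative; the paper works with log-probabilities in a deep-learning framework):

```python
import math

def sft_loss(p):
    """Standard SFT: cross-entropy on the correct token."""
    return -math.log(p)

def dft_loss(p):
    """DFT: the same loss rescaled by the token's own probability
    (the p factor is detached, i.e. treated as a constant)."""
    return -p * math.log(p)

def sft_grad_scale(p):
    # d/dp[-log p] = -1/p: unsure tokens get huge updates.
    return 1.0 / p

def dft_grad_scale(p):
    # With the p factor detached, d/dp[-p_sg * log p] = -p_sg / p,
    # which equals 1 when p_sg = p: every correct token gets the
    # same calm, steady nudge.
    return p * (1.0 / p)

for p in (0.9, 0.1, 0.01):
    print(p, sft_grad_scale(p), dft_grad_scale(p))
```

The multiplication by p exactly cancels the 1/p factor in the gradient, which is why the "screaming" on low-confidence tokens disappears while confident tokens are left essentially unchanged.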

What Happened When They Tried It?

The authors tested this "one-line change" on some of the hardest math and coding challenges available (like the Math Olympiad or coding competitions).

  1. Math Reasoning: The standard "Copycat" method (SFT) often got worse on hard problems. The new "DFT" method got significantly better, sometimes doubling the improvement.
  2. Code Generation: It worked great for writing code, helping the AI write cleaner, more logical programs.
  3. Visual Math: It even worked when the AI had to look at a picture of a math problem and solve it.
  4. The "Factual" Caveat: The authors noted one catch. If the task is just memorizing facts (like "What is the capital of France?"), the old "screaming" method (SFT) is actually fine. DFT is best for reasoning and logic, not just rote memory.

The "Aha!" Moment: Why It Works

The paper explains that by fixing this "screaming" issue, the AI stops trying to force itself to be perfect on every single tiny detail (like a comma or a connecting word).

The Analogy: Think of writing an essay.

  • Old SFT: Tries to make the student perfect at every word, including "the," "and," and "but." This wastes brainpower.
  • New DFT: Tells the student, "Don't worry so much about the small connecting words. Focus your energy on the main ideas and the logic."

This creates a "polarized" learning style: The AI becomes very confident on the important, meaningful parts of the answer and lets go of the trivial parts. This is exactly how humans learn to generalize!

Summary

  • The Problem: Standard AI training (SFT) is like a teacher who screams at students who guess right by accident, causing them to memorize answers instead of learning logic.
  • The Fix: A simple mathematical tweak (DFT) that calms the teacher down, treating all correct answers equally.
  • The Outcome: The AI learns faster, stays stable, and becomes much better at solving new, hard problems without needing expensive reinforcement learning.

It's a reminder that sometimes, the best way to improve a complex system isn't to add more tools, but to fix a tiny, hidden flaw in how it processes feedback.
