On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

This paper proposes Dynamic Fine-Tuning (DFT), a theoretically motivated method that rectifies the problematic reward structure inherent in standard Supervised Fine-Tuning by dynamically rescaling token probabilities. The result is significantly improved generalization across diverse tasks, offering a streamlined alternative to reinforcement learning.

Yongliang Wu, Yizhou Zhou, Zhou Ziheng, Yingzhe Peng, Xinyu Ye, Xinting Hu, Wenbo Zhu, Lu Qi, Ming-Hsuan Yang, Xu Yang

Published 2026-03-02

The Big Picture: The "Copycat" vs. The "Coach"

Imagine you are teaching a student (the AI) how to solve complex math problems.

Standard SFT (Supervised Fine-Tuning) is like giving the student a stack of answer keys and saying, "Memorize these exact solutions."

  • The Problem: The student becomes a master at copying the specific answers in the book. But if you give them a slightly different problem or a harder one they haven't seen, they freeze. They have "memorized" the data but haven't truly "learned" the logic. In the paper's terms, they overfit.

RL (Reinforcement Learning) is like giving the student a coach who says, "Try solving this. If you get it right, you get a gold star. If you get it wrong, try again."

  • The Benefit: The student learns the strategy to solve problems, not just the answers. They generalize better.
  • The Downside: This takes a huge amount of time, energy, and a very smart coach (a reward system) to run. It's expensive and slow.

The Paper's Goal: The authors wanted to find a way to make the "Copycat" method (SFT) work as well as the "Coach" method (RL), without needing the expensive coach. They found a tiny, one-line fix that changes how the student learns.


The Hidden Flaw: The "Shouting" Teacher

The authors realized there is a mathematical glitch in how standard SFT works.

Imagine the teacher is grading the student.

  • If the student is confident and gets the answer right, the teacher gives a gentle "Good job."
  • But if the student is unsure (low probability) and still gets the answer right (because it's the correct answer in the book), the teacher starts screaming.

Why? In the math of SFT, the "reward" for a correct answer is inversely proportional to how likely the student thought it was.

  • High Confidence + Correct: Small reward (gentle nudge).
  • Low Confidence + Correct: Massive reward (screaming nudge).
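The inverse relationship above falls directly out of the cross-entropy loss: the gradient of the per-token loss -log(p) scales like 1/p, so the implicit "reward" weight explodes as confidence drops. A minimal Python sketch of that weighting (the function name and probability values are illustrative, not from the paper):

```python
def sft_grad_scale(p):
    """Magnitude of d/dp[-log(p)], the per-token SFT loss.

    The update on a correct token scales like 1/p: the less
    likely the model thought the token was, the harder it is
    pushed -- the "screaming" teacher.
    """
    return 1.0 / p

print(sft_grad_scale(0.9))   # confident token: gentle nudge (~1.11)
print(sft_grad_scale(0.01))  # unsure token: ~100x larger update
```

Note how a token the model assigned 1% probability gets pushed roughly 90 times harder than one it assigned 90%, which is the source of the instability.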

The Analogy: Imagine a student who is terrified of a specific math concept. They guess the right answer by accident. The teacher screams, "YES! THAT'S RIGHT!" so loudly that the student's brain shakes. The student learns to fear that specific concept and overreacts to it, rather than understanding it calmly. This causes the training to become unstable and the student to memorize the "screaming" moments rather than the logic.

The Solution: "Dynamic Fine-Tuning" (DFT)

The authors proposed a simple fix called Dynamic Fine-Tuning (DFT).

The Fix: They told the teacher: "Stop screaming at the unsure students. Just give everyone a calm, steady 'Good job' regardless of how confident they were."

In technical terms, they multiplied the learning signal by the student's own confidence level.

  • If the student was unsure (low probability), the "scream" is dampened.
  • If the student was confident, the signal stays normal.

The Result: The "screaming" stops. The student learns to focus on the logic of the solution rather than reacting to the intensity of the teacher's reaction. This makes the learning process stable and helps the student handle new, difficult problems they haven't seen before.
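In code, the fix amounts to a one-line rescaling of the token loss by the model's own probability, treated as a constant (detached) during backpropagation. A minimal Python sketch contrasting the two losses and their gradient weights (function names are illustrative; the paper works with log-probabilities in a deep-learning framework):

```python
import math

def sft_loss(p):
    """Standard SFT: cross-entropy on the correct token."""
    return -math.log(p)

def dft_loss(p):
    """DFT: the same loss rescaled by the token's own probability
    (the p factor is detached, i.e. treated as a constant)."""
    return -p * math.log(p)

def sft_grad_scale(p):
    # d/dp[-log p] = -1/p: unsure tokens get huge updates.
    return 1.0 / p

def dft_grad_scale(p):
    # With the p factor detached, d/dp[-p_sg * log p] = -p_sg / p,
    # which equals 1 when p_sg = p: every correct token gets the
    # same calm, steady nudge.
    return p * (1.0 / p)

for p in (0.9, 0.1, 0.01):
    print(p, sft_grad_scale(p), dft_grad_scale(p))
```

The multiplication by p exactly cancels the 1/p factor in the gradient, which is why the "screaming" on low-confidence tokens disappears while confident tokens are left essentially unchanged.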

What Happened When They Tried It?

The authors tested this "one-line change" on some of the hardest math and coding challenges available (like the Math Olympiad or coding competitions).

  1. Math Reasoning: The standard "Copycat" method (SFT) often got worse on hard problems. The new "DFT" method got significantly better, sometimes doubling the improvement.
  2. Code Generation: It worked great for writing code, helping the AI write cleaner, more logical programs.
  3. Visual Math: It even worked when the AI had to look at a picture of a math problem and solve it.
  4. The "Factual" Caveat: The authors noted one catch. If the task is just memorizing facts (like "What is the capital of France?"), the old "screaming" method (SFT) is actually fine. DFT is best for reasoning and logic, not just rote memory.

The "Aha!" Moment: Why It Works

The paper explains that by fixing this "screaming" issue, the AI stops trying to force itself to be perfect on every single tiny detail (like a comma or a connecting word).

The Analogy: Think of writing an essay.

  • Old SFT: Tries to make the student perfect at every word, including "the," "and," and "but." This wastes brainpower.
  • New DFT: Tells the student, "Don't worry so much about the small connecting words. Focus your energy on the main ideas and the logic."

This creates a "polarized" learning style: The AI becomes very confident on the important, meaningful parts of the answer and lets go of the trivial parts. This is exactly how humans learn to generalize!

Summary

  • The Problem: Standard AI training (SFT) is like a teacher who screams at students who guess right by accident, causing them to memorize answers instead of learning logic.
  • The Fix: A simple mathematical tweak (DFT) that calms the teacher down, treating all correct answers equally.
  • The Outcome: The AI learns faster, stays stable, and becomes much better at solving new, hard problems without needing expensive reinforcement learning.

It's a reminder that sometimes, the best way to improve a complex system isn't to add more tools, but to fix a tiny, hidden flaw in how it processes feedback.
