Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

The paper proposes Self-Distillation Zero (SD-Zero), a training method that converts sparse binary rewards into dense token-level supervision by having a single model act as both generator and reviser. The result is superior performance on reasoning benchmarks without external teachers or high-quality demonstrations.

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora

Published 2026-04-15

Imagine you are teaching a student to solve complex math problems. You have two traditional ways to do this:

  1. The "Guess and Check" Method (Reinforcement Learning): You let the student try to solve a problem. If they get the final answer right, you give them a gold star. If they get it wrong, you give them a thumbs down. You don't tell them where they went wrong, just that the result was bad. The student has to guess and try again thousands of times to figure out which steps were good and which were bad. This is slow and expensive.
  2. The "Perfect Tutor" Method (Distillation): You hire a genius tutor who solves the problem perfectly. You show the student the tutor's step-by-step solution and say, "Copy this exactly." This is very effective, but finding a genius tutor for every single problem is incredibly expensive and often impossible.

SD-Zero is a new, clever method that combines the best of both worlds without needing a genius tutor or thousands of guesses. It's like giving the student a "superpower" to critique and fix their own work.

Here is how it works, broken down into a simple story:

The Two Roles: The Artist and The Critic

In SD-Zero, the AI model plays two roles simultaneously, like an artist who is also their own art critic.

  1. The Generator (The Artist): The model tries to solve a problem. It might get it right, or it might make a mistake.
  2. The Reviser (The Critic): The model looks at its own attempt.
    • If the answer is wrong, the Critic says: "Wait, this is wrong. Let me start over and fix the mistake."
    • If the answer is right, the Critic says: "Good job, but let me rephrase this to make it shorter and cleaner."

Phase 1: Learning to Fix Mistakes (The "Self-Revision" Gym)

First, the model practices this "Critique and Fix" routine.

  • It generates an answer.
  • A simple checker says "Right" or "Wrong."
  • The model is forced to generate a new answer based on that feedback.
  • The Magic: The model learns that when it sees "Wrong," it needs to find the specific part of its logic that failed and fix it. When it sees "Right," it learns to be more concise.

Think of this like a writer who writes a draft, gets a "Needs Work" stamp, and then rewrites the story. After doing this 6,000 times, the model gets really good at spotting its own errors and fixing them.
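The Phase 1 loop can be sketched in a few lines: generate, check, append the verdict to the context, generate again, and keep the pair as training data. Everything here (the function names, the prompt format, the toy checker) is a hypothetical illustration under my own naming, not the paper's code.

```python
# Hypothetical sketch of the Phase-1 "self-revision gym":
# each round yields (context-with-verdict -> revision) training pairs.

def self_revision_round(generate, check, problems):
    training_pairs = []
    for problem in problems:
        attempt = generate(problem)                      # first try
        verdict = "Right" if check(problem, attempt) else "Wrong"
        context = f"{problem}\nAttempt: {attempt}\nVerdict: {verdict}"
        revision = generate(context)                     # second try, conditioned on feedback
        # The model is then trained on these pairs, so that seeing
        # "Wrong" teaches it to locate and fix its own error.
        training_pairs.append((context, revision))
    return training_pairs

# Toy usage: a "model" that only fixes its arithmetic once told it is wrong.
def toy_generate(prompt):
    return "2+2=4" if "Wrong" in prompt else "2+2=5"

def toy_check(problem, attempt):
    return attempt.endswith("=4")

pairs = self_revision_round(toy_generate, toy_check, ["2+2=?"])
```

Note that the only external signal in the loop is the one-bit verdict from the checker; the revised answer, which carries the detailed corrective information, comes from the model itself.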

Phase 2: The "Telepathic" Upgrade (Self-Distillation)

This is the real magic trick. Usually, if you want to learn from a teacher, you need to read their notes. But here, the model is learning from its own future self.

  • The model (now acting as the Student) tries to solve a problem in one go.
  • The model (acting as the Teacher/Critic) looks at that attempt and says, "Here is how I would have fixed that specific sentence."
  • The Student learns to predict the Teacher's corrections before it even makes the mistake.
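The "dense supervision" in the steps above means the Student gets a loss term for every token of the Teacher's revision, not one bit per answer. A minimal sketch of such a per-token objective is below; real systems would use model logits and a cross-entropy or KL loss over the full vocabulary, while here distributions are plain dicts for illustration.

```python
import math

# Hypothetical sketch: score the Student against the Reviser's revision
# token by token, turning one Right/Wrong bit into a loss per position.

def token_level_loss(student_probs, teacher_tokens):
    """Average cross-entropy of the Student's next-token distributions
    against the tokens the Reviser (Teacher) actually produced."""
    losses = []
    for position, token in enumerate(teacher_tokens):
        # Student's probability for the Teacher's token at this position;
        # a tiny floor avoids log(0) for tokens the Student never predicts.
        p = student_probs[position].get(token, 1e-9)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Toy usage: a 2-token revision. The Student is confident on the first
# token and unsure on the second, so the second token dominates the loss,
# pointing exactly at where the Student disagrees with its Teacher self.
student = [{"2": 0.9, "3": 0.1}, {"4": 0.5, "5": 0.5}]
loss = token_level_loss(student, ["2", "4"])
```

This is why the method is called self-*distillation*: the training signal has the per-token shape of distillation from a teacher, but the teacher is just the same model conditioned on its own attempt plus the binary verdict.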

The Analogy: Imagine a basketball player who usually shoots, misses, and then has to run back to the coach to get advice on their form.

  • SD-Zero is like the player suddenly developing a "sixth sense." They can feel exactly where their form was off while they are shooting, and they adjust their aim instantly without needing to stop and ask the coach. They have internalized the coach's advice.

Why is this a Big Deal?

  1. It Turns "Yes/No" into "Detailed Feedback":
    Normally, a binary reward (Right/Wrong) is like a traffic light. It just says "Stop" or "Go." SD-Zero turns that red light into a detailed map showing exactly which lane you were in and how to steer back. It takes a simple "You failed" and turns it into a dense lesson plan for every single word the AI wrote.

  2. It's Cheaper and Faster:
    Because the model learns to fix its own mistakes, it doesn't need a human expert to write the "perfect" answers. It creates its own high-quality training data by fixing its own bad attempts.

  3. It Gets Smarter Over Time:
    As the model gets better at fixing mistakes, it becomes a better teacher for itself. The paper shows that if you let the model practice this loop a few times, it keeps getting better and better, almost like a video game character leveling up by fighting its own clones.

The Result

When tested on hard math and coding problems, this method improved the models' accuracy by roughly 10% over their starting point. More importantly, it made them faster: instead of writing a long, rambling answer and then backtracking to fix it (which wastes tokens), the model learned to think clearly the first time, producing shorter, more accurate answers.

In short: SD-Zero teaches an AI to be its own best teacher, turning simple "pass/fail" grades into a masterclass on how to think correctly.
