Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

The paper proposes Self-Distillation Zero (SD-Zero), a training method that converts sparse binary rewards into dense token-level supervision by having a single model act as both generator and reviser. The result is superior performance on reasoning benchmarks without external teachers or high-quality demonstrations.

Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora

Published 2026-04-15

Imagine you are teaching a student to solve complex math problems. You have two traditional ways to do this:

  1. The "Guess and Check" Method (Reinforcement Learning): You let the student try to solve a problem. If they get the final answer right, you give them a gold star. If they get it wrong, you give them a thumbs down. You don't tell them where they went wrong, just that the result was bad. The student has to guess and try again thousands of times to figure out which steps were good and which were bad. This is slow and expensive.
  2. The "Perfect Tutor" Method (Distillation): You hire a genius tutor who solves the problem perfectly. You show the student the tutor's step-by-step solution and say, "Copy this exactly." This is very effective, but finding a genius tutor for every single problem is incredibly expensive and often impossible.

SD-Zero is a new, clever method that combines the best of both worlds without needing a genius tutor or thousands of guesses. It's like giving the student a "superpower" to critique and fix their own work.

Here is how it works, broken down into a simple story:

The Two Roles: The Artist and The Critic

In SD-Zero, the AI model plays two roles simultaneously, like an artist who is also their own art critic.

  1. The Generator (The Artist): The model tries to solve a problem. It might get it right, or it might make a mistake.
  2. The Reviser (The Critic): The model looks at its own attempt.
    • If the answer is wrong, the Critic says: "Wait, this is wrong. Let me start over and fix the mistake."
    • If the answer is right, the Critic says: "Good job, but let me rephrase this to make it shorter and cleaner."

Phase 1: Learning to Fix Mistakes (The "Self-Revision" Gym)

First, the model practices this "Critique and Fix" routine.

  • It generates an answer.
  • A simple checker says "Right" or "Wrong."
  • The model is forced to generate a new answer based on that feedback.
  • The Magic: The model learns that when it sees "Wrong," it needs to find the specific part of its logic that failed and fix it. When it sees "Right," it learns to be more concise.

Think of this like a writer who writes a draft, gets a "Needs Work" stamp, and then rewrites the story. After doing this 6,000 times, the model gets really good at spotting its own errors and fixing them.
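The Phase 1 loop can be sketched in a few lines: generate, check, append the verdict to the context, generate again, and keep the pair as training data. Everything here (the function names, the prompt format, the toy checker) is a hypothetical illustration under my own naming, not the paper's code.

```python
# Hypothetical sketch of the Phase-1 "self-revision gym":
# each round yields (context-with-verdict -> revision) training pairs.

def self_revision_round(generate, check, problems):
    training_pairs = []
    for problem in problems:
        attempt = generate(problem)                      # first try
        verdict = "Right" if check(problem, attempt) else "Wrong"
        context = f"{problem}\nAttempt: {attempt}\nVerdict: {verdict}"
        revision = generate(context)                     # second try, conditioned on feedback
        # The model is then trained on these pairs, so that seeing
        # "Wrong" teaches it to locate and fix its own error.
        training_pairs.append((context, revision))
    return training_pairs

# Toy usage: a "model" that only fixes its arithmetic once told it is wrong.
def toy_generate(prompt):
    return "2+2=4" if "Wrong" in prompt else "2+2=5"

def toy_check(problem, attempt):
    return attempt.endswith("=4")

pairs = self_revision_round(toy_generate, toy_check, ["2+2=?"])
```

Note that the only external signal in the loop is the one-bit verdict from the checker; the revised answer, which carries the detailed corrective information, comes from the model itself.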

Phase 2: The "Telepathic" Upgrade (Self-Distillation)

This is the real magic trick. Usually, if you want to learn from a teacher, you need to read their notes. But here, the model is learning from its own future self.

  • The model (now acting as the Student) tries to solve a problem in one go.
  • The model (acting as the Teacher/Critic) looks at that attempt and says, "Here is how I would have fixed that specific sentence."
  • The Student learns to predict the Teacher's corrections before it even makes the mistake.
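The "dense supervision" in the steps above means the Student gets a loss term for every token of the Teacher's revision, not one bit per answer. A minimal sketch of such a per-token objective is below; real systems would use model logits and a cross-entropy or KL loss over the full vocabulary, while here distributions are plain dicts for illustration.

```python
import math

# Hypothetical sketch: score the Student against the Reviser's revision
# token by token, turning one Right/Wrong bit into a loss per position.

def token_level_loss(student_probs, teacher_tokens):
    """Average cross-entropy of the Student's next-token distributions
    against the tokens the Reviser (Teacher) actually produced."""
    losses = []
    for position, token in enumerate(teacher_tokens):
        # Student's probability for the Teacher's token at this position;
        # a tiny floor avoids log(0) for tokens the Student never predicts.
        p = student_probs[position].get(token, 1e-9)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Toy usage: a 2-token revision. The Student is confident on the first
# token and unsure on the second, so the second token dominates the loss,
# pointing exactly at where the Student disagrees with its Teacher self.
student = [{"2": 0.9, "3": 0.1}, {"4": 0.5, "5": 0.5}]
loss = token_level_loss(student, ["2", "4"])
```

This is why the method is called self-*distillation*: the training signal has the per-token shape of distillation from a teacher, but the teacher is just the same model conditioned on its own attempt plus the binary verdict.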

The Analogy: Imagine a basketball player who usually shoots, misses, and then has to run back to the coach to get advice on their form.

  • SD-Zero is like the player suddenly developing a "sixth sense." They can feel exactly where their form was off while they are shooting, and they adjust their aim instantly without needing to stop and ask the coach. They have internalized the coach's advice.

Why is this a Big Deal?

  1. It Turns "Yes/No" into "Detailed Feedback":
    Normally, a binary reward (Right/Wrong) is like a traffic light. It just says "Stop" or "Go." SD-Zero turns that red light into a detailed map showing exactly which lane you were in and how to steer back. It takes a simple "You failed" and turns it into a dense lesson plan for every single word the AI wrote.

  2. It's Cheaper and Faster:
    Because the model learns to fix its own mistakes, it doesn't need a human expert to write the "perfect" answers. It creates its own high-quality training data by fixing its own bad attempts.

  3. It Gets Smarter Over Time:
    As the model gets better at fixing mistakes, it becomes a better teacher for itself. The paper shows that if you let the model practice this loop a few times, it keeps getting better and better, almost like a video game character leveling up by fighting its own clones.

The Result

When tested on hard math and coding problems, this method improved the models' accuracy by roughly 10% over their starting point. More importantly, it made them faster: instead of writing a long, rambling answer and then backtracking to fix it (which wastes tokens), the model learned to think clearly the first time, producing shorter, more accurate answers.

In short: SD-Zero teaches an AI to be its own best teacher, turning simple "pass/fail" grades into a masterclass on how to think correctly.
