Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

This paper introduces On-Policy Self-Distillation (OPSD), a framework in which a single large language model acts as both teacher and student: the teacher sees privileged reasoning traces (the correct solutions) and uses them to supervise the model's own weaker policy. The authors report superior mathematical reasoning performance and significantly higher token efficiency compared with traditional off-policy distillation and reinforcement learning methods.

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

Published 2026-03-06

Here is an explanation of the paper "On-Policy Self-Distillation for Large Language Models" (OPSD) using simple language and creative analogies.

The Big Idea: Teaching Yourself by Looking at the Answer Key

Imagine you are a student taking a very hard math test. You are stuck on a problem.

  • The Old Way (Reinforcement Learning/GRPO): You guess an answer. If it's wrong, you get a big red "X" and have to start over. You might try 100 different guesses just to find one that works. It's like shooting arrows in the dark hoping one hits the bullseye. It takes a long time and uses a lot of energy.
  • The New Way (OPSD): You are allowed to peek at the answer key while you are solving the problem. You don't just see the final number; you see the step-by-step logic the teacher used. You try to solve it yourself, but every time you take a step, you check the answer key to see if your reasoning matches the teacher's. If you go off-track, the answer key gently nudges you back.

This paper introduces a method called On-Policy Self-Distillation (OPSD). It's a way for a single AI model to teach itself how to reason better by using the "answer key" (the correct solution) to guide its own thinking process.


The Characters in Our Story

To understand how this works, let's imagine the AI model has two "personalities" or "hats" it wears at the same time:

  1. The Student (The Learner): This version of the AI only sees the math problem. It has to figure out the answer from scratch, just like a human student. It makes mistakes and generates a solution step-by-step.
  2. The Teacher (The Guide): This is the exact same AI model, but it has a secret advantage: it has the Answer Key (the correct solution and the reasoning steps) in its pocket.

The Magic Trick: The Teacher doesn't actually write a new solution. Instead, it looks at what the Student just wrote and says, "Hey, if I had the answer key, here is how I would have continued from where you just stopped."

The Student then compares its own next step with the Teacher's "ideal" next step. If they match, great! If they don't, the Student learns to adjust its thinking to be more like the Teacher.
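The "compare your step against the Teacher's step" idea can be made concrete with a tiny numerical sketch. This is not the paper's implementation; it is a toy illustration of the general recipe behind on-policy self-distillation: the student samples its own tokens, and the loss is a per-token KL divergence between the student's next-token distribution and the distribution of the same model when it is conditioned on the answer key. All logits below are made-up numbers.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution over next tokens."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one token position.

    The student is penalized wherever its next-token distribution
    drifts away from the privileged teacher's distribution.
    """
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy example with a 3-token vocabulary (hypothetical numbers):
# the "student" sees only the problem; the "teacher" is the same model
# conditioned on the answer key, so its distribution is sharper.
student_step = [1.0, 0.5, 0.2]   # student is unsure
teacher_step = [3.0, 0.1, 0.1]   # privileged context concentrates mass on token 0

loss = per_token_kl(student_step, teacher_step)  # > 0: a nudge back on track
```

Because this loss is computed at every token of the student's own rollout (not just at the final answer), a single attempt already carries a dense learning signal.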

Why This is a Game-Changer

The paper compares this new method to the two main ways AI learns reasoning today:

1. The "Blind Guessing" Method (Reinforcement Learning / GRPO)

  • How it works: The AI tries to solve a problem. If it gets the final answer right, it gets a cookie (reward). If it's wrong, it gets no cookie.
  • The Problem: The AI has to guess many times (often 8 or more) to find a single correct answer. It's like trying to open a safe by spinning the dial randomly until it clicks. It's slow, expensive, and wasteful.
  • OPSD Advantage: OPSD doesn't need to guess 8 times. It only needs one attempt. Because it has the answer key guiding every single step, it learns much faster. The paper says it is 8 to 12 times more efficient than the guessing method.
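To see why the "blind guessing" signal is so sparse, here is a toy sketch of a GRPO-style outcome reward (this is a simplified illustration, not the full algorithm; the group of sampled answers is hypothetical). The model only gets credit when the final answer exactly matches, and each reward is normalized against the group's mean and standard deviation.

```python
def sparse_reward(answer, correct="42"):
    # Outcome-only reward: 1 if the final answer matches, 0 otherwise.
    # No partial credit for good intermediate reasoning steps.
    return 1.0 if answer == correct else 0.0

def grpo_advantages(rewards):
    # GRPO-style normalization: each reward is compared to the group.
    # If every sample in the group is wrong (or every one is right),
    # the std is zero and the whole group yields no learning signal.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# A hypothetical group of 8 sampled answers where only one is correct.
group = ["41", "7", "42", "13", "0", "99", "6", "-1"]
rewards = [sparse_reward(a) for a in group]
advs = grpo_advantages(rewards)  # only index 2 gets a positive advantage
```

One sample out of eight produced any positive signal here, and a group with zero correct answers would have produced none at all. That is the waste OPSD avoids by supervising every token of a single attempt.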

2. The "Memorization" Method (Supervised Fine-Tuning / SFT)

  • How it works: The AI is just shown the problem and the perfect solution, over and over again. It tries to memorize the pattern.
  • The Problem: This is like a student memorizing the answers to a practice test but failing the real exam because the questions are slightly different. The AI gets confused when it makes a small mistake early on and doesn't know how to recover.
  • OPSD Advantage: OPSD is like a student who does the work while checking the answer key. Because it trains on its own attempts, it learns how to recover from its own mistakes and how to reason through the problem, not just how to reproduce a memorized solution.
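The failure mode described above, often called exposure bias, can be shown with one toy calculation (hypothetical logits, not from the paper). SFT only ever computes its loss while teacher-forcing the gold trace, so the model is never trained in the off-track states its own mistakes create, and is lost when it reaches one at test time.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(logits, target_idx):
    # Negative log-likelihood of the gold next token, as used in SFT.
    return -math.log(softmax(logits)[target_idx])

# Hypothetical next-token logits for the same gold token (index 0):
# once after the GOLD prefix, once after the model's OWN slightly-wrong prefix.
logits_after_gold_prefix = [2.0, 0.1, 0.1]
logits_after_own_prefix  = [0.2, 1.5, 0.1]  # an off-track state SFT never visits

sft_loss      = cross_entropy(logits_after_gold_prefix, target_idx=0)  # small
offtrack_loss = cross_entropy(logits_after_own_prefix,  target_idx=0)  # large
```

SFT happily minimizes the first quantity while never touching the second; OPSD, by contrast, places its teacher signal directly on the student's own (possibly off-track) rollouts, so states like the second one are exactly where training happens.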

The "Self-Distillation" Concept

The word "Distillation" usually means taking knowledge from a big, smart teacher and pouring it into a smaller student.

In this paper, the AI is distilling itself.

  • It uses its own "smart" side (the Teacher with the answer key) to train its "learning" side (the Student without the key).
  • It's like a person working through a puzzle with the solution open beside them, checking each of their own steps against it as they go. They are teaching themselves how to think better.

Key Findings from the Paper

  1. You need a smart enough brain: This trick only works if the AI is already pretty good at reasoning. If the model is too small or too "dumb," it can't understand the answer key well enough to learn from it. It's like trying to teach calculus to a toddler using a textbook; the toddler just won't get it. The paper found that models with at least 4 billion parameters worked well, but a tiny 1.7 billion model struggled.
  2. More steps = Better learning: The longer the AI is allowed to think (generate more tokens) while checking the answer key, the better it gets. It needs to see the whole journey, not just the destination.
  3. It saves money: Because it learns faster and needs fewer computer cycles to train, this method is much cheaper for companies to use than the current "guessing" methods.

The Bottom Line

This paper proposes a smarter, faster way to train AI to be good at math and logic. Instead of letting the AI flail around in the dark hoping to get lucky, or just forcing it to memorize answers, OPSD lets the AI practice solving problems while holding the answer key in one hand.

It's the difference between:

  • Old Way: "Try 100 times until you get it right."
  • New Way: "Try once, but check your work against the solution at every single step."

The result? An AI that learns to reason better, faster, and with less computing power.