Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

This paper introduces On-Policy Self-Distillation (OPSD), a framework in which a single large language model acts as both teacher and student: the teacher sees privileged reasoning traces (the correct solutions) and uses them to supervise the model's own weaker policy. The authors report superior mathematical reasoning performance and significantly higher token efficiency compared with traditional off-policy distillation and reinforcement learning methods.

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover

Published 2026-03-06

Here is an explanation of the paper "On-Policy Self-Distillation for Large Language Models" (OPSD) using simple language and creative analogies.

The Big Idea: Teaching Yourself by Looking at the Answer Key

Imagine you are a student taking a very hard math test. You are stuck on a problem.

  • The Old Way (Reinforcement Learning/GRPO): You guess an answer. If it's wrong, you get a big red "X" and have to start over. You might try 100 different guesses just to find one that works. It's like shooting arrows in the dark hoping one hits the bullseye. It takes a long time and uses a lot of energy.
  • The New Way (OPSD): You are allowed to peek at the answer key while you are solving the problem. You don't just see the final number; you see the step-by-step logic the teacher used. You try to solve it yourself, but every time you take a step, you check the answer key to see if your reasoning matches the teacher's. If you go off-track, the answer key gently nudges you back.

This paper introduces a method called On-Policy Self-Distillation (OPSD). It's a way for a single AI model to teach itself how to reason better by using the "answer key" (the correct solution) to guide its own thinking process.


The Characters in Our Story

To understand how this works, let's imagine the AI model has two "personalities" or "hats" it wears at the same time:

  1. The Student (The Learner): This version of the AI only sees the math problem. It has to figure out the answer from scratch, just like a human student. It makes mistakes and generates a solution step-by-step.
  2. The Teacher (The Guide): This is the exact same AI model, but it has a secret advantage: it has the Answer Key (the correct solution and the reasoning steps) in its pocket.

The Magic Trick: The Teacher doesn't actually write a new solution. Instead, it looks at what the Student just wrote and says, "Hey, if I had the answer key, here is how I would have continued from where you just stopped."

The Student then compares its own next step with the Teacher's "ideal" next step. If they match, great! If they don't, the Student learns to adjust its thinking to be more like the Teacher.
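The "compare your step against the Teacher's step" idea can be made concrete with a tiny numerical sketch. This is not the paper's implementation; it is a toy illustration of the general recipe behind on-policy self-distillation: the student samples its own tokens, and the loss is a per-token KL divergence between the student's next-token distribution and the distribution of the same model when it is conditioned on the answer key. All logits below are made-up numbers.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution over next tokens."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def per_token_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one token position.

    The student is penalized wherever its next-token distribution
    drifts away from the privileged teacher's distribution.
    """
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy example with a 3-token vocabulary (hypothetical numbers):
# the "student" sees only the problem; the "teacher" is the same model
# conditioned on the answer key, so its distribution is sharper.
student_step = [1.0, 0.5, 0.2]   # student is unsure
teacher_step = [3.0, 0.1, 0.1]   # privileged context concentrates mass on token 0

loss = per_token_kl(student_step, teacher_step)  # > 0: a nudge back on track
```

Because this loss is computed at every token of the student's own rollout (not just at the final answer), a single attempt already carries a dense learning signal.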

Why This is a Game-Changer

The paper compares this new method to the two main ways AI learns reasoning today:

1. The "Blind Guessing" Method (Reinforcement Learning / GRPO)

  • How it works: The AI tries to solve a problem. If it gets the final answer right, it gets a cookie (reward). If it's wrong, it gets no cookie.
  • The Problem: The AI has to guess many times (often 8 or more) to find a single correct answer. It's like trying to open a safe by spinning the dial randomly until it clicks. It's slow, expensive, and wasteful.
  • OPSD Advantage: OPSD doesn't need to guess 8 times. It only needs one attempt. Because it has the answer key guiding every single step, it learns much faster. The paper says it is 8 to 12 times more efficient than the guessing method.
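To see why the "blind guessing" signal is so sparse, here is a toy sketch of a GRPO-style outcome reward (this is a simplified illustration, not the full algorithm; the group of sampled answers is hypothetical). The model only gets credit when the final answer exactly matches, and each reward is normalized against the group's mean and standard deviation.

```python
def sparse_reward(answer, correct="42"):
    # Outcome-only reward: 1 if the final answer matches, 0 otherwise.
    # No partial credit for good intermediate reasoning steps.
    return 1.0 if answer == correct else 0.0

def grpo_advantages(rewards):
    # GRPO-style normalization: each reward is compared to the group.
    # If every sample in the group is wrong (or every one is right),
    # the std is zero and the whole group yields no learning signal.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# A hypothetical group of 8 sampled answers where only one is correct.
group = ["41", "7", "42", "13", "0", "99", "6", "-1"]
rewards = [sparse_reward(a) for a in group]
advs = grpo_advantages(rewards)  # only index 2 gets a positive advantage
```

One sample out of eight produced any positive signal here, and a group with zero correct answers would have produced none at all. That is the waste OPSD avoids by supervising every token of a single attempt.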

2. The "Memorization" Method (Supervised Fine-Tuning / SFT)

  • How it works: The AI is just shown the problem and the perfect solution, over and over again. It tries to memorize the pattern.
  • The Problem: This is like a student memorizing the answers to a practice test but failing the real exam because the questions are slightly different. The AI gets confused when it makes a small mistake early on and doesn't know how to recover.
  • OPSD Advantage: OPSD is like a student who does the work while checking the answer key. Because it trains on its own attempts, it learns how to recover from its own mistakes and how to reason through the problem, not just how to reproduce a memorized solution.
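The failure mode described above, often called exposure bias, can be shown with one toy calculation (hypothetical logits, not from the paper). SFT only ever computes its loss while teacher-forcing the gold trace, so the model is never trained in the off-track states its own mistakes create, and is lost when it reaches one at test time.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(logits, target_idx):
    # Negative log-likelihood of the gold next token, as used in SFT.
    return -math.log(softmax(logits)[target_idx])

# Hypothetical next-token logits for the same gold token (index 0):
# once after the GOLD prefix, once after the model's OWN slightly-wrong prefix.
logits_after_gold_prefix = [2.0, 0.1, 0.1]
logits_after_own_prefix  = [0.2, 1.5, 0.1]  # an off-track state SFT never visits

sft_loss      = cross_entropy(logits_after_gold_prefix, target_idx=0)  # small
offtrack_loss = cross_entropy(logits_after_own_prefix,  target_idx=0)  # large
```

SFT happily minimizes the first quantity while never touching the second; OPSD, by contrast, places its teacher signal directly on the student's own (possibly off-track) rollouts, so states like the second one are exactly where training happens.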

The "Self-Distillation" Concept

The word "Distillation" usually means taking knowledge from a big, smart teacher and pouring it into a smaller student.

In this paper, the AI is distilling itself.

  • It uses its own "smart" side (the Teacher with the answer key) to train its "learning" side (the Student without the key).
  • It's like a person working through a puzzle with the solution open beside them, checking each of their own steps against it as they go. They are teaching themselves how to think better.

Key Findings from the Paper

  1. You need a smart enough brain: This trick only works if the AI is already pretty good at reasoning. If the model is too small or too "dumb," it can't understand the answer key well enough to learn from it. It's like trying to teach calculus to a toddler using a textbook; the toddler just won't get it. The paper found that models with at least 4 billion parameters worked well, but a tiny 1.7 billion model struggled.
  2. More steps = Better learning: The longer the AI is allowed to think (generate more tokens) while checking the answer key, the better it gets. It needs to see the whole journey, not just the destination.
  3. It saves money: Because it learns faster and needs fewer computer cycles to train, this method is much cheaper for companies to use than the current "guessing" methods.

The Bottom Line

This paper proposes a smarter, faster way to train AI to be good at math and logic. Instead of letting the AI flail around in the dark hoping to get lucky, or just forcing it to memorize answers, OPSD lets the AI practice solving problems while holding the answer key in one hand.

It's the difference between:

  • Old Way: "Try 100 times until you get it right."
  • New Way: "Try once, but check your work against the solution at every single step."

The result? An AI that learns to reason better, faster, and with less computing power.