On-Policy Self-Distillation for Reasoning Compression

The paper introduces OPSDC, an on-policy self-distillation method that trains reasoning models to generate more concise outputs by minimizing the reverse KL divergence between the model's normal outputs and its own outputs under a "be concise" instruction. The method achieves significant token reduction and accuracy improvements on benchmarks such as MATH-500 and AIME 2024 without requiring ground-truth answers or token budgets.

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun

Published 2026-03-06

The Big Idea: "Less Thinking, Better Answers"

Imagine you have a brilliant but overly chatty student. When you ask them a simple question like "What is 2+2?", they don't just say "4." Instead, they write a 500-page essay debating whether you meant binary code, checking if the numbers are prime, questioning the nature of addition, and then finally concluding, "Okay, I'm pretty sure it's 4."

This is exactly what modern AI "reasoning models" do. They think out loud, generating thousands of words before giving an answer. While this helps with hard problems, it often creates noise on easy ones. That noise isn't just annoying; it's dangerous. Every extra word the AI writes is a new chance to make a mistake, get confused, or talk itself into a wrong answer.

OPSDC is a new training method that teaches these chatty AIs to be concise, and surprisingly, this makes them smarter.


The Problem: The "Overthinker's Diet"

Current methods to fix this chatty behavior usually have a catch:

  • The "Strict Teacher" (Reinforcement Learning): You tell the AI, "If you write more than 50 words, you get a bad grade." The AI learns to shut up, but it often stops thinking at all, collapsing its ability to solve hard problems.
  • The "Copycat" (Supervised Fine-Tuning): You show the AI examples of short answers written by humans. The AI learns to mimic them but forgets its own unique way of thinking.

The Solution: OPSDC (The "Mirror" Method)

The authors of this paper came up with a clever trick that requires no human answers, no strict word limits, and no complex scoring systems. They call it Self-Distillation.

Here is how it works, using a Mirror Analogy:

  1. The Setup: Imagine the AI is standing in front of a mirror.

    • The Student (The AI): The AI tries to solve a math problem normally, chattering away as it usually does.
    • The Teacher (The Mirror): The same AI looks at the problem again, but this time, a note is taped to the mirror that says: "Be concise. Cut the fluff. Just get to the point."
  2. The Lesson: The AI compares its own long, chatty answer (Student) with the concise answer it generated while looking at the "Be Concise" note (Teacher).

    • It doesn't need a human to tell it which answer is right. It just learns to make its "normal" answer look more like its "concise" answer.
    • It's like the AI is teaching itself: "Hey, I know I can be brief when I try, so why am I so wordy when I don't have to be?"
  3. The Magic: The AI repeats this process. Every time it gets better at being concise, the "Teacher" (the mirror) updates to be even more concise. This creates a cycle where the AI gets progressively better at cutting out the noise without losing the signal.


Why Does This Make It Smarter?

You might think, "If I cut out the thinking, won't the AI get dumber?" Actually, the opposite happens.

The "Trip Hazard" Analogy:
Imagine the AI is walking down a long, dark hallway to get to the answer.

  • The Old Way: The AI takes 10,000 steps, tripping over its own feet, bumping into walls, and talking to itself the whole time. With so many steps, the chance of falling (making a mistake) is huge.
  • The OPSDC Way: The AI learns to take only the necessary 2,000 steps. It walks straight to the goal. Because it took fewer steps, it didn't trip as often. By walking less, it arrived more safely.

The paper argues that much of what the AI writes is "noise" that actively causes errors. By removing the noise, training removes the mistakes that came with it.
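
The hallway analogy has simple math behind it: if each generated token carries some small independent chance of derailing the solution, the chance of at least one derailment compounds with length. A back-of-the-envelope check (the per-token error rate here is an invented illustration, not a number from the paper):

```python
def p_any_mistake(n_tokens, p_err=1e-4):
    """Chance of at least one error across n tokens, assuming each token
    independently derails the solution with probability p_err."""
    return 1 - (1 - p_err) ** n_tokens

long_trace = p_any_mistake(10_000)   # the chatty 10,000-step walk
short_trace = p_any_mistake(2_000)   # the concise 2,000-step walk
print(f"long: {long_trace:.0%}, short: {short_trace:.0%}")  # → long: 63%, short: 18%
```

Under this toy model, cutting the walk from 10,000 steps to 2,000 drops the chance of tripping at least once from roughly 63% to roughly 18%.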

The Results: Less is More

The researchers tested this on Qwen3 (a powerful AI model) with math problems:

  • MATH-500 (Medium difficulty): The AI cut its output length by 57% (writing less than half as much) while getting 16% more answers correct.
  • AIME 2024 (Hard difficulty): It cut the length by 41% and improved accuracy by 10 points.

It's like a student who used to write a 10-page essay to solve a problem, getting confused halfway through. After training, they write a 4-page essay, stay focused, and get the right answer far more often.

The Secret Sauce: It Adapts Automatically

The coolest part of OPSDC is that it's smart about when to be concise.

  • Easy Problems: If the answer is obvious, the AI cuts the thinking down to almost nothing.
  • Hard Problems: If the problem is truly difficult, the AI knows it still needs to "think hard." It keeps the necessary steps and only cuts the fluff.

It doesn't need a human to tell it, "This is an easy problem, be short." It figures it out on its own because the "concise teacher" naturally struggles to be short on hard problems, so the AI learns to keep the depth where it's needed.

Summary

OPSDC is a method where an AI teaches itself to stop overthinking.

  • Before: The AI was a nervous chatterbox that talked itself into wrong answers.
  • After: The AI is a calm, direct expert that gets straight to the point.
  • The Result: Shorter answers, faster processing, and higher accuracy.

It turns out that for AI, just like for humans, sometimes the best way to think clearly is to stop talking so much.
