On-Policy Self-Distillation for Reasoning Compression

The paper introduces OPSDC, an on-policy self-distillation method that trains reasoning models to generate more concise outputs by minimizing the reverse KL divergence between the model's normal outputs and its own outputs under a "be concise" instruction. The method achieves significant token reduction and accuracy improvements on benchmarks such as MATH-500 and AIME 2024 without requiring ground-truth answers or token budgets.

Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun

Published 2026-03-06

The Big Idea: "Less Thinking, Better Answers"

Imagine you have a brilliant but overly chatty student. When you ask them a simple question like "What is 2+2?", they don't just say "4." Instead, they write a 500-page essay debating whether you meant binary code, checking if the numbers are prime, questioning the nature of addition, and then finally concluding, "Okay, I'm pretty sure it's 4."

This is exactly what modern AI "reasoning models" do. They think out loud, generating thousands of words before giving an answer. While this helps with hard problems, it often creates noise on easy ones. That noise isn't just annoying; it's dangerous. Every extra word the AI writes is a new chance to make a mistake, get confused, or talk itself into a wrong answer.

OPSDC is a new training method that teaches these chatty AIs to be concise, and surprisingly, this makes them smarter.


The Problem: The "Overthinker's Diet"

Current methods to fix this chatty behavior usually have a catch:

  • The "Strict Teacher" (Reinforcement Learning): You tell the AI, "If you write more than 50 words, you get a bad grade." The AI learns to shut up, but it often stops thinking at all, collapsing its ability to solve hard problems.
  • The "Copycat" (Supervised Fine-Tuning): You show the AI examples of short answers written by humans. The AI learns to mimic them but forgets its own unique way of thinking.

The Solution: OPSDC (The "Mirror" Method)

The authors of this paper came up with a clever trick that requires no human answers, no strict word limits, and no complex scoring systems. They call it Self-Distillation.

Here is how it works, using a Mirror Analogy:

  1. The Setup: Imagine the AI is standing in front of a mirror.

    • The Student (The AI): The AI tries to solve a math problem normally, chattering away as it usually does.
    • The Teacher (The Mirror): The same AI looks at the problem again, but this time, a note is taped to the mirror that says: "Be concise. Cut the fluff. Just get to the point."
  2. The Lesson: The AI compares its own long, chatty answer (Student) with the concise answer it generated while looking at the "Be Concise" note (Teacher).

    • It doesn't need a human to tell it which answer is right. It just learns to make its "normal" answer look more like its "concise" answer.
    • It's like the AI is teaching itself: "Hey, I know I can be brief when I try, so why am I so wordy when I don't have to be?"
  3. The Magic: The AI repeats this process. Every time it gets better at being concise, the "Teacher" (the mirror) updates to be even more concise. This creates a cycle where the AI gets progressively better at cutting out the noise without losing the signal.


Why Does This Make It Smarter?

You might think, "If I cut out the thinking, won't the AI get dumber?" Actually, the opposite happens.

The "Trip Hazard" Analogy:
Imagine the AI is walking down a long, dark hallway to get to the answer.

  • The Old Way: The AI takes 10,000 steps, tripping over its own feet, bumping into walls, and talking to itself the whole time. With so many steps, the chance of falling (making a mistake) is huge.
  • The OPSDC Way: The AI learns to take only the necessary 2,000 steps. It walks straight to the goal. Because it took fewer steps, it didn't trip as often. By walking less, it arrived more safely.

The paper argues that much of what the AI writes is "noise" that actively causes errors. By removing the noise, training removes the mistakes that came with it.
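
The hallway analogy has simple math behind it: if each generated token carries some small independent chance of derailing the solution, the chance of at least one derailment compounds with length. A back-of-the-envelope check (the per-token error rate here is an invented illustration, not a number from the paper):

```python
def p_any_mistake(n_tokens, p_err=1e-4):
    """Chance of at least one error across n tokens, assuming each token
    independently derails the solution with probability p_err."""
    return 1 - (1 - p_err) ** n_tokens

long_trace = p_any_mistake(10_000)   # the chatty 10,000-step walk
short_trace = p_any_mistake(2_000)   # the concise 2,000-step walk
print(f"long: {long_trace:.0%}, short: {short_trace:.0%}")  # → long: 63%, short: 18%
```

Under this toy model, cutting the walk from 10,000 steps to 2,000 drops the chance of tripping at least once from roughly 63% to roughly 18%.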

The Results: Less is More

The researchers tested this on Qwen3 (a powerful AI model) with math problems:

  • MATH-500 (Medium difficulty): The AI cut its output length by 57% (writing less than half as much) while getting 16% more answers correct.
  • AIME 2024 (Hard difficulty): It cut the length by 41% and improved accuracy by 10 points.

It's like a student who used to write a 10-page essay to solve a problem, getting confused halfway through. After training, they write a 4-page essay, stay focused, and get the right answer far more often.

The Secret Sauce: It Adapts Automatically

The coolest part of OPSDC is that it's smart about when to be concise.

  • Easy Problems: If the answer is obvious, the AI cuts the thinking down to almost nothing.
  • Hard Problems: If the problem is truly difficult, the AI knows it still needs to "think hard." It keeps the necessary steps and only cuts the fluff.

It doesn't need a human to tell it, "This is an easy problem, be short." It figures it out on its own because the "concise teacher" naturally struggles to be short on hard problems, so the AI learns to keep the depth where it's needed.

Summary

OPSDC is a method where an AI teaches itself to stop overthinking.

  • Before: The AI was a nervous chatterbox that talked itself into wrong answers.
  • After: The AI is a calm, direct expert that gets straight to the point.
  • The Result: Shorter answers, faster processing, and higher accuracy.

It turns out that for AI, just like for humans, sometimes the best way to think clearly is to stop talking so much.
