Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

The Big Picture: The "Student Who Outsmarts the Master"

Imagine you are a student trying to learn how to solve complex math problems or write code. You have a Teacher (a very smart AI) and a Student (a smaller, less smart AI).

Usually, when we teach the Student, we use one of two methods:

Off-Policy (The Textbook Method): The Teacher solves problems, writes down the answers, and the Student just memorizes them. The Student never tries to solve a problem on their own; they just copy the Teacher's homework.
On-Policy (The Tutoring Method): The Student tries to solve the problems themselves. When they get stuck or make a mistake, the Teacher looks at the Student's own work and says, "Actually, for this specific step, you should have done X instead of Y." The Student learns from their own mistakes in real-time.

The Problem: The "Tutoring Method" (On-Policy Distillation) is great, but it has a ceiling. The Student usually ends up being just as good as the Teacher, but rarely better. They are stuck mimicking the Teacher's limits.

The Solution: This paper introduces a new technique called ExOPD (Extrapolated On-Policy Distillation). It's like giving the Student a "super-charger" that allows them to not just copy the Teacher, but to surpass them.

The Secret Sauce: "Reward Extrapolation"

To understand how this works, let's look at the two main ingredients the authors added to the recipe:

1. The "Volume Knob" (Reward Scaling Factor)

In standard tutoring, the Teacher's advice and the Student's own confidence are balanced perfectly (50/50).

The Paper's Idea: What if we turn up the volume on the Teacher's advice?
The Analogy: Imagine a coach telling an athlete, "You ran that lap in 10 seconds. That's good."
- Standard Method: The athlete thinks, "Okay, I'll try to run 10 seconds."
- ExOPD (Extrapolation): The coach says, "You ran 10 seconds, but imagine if you could run even faster based on how much better you could be!" The coach exaggerates the reward for doing well.
- The Result: By "extrapolating" (stretching) the reward signal, the Student is pushed to try harder and discover solutions the Teacher didn't even think of. It's like telling a student, "You got an A, but imagine if you could get an A+ by thinking outside the box."

2. The "Reference Point" (Choosing the Right Baseline)

When the Teacher gives feedback, they compare the Student's answer to a "Reference Model" (a baseline of what is expected).

The Problem: If the Teacher is a giant 30-billion-parameter brain and the Student is a tiny 1.7-billion-parameter brain, comparing them directly is unfair. It's like comparing a professional chef to a toddler. The toddler's "mistakes" look huge because the gap is so big.
The Fix: The paper suggests using the Teacher's pre-training self (the Teacher before they learned the specific skills) as the reference point instead of the Student's base.
The Analogy: Instead of comparing the Toddler to the Pro Chef, we compare the Toddler to the Pro Chef's younger self (before they went to culinary school). This makes the feedback more accurate and less noisy, helping the Student learn faster.

What Happened in the Experiments?

The researchers tested this on two tough tasks: Math Reasoning and Code Generation.

Scenario A: Merging Multiple Experts (The "All-Star Team")

Imagine you have a Math Teacher and a Coding Teacher. Both are experts, but they only know their own subject. You want to combine them into one "Super Student" who is good at both.

Old Way: The student learns from both but ends up being average at both, or just as good as the original teachers.
ExOPD Way: By using the "Volume Knob" (extrapolation), the student learned to combine the skills so well that the new student became better than both original teachers. It's like a student who, after studying with a math genius and a coding wizard, becomes a better mathematician and a better coder than either of their mentors.

Scenario B: Big Teacher, Small Student (The "Mentorship")

This is the classic "Strong-to-Weak" setup.

Result: Even when the Student is much smaller than the Teacher, ExOPD helped the Student perform significantly better than standard methods.
Bonus: When they used the "Reference Point" fix (comparing to the Teacher's younger self), the Student performed even better.

Why Does This Matter?

Breaking the Ceiling: Previously, we thought a student could never be smarter than the teacher they learned from. This paper proves that with the right "volume knob," students can break that ceiling.
Efficiency: It's a more efficient way to train AI. Instead of needing massive amounts of new data or computing power, we just tweak how we interpret the feedback the AI gets.
Unified Intelligence: It solves the problem of "specialization." You can take different specialized AIs and merge them into one generalist AI that is actually better than the sum of its parts.

Summary in One Sentence

This paper teaches AI students how to stop just copying their teachers and start "over-achieving" by exaggerating the rewards for doing well, allowing them to become smarter than the experts who taught them.

1. Problem Statement

On-Policy Distillation (OPD) has emerged as a powerful post-training paradigm for Large Language Models (LLMs), where a student model learns from a teacher's logit distribution on trajectories generated by the student itself. While OPD empirically outperforms off-policy distillation (SFT) and standard Reinforcement Learning (RL), its theoretical underpinnings remain under-explored.

Key limitations in the current OPD landscape include:

Lack of Theoretical Clarity: The relationship between OPD and dense KL-constrained RL is not fully established.
Fixed Weighting: Standard OPD implicitly fixes the weight between the reward signal (teacher logits) and the KL regularization (reference model) to a 1:1 ratio, limiting flexibility.
Performance Ceiling: In multi-teacher settings (merging domain experts) or strong-to-weak distillation, standard OPD often struggles to surpass the performance of the best teacher or fully recover the teacher's capabilities without instability.
Reference Model Rigidity: The choice of the reference model in OPD is often arbitrary or fixed to the student's initial state, potentially introducing noise in strong-to-weak scenarios.

2. Methodology: Generalized On-Policy Distillation (G-OPD)

The authors propose G-OPD, a framework that theoretically unifies OPD with dense RL and introduces two key generalizations: a flexible reference model and a reward scaling factor.

Theoretical Foundation

The authors derive that standard OPD is a special case of dense KL-constrained RL where:

The reward function is the log-ratio of the teacher to a reference model: $r(x, y) = \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)}$ .
The reward term and KL regularization are weighted equally ( $\beta = 1$ ).
The reference model $\pi_{ref}$ can be any model, not just the policy's starting checkpoint.

The G-OPD Objective

To generalize OPD, the authors introduce a reward scaling factor ( $\lambda$ ) and a flexible reference model ( $\pi_{ref}$ ). The objective function is formulated as:

$J_{G-OPD}(\theta) = \max_{\theta} \mathbb{E}_{x \sim D, y \sim \pi_\theta} \left[ \lambda \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} - D_{KL}(\pi_\theta(y|x) \parallel \pi_{ref}(y|x)) \right]$

Where:

$\pi^*$ : The teacher policy.
$\pi_\theta$ : The student policy being optimized.
$\pi_{ref}$ : A flexible reference model.
$\lambda$ : A hyperparameter controlling the relative weight of the reward term against the KL regularization.

Key Variants & Insights

Reward Extrapolation (ExOPD, $\lambda > 1$ ):
- When $\lambda > 1$ , the student is encouraged to fit an "extra shift term" beyond the teacher's log-probabilities.
- Hypothesis: This allows the student to learn beyond the teacher's capability boundary, potentially surpassing the teacher in specific domains.
- Application: Particularly effective in Multi-Teacher Distillation, where a unified student is trained to merge knowledge from multiple domain-specific RL experts.
Reward Correction (Strong-to-Weak Distillation):
- In distilling a large teacher to a small student, the default reference model is the student's base model ( $\pi_{student\_base}$ ).
- Insight: The implicit reward $\log \frac{\pi^*}{\pi_{student\_base}}$ can be noisy due to the fundamental capacity gap between teacher and student.
- Correction: If the teacher's pre-RL base model ( $\pi_{teacher\_base}$ ) is available, using it as the reference model yields a cleaner, more accurate reward signal ( $\log \frac{\pi^*}{\pi_{teacher\_base}}$ ), further boosting performance.

3. Key Contributions

Theoretical Unification: Established that OPD is a special case of dense KL-constrained RL with fixed weights, providing a rigorous theoretical basis for extending it.
G-OPD Framework: Proposed a generalized objective with a tunable reward scaling factor ( $\lambda$ ) and flexible reference model selection.
ExOPD Discovery: Demonstrated that setting $\lambda > 1$ (Reward Extrapolation) consistently outperforms standard OPD and can enable a student to surpass all domain teachers in multi-teacher settings.
Reward Correction Strategy: Identified that using the teacher's pre-RL variant as the reference model in strong-to-weak distillation reduces noise and improves distillation efficacy.

4. Experimental Results

Experiments were conducted on Math Reasoning (AIME, HMMT) and Code Generation (HumanEval, MBPP, LiveCodeBench) tasks.

A. Same-Size Distillation (Single & Multi-Teacher)

Setup: Student and Teachers are the same size (Qwen3-4B). Teachers are domain-specific RL variants (Math/Code).
Findings:
- $\lambda < 1$ (Interpolation): Produces behavior between the base model and the teacher.
- $\lambda = 1$ (Standard OPD): Matches teacher performance closely.
- $\lambda > 1$ (ExOPD): With $\lambda = 1.25$ , ExOPD consistently outperformed both standard OPD and the domain teachers.
- Multi-Teacher: In merging Math and Code experts, ExOPD was the only method to produce a unified student that surpassed both individual domain teachers on all benchmarks.
- Stability: Excessive extrapolation ( $\lambda = 1.5$ ) led to instability, suggesting an optimal range exists.

B. Strong-to-Weak Distillation

Setup: Distilling a large teacher (Qwen3-30B) into smaller students (Qwen3-1.7B and Qwen3-4B).
Findings:
- ExOPD vs. OPD: ExOPD significantly outperformed standard OPD and SFT (e.g., +2.3% to +2.7% average accuracy improvement).
- Reward Correction: When the teacher's pre-RL base model was used as the reference (simulated in the paper), performance improved further (e.g., reaching ~52.3% avg accuracy vs. 51.3% for standard ExOPD), validating the "reward correction" hypothesis.

5. Significance and Implications

Breaking the Teacher Ceiling: The paper challenges the conventional wisdom that a student cannot outperform its teacher in distillation. By extrapolating the reward signal, ExOPD allows students to generalize beyond the teacher's specific training distribution.
Efficient Multi-Task Learning: ExOPD offers a robust solution for merging multiple specialized RL models into a single generalist model without the catastrophic forgetting or sub-optimality seen in SFT or standard OPD.
Practical Guidance: The work provides actionable guidelines for practitioners:
- Use $\lambda > 1$ (e.g., 1.25) to push performance boundaries.
- In strong-to-weak distillation, if the teacher's pre-RL checkpoint is available, use it as the reference model to minimize noise.
Theoretical Bridge: It bridges the gap between Distillation and RL, suggesting that many RL techniques (like reward scaling) can be directly applied to distillation with theoretical guarantees.

In summary, this work transforms On-Policy Distillation from a simple imitation task into a flexible, generalized optimization framework capable of generating models that exceed the capabilities of their teachers.