Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

Imagine you have a brilliant, world-class chef (the Teacher) who can cook a complex, 10-course gourmet meal. However, you want to teach a young, energetic apprentice (the Student) to cook the same dishes, but the apprentice has a tiny kitchen, limited ingredients, and a short attention span.

If you just hand the apprentice the chef's massive, detailed recipe book and say, "Copy this exactly," the apprentice will get overwhelmed. They might burn the kitchen down, forget steps, or just start repeating the same sentence over and over because they can't hold all that information in their head.

This is the problem the paper BRIDGE solves. It's a new way to teach small AI models how to think clearly and briefly, without losing the logic.

Here is how the paper's "three-stage curriculum" works, using our kitchen analogy:

Stage 1: The "Jumbled Puzzle" Warm-up

The Problem: If you just ask the apprentice to memorize the recipe word-for-word, they will just parrot it without understanding why you chop the onions before frying the garlic. They are copying, not learning.

The Solution: The paper suggests taking the chef's perfect recipe, shuffling the steps (putting the dessert before the soup!) and hiding some steps (covering the "add salt" instruction with a blank).

The Analogy: Imagine giving the apprentice a jigsaw puzzle where the pieces are mixed up and some are missing. They have to figure out the logical order (you can't bake the cake before mixing the batter) and fill in the missing pieces based on context.
The Result: The apprentice stops trying to memorize the text and starts understanding the structure of cooking. They learn the "skeleton" of the logic.

Stage 2: The "Speed Run" Challenge

The Problem: Now the apprentice understands the logic, but they still talk too much. They might explain every single chop of the knife in excruciating detail. We want them to be concise.

The Solution: The paper introduces a game called GRPO (Group Relative Policy Optimization). Think of this as a cooking competition.

The Analogy: The apprentice is asked to cook the dish again. This time, they generate five different versions of the recipe.
- Version A is correct but 10 pages long.
- Version B is 2 pages long but burns the food.
- Version C is 1 page long and tastes perfect.
The "Judge" (the AI reward system) says: "If the food is burnt, you get zero points, no matter how short the recipe is. But if the food is perfect, the shorter the recipe, the more points you get."
The Result: The apprentice learns to find the "sweet spot." They realize they don't need to explain how to hold the knife; they just need to say "Chop onions." They learn to be efficient without being wrong.

Stage 3: The "Mentor Rewrite" for Hard Cases

The Problem: Even with the speed run, there are some super-hard dishes (like a soufflé) where the apprentice still fails. They get stuck.

The Solution: For these specific hard cases, the Chef steps in again, but differently. The Chef shows the apprentice the full, long, detailed recipe for the soufflé.

The Analogy: The Chef says, "Here is my 10-page recipe. Your job is not to copy it. Your job is to rewrite it into a 1-page cheat sheet that you can actually remember."
The apprentice has to look at the long explanation, understand the core logic, and distill it down into their own simple words.
The Result: The apprentice learns to internalize the complex logic. They don't just memorize the Chef's words; they absorb the idea of the recipe and can reproduce it in their own, shorter style.

The Grand Finale: What Happened?

The researchers tested this on a math problem dataset (GSM8K).

Before: A small 3-billion-parameter AI model (the apprentice) got about 65% of the math problems right, but it wrote very long, rambling answers.
After BRIDGE: The same model got 76% of the problems right (a huge jump!) and its answers were 27% shorter.

Why is this a big deal?
Usually, when you make an AI shorter, it gets dumber. When you make it smarter, it gets longer. This paper found a way to make the AI both smarter and shorter by teaching it to understand the structure of the problem first, then practice being concise, and finally, learn how to summarize complex ideas on its own.

In a nutshell: Instead of forcing a small brain to memorize a giant encyclopedia, this method teaches it how to read the table of contents, understand the chapters, and then write its own perfect summary.

1. Problem Statement

The paper addresses a fundamental bottleneck in Chain-of-Thought (CoT) distillation: the capacity mismatch between large teacher models and compact student models.

The Challenge: Large teachers (e.g., 14B+ parameters) generate verbose, lengthy reasoning chains to ensure correctness. Small student models (e.g., 3B parameters) lack the representational bandwidth to faithfully reproduce these long sequences via standard Supervised Fine-Tuning (SFT).
Failure Modes: Direct SFT on verbose CoT leads to truncated outputs, repetition loops, or superficial mimicry without genuine understanding.
Limitations of Existing Solutions:
- Implicit Reasoning: Compresses reasoning into hidden states, sacrificing interpretability and verifiability.
- Heuristic Compression: Randomly truncates or prunes reasoning, which destroys logical coherence and degrades performance.
- Standard RL: Often leads to "reward hacking" (producing short but incorrect answers) if not carefully constrained.

Goal: Enable small models to maintain explicit, verifiable reasoning while significantly compressing output length without sacrificing accuracy.

2. Methodology: The BRIDGE Framework

The authors propose BRIDGE, a three-stage curriculum learning framework designed to progressively build the student's reasoning capabilities. The core philosophy is "Internalization before Compression."

Stage 1: Structure-Aware Warmup (Reconstruction)

Objective: Establish a "logical skeleton" and structural understanding before attempting generation.
Mechanism: Instead of standard next-token prediction, the student is trained on a Structure-Aware Reconstruction task.
- Step Shuffling: The order of reasoning steps in the teacher's CoT is permuted.
- Step Masking: Approximately 15% of steps are masked (replaced with <MASK>).
Task: The student must reconstruct the complete, correctly ordered reasoning chain from the shuffled and masked input.
Significance: This forces the model to learn causal dependencies and global semantic structure rather than relying on local token patterns or positional shortcuts. It serves as a critical warmup to overcome the capacity mismatch.

Stage 2: GRPO-Based Compression (Optimization)

Objective: Optimize the trade-off between accuracy and brevity.
Mechanism: Uses Group Relative Policy Optimization (GRPO) on masked completion tasks (without shuffling).
Reward Design (Hierarchical): To prevent reward hacking (where the model produces short, wrong answers), a hierarchical reward function is used:
1. Correctness First: Incorrect answers receive a fixed penalty.
2. Efficiency Second: Only if the answer is correct does the model receive a bonus for brevity.
- Formula: $R(r_i) = R_{base}(r_i) + \mathbb{I}[\text{Correct}(r_i)] \cdot R_{eff}(r_i)$
Significance: GRPO eliminates the need for a separate critic model (reducing memory overhead) and the hierarchical reward ensures the model learns to be concise only when it is correct.

Stage 3: Teacher-Guided Internalization (Refinement)

Objective: Handle "failure cases" where the student cannot solve difficult problems even after Stage 2.
Mechanism:
- Identification: Samples where the Stage 2 model fails are collected into a hard set ( $D_{hard}$ ).
- Scaffolding: For these hard samples, the teacher's full verbose CoT is provided as input.
- Rewriting Task: The student is prompted to rewrite the teacher's solution concisely in its own words, rather than copying it.
Reward Design: Similar to Stage 2, but the efficiency reward is relative to the teacher's length ( $|r_T|$ ). Outputs longer than the teacher are penalized.
Significance: This exploits an asymmetry: while small models struggle to generate verbose CoT from scratch, they have sufficient capacity to comprehend and compress existing logic when guided. This allows the student to internalize complex reasoning patterns.

3. Key Contributions

Identification of Capacity Mismatch: The paper formally identifies that direct SFT on verbose CoT is detrimental to small models and that existing compression methods often sacrifice logical integrity.
BRIDGE Framework: A novel three-stage curriculum that moves from structural reconstruction to self-discovered compression, and finally to teacher-guided internalization.
Hierarchical Reward Mechanism: A specific reward design for GRPO that strictly prioritizes correctness over brevity, effectively preventing reward hacking.
State-of-the-Art Performance: Demonstrates that a 3B model can outperform instruction-tuned variants and prior distillation methods while significantly reducing token count.

4. Experimental Results

The framework was evaluated on mathematical reasoning benchmarks (GSM8K, SVAMP, MATH-500) using Qwen2.5-3B-Base and Llama-3.2-3B-Base as students, with DeepSeek-R1-Distill-Qwen-14B as the teacher.

Key Findings on GSM8K (Qwen2.5-3B):

Accuracy: BRIDGE achieved 76.19% accuracy, an 11.29% improvement over the Base model (64.90%) and outperforming standard CoT distillation (71.50%).
Efficiency: The average output length was reduced to 167 tokens, a 27.4% reduction compared to the Base model (230 tokens) and a massive reduction compared to standard distillation (374 tokens).
Comparison:
- Std-CoT KD: High accuracy (71.50%) but very long outputs (374 tokens).
- Short-CoT: Short outputs (165 tokens) but poor accuracy (39.42%).
- SuperRL: Good accuracy (75.36%) but failed to compress effectively (209 tokens).
- BRIDGE: Best of both worlds (76.19% / 167 tokens).

Generalization:

The model showed strong zero-shot transfer to SVAMP (83.33%) and MATH-500 (38.20%), indicating that the internalized reasoning patterns are generalizable problem-solving strategies, not just dataset memorization.

Ablation Studies:

Removing Shuffling or Masking in Stage 1 significantly degraded performance, confirming the necessity of structural learning.
Stage 3 was crucial for recovering accuracy lost during the compression pressure of Stage 2, particularly for hard samples.

5. Significance and Impact

Efficiency in Edge Deployment: BRIDGE provides a viable path for deploying high-quality reasoning capabilities on resource-constrained devices (e.g., mobile, edge servers) by compressing 14B+ reasoning capabilities into 3B models without losing verifiability.
Paradigm Shift in Distillation: It moves away from "copying" teacher outputs toward "internalizing" reasoning structures. The insight that compression is easier than generation for small models when guided is a critical theoretical contribution.
Robust RL Training: The hierarchical reward design offers a blueprint for applying RL (specifically GRPO) to reasoning tasks where correctness must be strictly enforced before optimizing for other metrics like length.
Interpretability: Unlike implicit reasoning methods, BRIDGE preserves explicit, human-readable reasoning chains, which is essential for debugging and auditing AI decisions.

In summary, BRIDGE successfully bridges the gap between the reasoning power of large models and the efficiency requirements of small models through a structured, curriculum-based approach that prioritizes logical understanding before enforcing brevity.

Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

Stage 1: The "Jumbled Puzzle" Warm-up

Stage 2: The "Speed Run" Challenge

Stage 3: The "Mentor Rewrite" for Hard Cases

The Grand Finale: What Happened?

1. Problem Statement

2. Methodology: The BRIDGE Framework

Stage 1: Structure-Aware Warmup (Reconstruction)

Stage 2: GRPO-Based Compression (Optimization)

Stage 3: Teacher-Guided Internalization (Refinement)

3. Key Contributions

4. Experimental Results

5. Significance and Impact

More like this

Conversational Successes and Breakdowns in Everyday Smart Glasses Use

EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents

GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction

PyEncode: An Open-Source Library for Structured Quantum State Preparation

DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation