Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO

This paper proposes a three-stage curriculum learning framework that leverages structure-aware masking and Group Relative Policy Optimization (GRPO) to efficiently distill Chain-of-Thought reasoning into compact student models. By progressively guiding the model from structural understanding to self-optimized brevity and targeted knowledge internalization, it achieves significant accuracy gains and output-length reduction on GSM8K.

Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao

Published 2026-03-06

Imagine you have a brilliant, world-class chef (the Teacher) who can cook a complex, 10-course gourmet meal. However, you want to teach a young, energetic apprentice (the Student) to cook the same dishes, but the apprentice has a tiny kitchen, limited ingredients, and a short attention span.

If you just hand the apprentice the chef's massive, detailed recipe book and say, "Copy this exactly," the apprentice will get overwhelmed. They might burn the kitchen down, forget steps, or just start repeating the same sentence over and over because they can't hold all that information in their head.

This is the problem the paper BRIDGE solves. It's a new way to teach small AI models how to think clearly and briefly, without losing the logic.

Here is how the paper's "three-stage curriculum" works, using our kitchen analogy:

Stage 1: The "Jumbled Puzzle" Warm-up

The Problem: If you just ask the apprentice to memorize the recipe word-for-word, they will just parrot it without understanding why you chop the onions before frying the garlic. They are copying, not learning.

The Solution: The paper suggests taking the chef's perfect recipe, shuffling the steps (putting the dessert before the soup!), and hiding others (covering the "add salt" instruction with a blank).

  • The Analogy: Imagine giving the apprentice a jigsaw puzzle where the pieces are mixed up and some are missing. They have to figure out the logical order (you can't bake the cake before mixing the batter) and fill in the missing pieces based on context.
  • The Result: The apprentice stops trying to memorize the text and starts understanding the structure of cooking. They learn the "skeleton" of the logic.
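Stripping the analogy away, this stage amounts to corrupting the teacher's reasoning trace before training on it: shuffle the steps, mask some of them, and ask the student to recover the original. Here is a minimal Python sketch of how such a "jumbled puzzle" example might be built; the mask token, shuffle probability, and mask fraction are illustrative choices, not the paper's actual settings:

```python
import random

MASK = "<mask>"  # placeholder mask token; the real one depends on the tokenizer

def make_structure_task(steps, shuffle_prob=0.5, mask_frac=0.3, rng=random):
    """Turn an ordered list of reasoning steps into a 'jumbled puzzle' example.

    The student sees shuffled steps with some replaced by a mask token,
    and must recover the original order and the hidden steps.
    """
    corrupted = list(steps)
    # Hide a fraction of the steps behind mask tokens (at least one).
    n_masked = max(1, int(len(steps) * mask_frac))
    for i in rng.sample(range(len(steps)), n_masked):
        corrupted[i] = MASK
    # Sometimes shuffle the step order so the model must reason about structure.
    if rng.random() < shuffle_prob:
        rng.shuffle(corrupted)
    return {"input": corrupted, "target": list(steps)}

example = make_structure_task(
    ["Mix the batter.", "Pour into a pan.", "Bake for 30 minutes.", "Let it cool."]
)
```

The point of keeping the intact trace as the target is that the loss rewards reconstructing the logical skeleton, not parroting surface text.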

Stage 2: The "Speed Run" Challenge

The Problem: Now the apprentice understands the logic, but they still talk too much. They might explain every single chop of the knife in excruciating detail. We want them to be concise.

The Solution: The paper introduces a game called GRPO (Group Relative Policy Optimization). Think of this as a cooking competition.

  • The Analogy: The apprentice is asked to cook the dish again. This time, they generate five different versions of the recipe.
    • Version A is correct but 10 pages long.
    • Version B is 2 pages long but burns the food.
    • Version C is 1 page long and tastes perfect.
  • The "Judge" (the AI reward system) says: "If the food is burnt, you get zero points, no matter how short the recipe is. But if the food is perfect, the shorter the recipe, the more points you get."
  • The Result: The apprentice learns to find the "sweet spot." They realize they don't need to explain how to hold the knife; they just need to say "Chop onions." They learn to be efficient without being wrong.
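For the curious, the judging rule above can be sketched in a few lines of Python. Only the two rules from the analogy come from the paper: wrong answers score zero no matter how short, and among correct answers shorter is better; the exact reward formula below is our own stand-in. The group-relative normalization is GRPO's standard trick of scoring each candidate against its sibling candidates instead of training a separate value model:

```python
def reward(is_correct, length, max_len=1024):
    """Correctness gates the reward: wrong answers score 0 regardless of
    length; correct answers earn a bonus for brevity. The shape of the
    length bonus here is illustrative, not the paper's actual formula."""
    if not is_correct:
        return 0.0
    return 1.0 + max(0.0, 1.0 - length / max_len)  # shorter correct -> higher

def grpo_advantages(rewards):
    """GRPO's core idea: normalize each sample's reward within its own
    group, (r - mean) / std, so no value network is needed."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    std = std if std > 0 else 1.0  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

# Five candidate answers to the same question: (correct?, token length)
group = [(True, 900), (False, 150), (True, 120), (False, 800), (True, 400)]
rewards = [reward(c, n) for c, n in group]
advantages = grpo_advantages(rewards)
```

Run on the group above, the short correct answer (120 tokens) gets the largest advantage, the burnt-but-brief one gets a negative advantage, which is exactly the "sweet spot" incentive from the analogy.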

Stage 3: The "Mentor Rewrite" for Hard Cases

The Problem: Even with the speed run, there are some super-hard dishes (like a soufflé) where the apprentice still fails. They get stuck.

The Solution: For these specific hard cases, the Chef steps in again, but differently. The Chef shows the apprentice the full, long, detailed recipe for the soufflé.

  • The Analogy: The Chef says, "Here is my 10-page recipe. Your job is not to copy it. Your job is to rewrite it into a 1-page cheat sheet that you can actually remember."
  • The apprentice has to look at the long explanation, understand the core logic, and distill it down into their own simple words.
  • The Result: The apprentice learns to internalize the complex logic. They don't just memorize the Chef's words; they absorb the idea of the recipe and can reproduce it in their own, shorter style.
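In code terms, this stage boils down to: collect the problems the student still fails, have the teacher compress its own long rationale into a short one, and fine-tune the student on those compact rationales. A minimal sketch, where `student_solve` and `teacher_rewrite` are placeholder hooks for whatever solver and rewriter you plug in, not the paper's API:

```python
def build_rewrite_targets(problems, student_solve, teacher_rewrite):
    """For problems the student still gets wrong after RL, ask the teacher
    to rewrite its long rationale into a short 'cheat sheet' version, and
    collect those as fine-tuning targets."""
    targets = []
    for p in problems:
        if student_solve(p["question"]) == p["answer"]:
            continue  # the student already handles this one; skip it
        targets.append({
            "question": p["question"],
            "rationale": teacher_rewrite(p["long_rationale"]),
            "answer": p["answer"],
        })
    return targets
```

The filtering step matters: only the genuinely hard cases get the expensive teacher rewrite, so the student's final fine-tuning set is small and targeted.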

The Grand Finale: What Happened?

The researchers tested this on a math problem dataset (GSM8K).

  • Before: A small 3-billion-parameter AI model (the apprentice) got about 65% of the math problems right, but it wrote very long, rambling answers.
  • After BRIDGE: The same model got 76% of the problems right (a huge jump!) and its answers were 27% shorter.

Why is this a big deal?
Usually, when you make an AI shorter, it gets dumber. When you make it smarter, it gets longer. This paper found a way to make the AI both smarter and shorter by teaching it to understand the structure of the problem first, then practice being concise, and finally, learn how to summarize complex ideas on its own.

In a nutshell: Instead of forcing a small brain to memorize a giant encyclopedia, this method teaches it how to read the table of contents, understand the chapters, and then write its own perfect summary.
