Structured Agent Distillation for Large Language Model

The paper proposes Structured Agent Distillation, a framework that compresses large language model agents into smaller, efficient models by applying span-specific losses to align reasoning and action segments, thereby outperforming standard token-level distillation while preserving decision-making fidelity.

Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

Published 2026-03-13

Imagine you have a brilliant, world-class chef (the Teacher) who can cook complex, multi-course meals. This chef doesn't just throw ingredients into a pot; they have a specific way of thinking: "First, I need to chop the onions. Then, I'll sauté them. Oh, the pan is too hot, I should lower the flame. Now, add the garlic."

Now, you want to teach a young, inexperienced cook (the Student) to make the same meal, but the young cook has a tiny kitchen and a limited budget. They can't afford the expensive, high-end equipment the master chef uses.

The Old Way: "Just Watch and Copy"

Previously, researchers tried to teach the student by saying, "Watch every single word the master says and repeat it exactly."

  • The Problem: The student ends up memorizing the words but not the logic. They might say, "Add garlic," but forget to chop the onions first because the master skipped a step in the transcript. Or they might get confused because the master's internal thoughts ("The pan is too hot") and physical actions ("Lower flame") are mixed together in one long, flat list of words. The student learns to mimic the surface but fails when the situation changes slightly. They become a "parrot" rather than a "chef."

The New Way: "Structured Agent Distillation" (SAD)

This paper introduces a new teaching method called Structured Agent Distillation. Instead of treating the master chef's entire process as one long, messy stream of words, this method acts like a smart editor that cuts the script into two distinct types of scenes:

  1. The "Thinking" Scene (REASON): Where the chef plans, worries, and calculates. "I need to chop onions first."
  2. The "Doing" Scene (ACT): Where the chef actually moves their hands. "Chop onions."
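Concretely, the "smart editor" step splits each teacher trajectory into labeled spans. A minimal sketch, assuming the trajectory interleaves `[THINK]` and `[ACT]` markers (these marker names are illustrative, not the paper's exact tokens):

```python
import re

def segment_trajectory(text):
    """Split a teacher trajectory into (span_type, span_text) pairs.

    Assumes spans are delimited by [THINK] and [ACT] markers; the
    marker format is an illustrative assumption, not the paper's.
    """
    pattern = re.compile(r"\[(THINK|ACT)\]\s*(.*?)(?=\[(?:THINK|ACT)\]|$)", re.S)
    return [(kind.lower(), body.strip()) for kind, body in pattern.findall(text)]

trajectory = (
    "[THINK] The pan is too hot, I should lower the flame. "
    "[ACT] lower_flame "
    "[THINK] Now add the garlic. "
    "[ACT] add(garlic)"
)
spans = segment_trajectory(trajectory)
# Each span is now tagged "think" or "act", ready for span-specific losses.
```

Once spans are tagged, the student can be supervised differently on each kind, which is the core of the method.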

The Analogy: The Director and the Actor

Think of the Teacher as a Director and the Student as an Actor.

  • Token-Level Distillation (The Old Way): The Director says, "Say exactly what I say, word for word." The Actor memorizes the script but doesn't understand why they are saying it. If the Director changes the line slightly, the Actor freezes.
  • Structured Agent Distillation (The New Way): The Director gives the Actor a script with two different colored highlighters:
    • Blue Highlight (Reasoning): "This part is your internal monologue. You need to think through the logic here. Don't just say the words; understand the why."
    • Red Highlight (Action): "This part is your physical movement. You need to execute this specific command perfectly."

The student learns that Thinking and Doing are two different skills that need different kinds of practice.

  • When the student is in the "Blue Zone," they are graded on how well they understand the logic.
  • When they are in the "Red Zone," they are graded on whether they actually picked up the right tool.
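The two grading zones above amount to a span-masked training loss: per-token losses are pooled separately over reasoning tokens and action tokens, then combined. A toy, stdlib-only sketch (the weights, averaging scheme, and function name are illustrative assumptions; the paper's actual terms are distillation losses over teacher-student token distributions):

```python
def span_losses(token_losses, span_labels, w_reason=1.0, w_act=1.0):
    """Combine per-token losses into separate reasoning and action terms.

    token_losses: one loss value per token (e.g. a teacher-student divergence).
    span_labels:  "reason" or "act" for each token position.
    The weights and mean-pooling here are illustrative, not the paper's exact setup.
    """
    reason = [l for l, s in zip(token_losses, span_labels) if s == "reason"]
    act = [l for l, s in zip(token_losses, span_labels) if s == "act"]
    loss_reason = sum(reason) / max(len(reason), 1)  # "Blue Zone" grade
    loss_act = sum(act) / max(len(act), 1)           # "Red Zone" grade
    return w_reason * loss_reason + w_act * loss_act

# Toy example: 3 reasoning tokens, 2 action tokens.
losses = [0.9, 0.6, 0.3, 0.2, 0.4]
labels = ["reason", "reason", "reason", "act", "act"]
total = span_losses(losses, labels)
```

Because each zone is averaged on its own, a long thinking span can no longer drown out a short but critical action span, and the two weights let training emphasize one skill over the other.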

Why This Matters

In the real world, AI agents (like chatbots that can browse the web or control robots) need to do both: Think (plan a route, solve a math problem) and Act (click a button, move a robot arm).

  • Without SAD: The AI tries to learn thinking and doing at the same time, like trying to learn to drive a car while simultaneously learning to juggle. It gets confused, makes mistakes, and the "thinking" part often gets messy.
  • With SAD: The AI learns to separate the two. It gets really good at the "Thinking" part (planning) and really good at the "Doing" part (executing).

The Results

The paper tested this on three different "kitchens":

  1. ALFWorld: A virtual house where an agent has to find objects (like a robot butler).
  2. WebShop: An online store where an agent has to buy specific items.
  3. HotPotQA: A trivia game where the agent has to search the web to answer hard questions.

The Outcome:
The students trained with this new "colored highlighter" method (SAD) were:

  • Smarter: They solved more tasks correctly.
  • Faster: They didn't waste time thinking about things they didn't need to.
  • More Faithful: Their internal thought process matched the master's logic much better, not just their final answer.

The Bottom Line

This paper is like discovering that to teach a robot to be a smart assistant, you shouldn't just feed it a transcript of what a human did. You need to teach it the difference between "thinking" and "acting." By separating these two skills during training, you can create small, cheap, fast AI agents that are just as smart as the giant, expensive ones, without needing a supercomputer to run them.
