IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning

This paper proposes Imaginary Planning Distillation (IPD), an offline reinforcement learning framework that strengthens Decision Transformer policies by combining a learned world model with Model Predictive Control: suboptimal trajectories in the dataset are augmented with imagined near-optimal rollouts, and the resulting policy outperforms state-of-the-art methods on the D4RL benchmark.

Yihao Qin, Yuanfei Wang, Hang Zhou, Peiran Liu, Hao Dong, Yiding Ji

Published 2026-03-05

The Big Picture: The "Chef Who Only Reads Cookbooks"

Imagine you want to teach a robot chef how to cook a perfect 5-star meal. However, you only have a library of old, slightly burnt cookbooks (this is your Offline Dataset). You can't let the robot go into the kitchen and try new things because it might burn the house down (this is the danger of Online Reinforcement Learning).

Most current AI methods are like chefs who just memorize the recipes in the books. If the book says "add salt," they add salt. If the book has a bad recipe that tastes terrible, they follow it anyway because they don't know any better. They struggle to fix mistakes or combine good parts of different recipes to make something new.

IPD (Imaginary Planning Distillation) is a new method that gives this robot chef a "mental kitchen." It allows the chef to imagine cooking perfect meals inside their head, learn from those imaginary successes, and then apply that wisdom to the real world without ever risking a fire.


How IPD Works: The Three-Step Magic

The paper proposes a three-step process to upgrade the robot chef's brain.

1. Building the "Mental Kitchen" (The World Model)

First, the AI studies the old, imperfect cookbooks to build a World Model. Think of this as a high-tech simulator inside the robot's head.

  • What it does: It learns how the kitchen works. If you throw an egg, where does it land? If you turn the stove to high, how fast does it burn?
  • The Safety Check: Crucially, this simulator knows when it is unsure. If the robot tries to imagine a scenario that is very different from the old books (like cooking a dragon steak), the simulator says, "Whoa, I don't know enough about this. Let's not guess." This prevents the robot from hallucinating dangerous or impossible scenarios.
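The "safety check" above is commonly implemented with an ensemble of dynamics models: when the ensemble members disagree, the imagined state is probably outside the training data. Here is a minimal sketch of that idea with toy linear dynamics; the class, coefficients, and threshold are illustrative assumptions, not the paper's actual architecture.

```python
import statistics

class EnsembleWorldModel:
    """Toy ensemble world model: each member is a slightly different
    linear dynamics model. Disagreement between members serves as an
    epistemic-uncertainty signal, so imagined rollouts can be cut off
    when they drift far from the data the models were fit on."""

    def __init__(self, coefficients, threshold):
        self.coefficients = coefficients  # one (a, b) pair per member
        self.threshold = threshold        # max allowed disagreement

    def step(self, state, action):
        # Each member predicts the next state as s' = a*s + b*action.
        predictions = [a * state + b * action for a, b in self.coefficients]
        mean = statistics.mean(predictions)
        disagreement = statistics.pstdev(predictions)
        # "Safety check": only trust the prediction when members agree.
        trustworthy = disagreement < self.threshold
        return mean, trustworthy

model = EnsembleWorldModel(
    coefficients=[(0.99, 0.10), (1.01, 0.11), (1.00, 0.09)],
    threshold=0.5,
)

next_state, ok = model.step(state=1.0, action=0.5)        # familiar regime
far_state, risky_ok = model.step(state=100.0, action=50.0)  # far from data
```

In the familiar regime the members nearly agree (`ok` is true); scaled up 100x, their small differences are amplified and the model flags its own prediction as untrustworthy rather than hallucinating a "dragon steak" scenario.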

2. The "Dream Rehearsal" (Imaginary Planning)

This is the core innovation. Instead of just reading the bad recipes, the robot uses its Mental Kitchen to run Model Predictive Control (MPC).

  • The Analogy: Imagine the robot finds a recipe in the book that says, "Burn the toast." Instead of following it, the robot pauses and says, "Wait, let me imagine what happens if I do this differently."
  • The Process: The robot simulates thousands of different ways to cook that specific dish in its head. It tries adding less salt, turning the heat down, or flipping the pancake earlier. It picks the best imaginary outcome.
  • The Result: It takes these "perfect imaginary meals" and writes them down as new, high-quality recipes. It essentially replaces the bad parts of the old books with perfect, imagined versions.
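The "simulate thousands of variations, keep the best" loop above is a standard random-shooting form of MPC. This sketch plans in a one-dimensional toy world (the dynamics, reward, and hyperparameters are assumptions for illustration, not the paper's setup):

```python
import random

def toy_dynamics(state, action):
    """Assumed toy 'mental kitchen': the state drifts with the action."""
    return state + 0.5 * action

def reward(state, target=10.0):
    """Higher reward the closer the imagined state is to the target."""
    return -abs(target - state)

def mpc_plan(state, horizon=5, num_candidates=200, seed=0):
    """Random-shooting MPC: imagine many candidate action sequences in
    the model, score each by total imagined reward, keep the best one."""
    rng = random.Random(seed)
    best_score, best_plan = float("-inf"), None
    for _ in range(num_candidates):
        plan = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, score = state, 0.0
        for a in plan:  # roll the candidate plan out in imagination
            s = toy_dynamics(s, a)
            score += reward(s)
        if score > best_score:
            best_score, best_plan = score, plan
    return best_plan, best_score

plan, score = mpc_plan(state=0.0)
```

The winning `plan` is then recorded as a new "high-quality recipe": a short trajectory of actions that, according to the model, beats what the dataset originally contained.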

3. The "Smart Tutor" (Value-Guided Distillation)

Now, the robot needs to learn these new, improved recipes. But how does it know which "Return-to-Go" (the target score) to aim for?

  • The Old Way: Usually, humans have to manually tell the robot, "Aim for a score of 90!" But if the human guesses wrong, the robot gets confused.
  • The IPD Way: The robot uses a Quasi-Optimal Value Function. Think of this as an internal "gut feeling" or a compass. It automatically calculates, "Based on where I am right now, the best possible score I can get is 95."
  • Distillation: The robot trains its main brain (a Transformer, which is like a super-smart pattern recognizer) to mimic these perfect imaginary moves. It learns not just what to do, but why it leads to the best score.
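The key trick in this step is relabeling: instead of a human guessing the Return-to-Go target, the value function supplies it automatically. The sketch below stands in for the paper's learned quasi-optimal value function with a simple max over imagined rollout returns; the function names and data are hypothetical.

```python
def quasi_optimal_value(state, candidate_returns):
    """Stand-in for the paper's quasi-optimal value function: the best
    return achieved by any imagined rollout from this state. IPD learns
    this as a function; here we simply take the max."""
    return max(candidate_returns)

def relabel_for_distillation(trajectory, imagined_returns):
    """Replace hand-picked Return-to-Go targets with value estimates,
    producing (return_to_go, state, action) training tokens for the
    Transformer policy."""
    tokens = []
    for (state, action), returns in zip(trajectory, imagined_returns):
        rtg = quasi_optimal_value(state, returns)
        tokens.append((rtg, state, action))
    return tokens

# Hypothetical trajectory of (state, action) pairs, plus the returns of
# a few imagined rollouts starting from each state.
trajectory = [(0.0, 0.3), (0.15, 0.8)]
imagined = [[40.0, 95.0, 70.0], [88.0, 91.0]]
tokens = relabel_for_distillation(trajectory, imagined)
# tokens[0] → (95.0, 0.0, 0.3): "from here, aim for 95" — no human guess
```

Training the Transformer on these relabeled tokens is the "distillation": the sequence model learns to reproduce the planner's actions conditioned on the value function's own estimate of the best achievable score.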

Why Is This Better? (The Analogy of the "Stitch")

The paper mentions that old methods struggle to "stitch" suboptimal trajectories.

  • The Problem: Imagine a journey where you take a wrong turn, get stuck in traffic, but then eventually find a great shortcut. Old AI models see the whole trip as "bad" because of the traffic. They can't separate the bad traffic from the good shortcut.
  • The IPD Solution: IPD looks at the traffic jam, realizes it's a dead end, and uses its "Mental Kitchen" to imagine a different route that avoids the traffic entirely. It then teaches the robot to take that new route. It effectively stitches together the best parts of different journeys to create a perfect path.

The Results: A Proven Winner

The researchers tested this on the D4RL benchmark, which is like a giant gym with 10 different challenging tasks (walking robots, cooking tasks, pen-writing tasks).

  • The Outcome: IPD outperformed nearly every competing method, from classical value-based algorithms to other sequence-modeling policies.
  • The Scaling Law: They also found a cool pattern: the more "imaginary data" they generated, the better the robot got. It's like saying, "If you let the chef rehearse in their head 1,000 times instead of 100, they become a master chef."

Summary in One Sentence

IPD is a method that lets AI learn from imperfect past data by building a safe "mental simulator" to imagine perfect futures, then teaching the AI to follow those imaginary perfect paths instead of the flawed real ones.

It turns a robot that just memorizes mistakes into a robot that dreams of perfection and learns from it.
