IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning

This paper proposes Imaginary Planning Distillation (IPD), an offline reinforcement learning framework that strengthens Decision Transformer policies by combining a learned world model with Model Predictive Control: suboptimal trajectories in the dataset are augmented with imagined near-optimal rollouts, and the resulting policy outperforms state-of-the-art methods on the D4RL benchmark.

Yihao Qin, Yuanfei Wang, Hang Zhou, Peiran Liu, Hao Dong, Yiding Ji

Published 2026-03-05

The Big Picture: The "Chef Who Only Reads Cookbooks"

Imagine you want to teach a robot chef how to cook a perfect 5-star meal. However, you only have a library of old, slightly burnt cookbooks (this is your Offline Dataset). You can't let the robot go into the kitchen and try new things because it might burn the house down (this is the danger of Online Reinforcement Learning).

Most current AI methods are like chefs who just memorize the recipes in the books. If the book says "add salt," they add salt. If the book has a bad recipe that tastes terrible, they follow it anyway because they don't know any better. They struggle to fix mistakes or combine good parts of different recipes to make something new.

IPD (Imaginary Planning Distillation) is a new method that gives this robot chef a "mental kitchen." It allows the chef to imagine cooking perfect meals inside their head, learn from those imaginary successes, and then apply that wisdom to the real world without ever risking a fire.


How IPD Works: The Three-Step Magic

The paper proposes a three-step process to upgrade the robot chef's brain.

1. Building the "Mental Kitchen" (The World Model)

First, the AI studies the old, imperfect cookbooks to build a World Model. Think of this as a high-tech simulator inside the robot's head.

  • What it does: It learns how the kitchen works. If you throw an egg, where does it land? If you turn the stove to high, how fast does it burn?
  • The Safety Check: Crucially, this simulator knows when it is unsure. If the robot tries to imagine a scenario that is very different from the old books (like cooking a dragon steak), the simulator says, "Whoa, I don't know enough about this. Let's not guess." This prevents the robot from hallucinating dangerous or impossible scenarios.
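The "safety check" above is commonly implemented with an ensemble of dynamics models: when the ensemble members disagree, the imagined state is probably outside the training data. Here is a minimal sketch of that idea with toy linear dynamics; the class, coefficients, and threshold are illustrative assumptions, not the paper's actual architecture.

```python
import statistics

class EnsembleWorldModel:
    """Toy ensemble world model: each member is a slightly different
    linear dynamics model. Disagreement between members serves as an
    epistemic-uncertainty signal, so imagined rollouts can be cut off
    when they drift far from the data the models were fit on."""

    def __init__(self, coefficients, threshold):
        self.coefficients = coefficients  # one (a, b) pair per member
        self.threshold = threshold        # max allowed disagreement

    def step(self, state, action):
        # Each member predicts the next state as s' = a*s + b*action.
        predictions = [a * state + b * action for a, b in self.coefficients]
        mean = statistics.mean(predictions)
        disagreement = statistics.pstdev(predictions)
        # "Safety check": only trust the prediction when members agree.
        trustworthy = disagreement < self.threshold
        return mean, trustworthy

model = EnsembleWorldModel(
    coefficients=[(0.99, 0.10), (1.01, 0.11), (1.00, 0.09)],
    threshold=0.5,
)

next_state, ok = model.step(state=1.0, action=0.5)        # familiar regime
far_state, risky_ok = model.step(state=100.0, action=50.0)  # far from data
```

In the familiar regime the members nearly agree (`ok` is true); scaled up 100x, their small differences are amplified and the model flags its own prediction as untrustworthy rather than hallucinating a "dragon steak" scenario.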

2. The "Dream Rehearsal" (Imaginary Planning)

This is the core innovation. Instead of just reading the bad recipes, the robot uses its Mental Kitchen to run Model Predictive Control (MPC).

  • The Analogy: Imagine the robot finds a recipe in the book that says, "Burn the toast." Instead of following it, the robot pauses and says, "Wait, let me imagine what happens if I do this differently."
  • The Process: The robot simulates thousands of different ways to cook that specific dish in its head. It tries adding less salt, turning the heat down, or flipping the pancake earlier. It picks the best imaginary outcome.
  • The Result: It takes these "perfect imaginary meals" and writes them down as new, high-quality recipes. It essentially replaces the bad parts of the old books with perfect, imagined versions.
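The "simulate thousands of variations, keep the best" loop above is a standard random-shooting form of MPC. This sketch plans in a one-dimensional toy world (the dynamics, reward, and hyperparameters are assumptions for illustration, not the paper's setup):

```python
import random

def toy_dynamics(state, action):
    """Assumed toy 'mental kitchen': the state drifts with the action."""
    return state + 0.5 * action

def reward(state, target=10.0):
    """Higher reward the closer the imagined state is to the target."""
    return -abs(target - state)

def mpc_plan(state, horizon=5, num_candidates=200, seed=0):
    """Random-shooting MPC: imagine many candidate action sequences in
    the model, score each by total imagined reward, keep the best one."""
    rng = random.Random(seed)
    best_score, best_plan = float("-inf"), None
    for _ in range(num_candidates):
        plan = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, score = state, 0.0
        for a in plan:  # roll the candidate plan out in imagination
            s = toy_dynamics(s, a)
            score += reward(s)
        if score > best_score:
            best_score, best_plan = score, plan
    return best_plan, best_score

plan, score = mpc_plan(state=0.0)
```

The winning `plan` is then recorded as a new "high-quality recipe": a short trajectory of actions that, according to the model, beats what the dataset originally contained.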

3. The "Smart Tutor" (Value-Guided Distillation)

Now, the robot needs to learn these new, improved recipes. But how does it know which "Return-to-Go" (the target score) to aim for?

  • The Old Way: Usually, humans have to manually tell the robot, "Aim for a score of 90!" But if the human guesses wrong, the robot gets confused.
  • The IPD Way: The robot uses a Quasi-Optimal Value Function. Think of this as an internal "gut feeling" or a compass. It automatically calculates, "Based on where I am right now, the best possible score I can get is 95."
  • Distillation: The robot trains its main brain (a Transformer, which is like a super-smart pattern recognizer) to mimic these perfect imaginary moves. It learns not just what to do, but why it leads to the best score.
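The key trick in this step is relabeling: instead of a human guessing the Return-to-Go target, the value function supplies it automatically. The sketch below stands in for the paper's learned quasi-optimal value function with a simple max over imagined rollout returns; the function names and data are hypothetical.

```python
def quasi_optimal_value(state, candidate_returns):
    """Stand-in for the paper's quasi-optimal value function: the best
    return achieved by any imagined rollout from this state. IPD learns
    this as a function; here we simply take the max."""
    return max(candidate_returns)

def relabel_for_distillation(trajectory, imagined_returns):
    """Replace hand-picked Return-to-Go targets with value estimates,
    producing (return_to_go, state, action) training tokens for the
    Transformer policy."""
    tokens = []
    for (state, action), returns in zip(trajectory, imagined_returns):
        rtg = quasi_optimal_value(state, returns)
        tokens.append((rtg, state, action))
    return tokens

# Hypothetical trajectory of (state, action) pairs, plus the returns of
# a few imagined rollouts starting from each state.
trajectory = [(0.0, 0.3), (0.15, 0.8)]
imagined = [[40.0, 95.0, 70.0], [88.0, 91.0]]
tokens = relabel_for_distillation(trajectory, imagined)
# tokens[0] → (95.0, 0.0, 0.3): "from here, aim for 95" — no human guess
```

Training the Transformer on these relabeled tokens is the "distillation": the sequence model learns to reproduce the planner's actions conditioned on the value function's own estimate of the best achievable score.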

Why Is This Better? (The Analogy of the "Stitch")

The paper mentions that old methods struggle to "stitch" suboptimal trajectories.

  • The Problem: Imagine a journey where you take a wrong turn, get stuck in traffic, but then eventually find a great shortcut. Old AI models see the whole trip as "bad" because of the traffic. They can't separate the bad traffic from the good shortcut.
  • The IPD Solution: IPD looks at the traffic jam, realizes it's a dead end, and uses its "Mental Kitchen" to imagine a different route that avoids the traffic entirely. It then teaches the robot to take that new route. It effectively stitches together the best parts of different journeys to create a perfect path.

The Results: A Proven Winner

The researchers tested this on the D4RL benchmark, which is like a giant gym with 10 different challenging tasks (walking robots, cooking tasks, pen-writing tasks).

  • The Outcome: IPD outperformed nearly every competing method, from classical value-based algorithms to other sequence-modeling policies.
  • The Scaling Law: They also found a cool pattern: the more "imaginary data" they generated, the better the robot got. It's like saying, "If you let the chef rehearse in their head 1,000 times instead of 100, they become a master chef."

Summary in One Sentence

IPD is a method that lets AI learn from imperfect past data by building a safe "mental simulator" to imagine perfect futures, then teaching the AI to follow those imaginary perfect paths instead of the flawed real ones.

It turns a robot that just memorizes mistakes into a robot that dreams of perfection and learns from it.
