Imagine you are trying to teach a robot to clean your house based on a voice command like, "Please tidy up the living room."
In the past, trying to teach a robot this complex task was like asking a single person to be the CEO, the architect, the construction worker, and the janitor all at once. They would get overwhelmed, confused, and fail.
To solve this, researchers created Hierarchical Policies. Think of this as hiring a Manager (the High-Level planner) and a Worker (the Low-Level controller).
- The Manager looks at the big picture. They break the command "Tidy up" into small steps: "Pick up the cup," "Put it in the sink," "Wipe the table."
- The Worker is the one actually moving the arms. They take the instruction "Pick up the cup" and figure out exactly how to move the robot's fingers to grab it.
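The Manager/Worker split can be sketched in a few lines of toy Python. Everything here is an illustrative stand-in, not the paper's implementation: the class names (`HighLevelPlanner`, `LowLevelController`) and the hard-coded plan are invented for this example.

```python
# A minimal sketch of a hierarchical policy: a "Manager" that breaks a
# language command into sub-goals, and a "Worker" that executes them.
# All names and the hard-coded plan are illustrative, not from the paper.

from typing import List

class HighLevelPlanner:
    """The 'Manager': breaks a language command into sub-goals."""

    def plan(self, command: str) -> List[str]:
        # A real Manager is a learned model; here one example
        # decomposition is hard-coded for illustration.
        known_plans = {
            "Tidy up the living room": [
                "Pick up the cup",
                "Put it in the sink",
                "Wipe the table",
            ],
        }
        return known_plans.get(command, [command])

class LowLevelController:
    """The 'Worker': turns one sub-goal into motor commands."""

    def execute(self, subgoal: str) -> bool:
        # Stand-in for real motor control; pretend every step works.
        print(f"executing: {subgoal}")
        return True

def run(command: str) -> bool:
    manager, worker = HighLevelPlanner(), LowLevelController()
    return all(worker.execute(goal) for goal in manager.plan(command))

run("Tidy up the living room")  # prints the three sub-goals in order
```

Even this toy version makes the coupling problem visible: if `plan` ever emits a sub-goal that `execute` cannot actually handle, the whole chain fails.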
The Problem: The "Out-of-Touch" Manager
The paper identifies a major flaw in how these teams usually work. The Manager is often trained on a massive library of old videos (offline data). They learn what should happen in theory. The Worker is also trained on these old videos.
But here's the catch: The Manager doesn't know the Worker's current limits.
- The Manager might say, "Okay, now jump over the sofa and grab the remote!"
- The Worker tries, trips, and fails.
Why? Because the Manager was trained on perfect, idealized videos and doesn't realize the Worker is currently clumsy or the sofa is too high. This is called a "Coupling Mismatch." The Manager's plans are too fancy for the Worker's actual skills.
Previous attempts to fix this were like hiring a middleman to translate between them, or forcing them to share a specific language. But these methods were rigid and still relied on those old, static videos. They couldn't adapt when the robot got better or when the situation changed.
The Solution: HD-ExpIt (The "Practice Loop")
The authors propose a new framework called HD-ExpIt. Think of this as a Self-Reinforcing Practice Loop.
Instead of just watching old videos, the robot team is put in a real training gym where they can try, fail, and learn from the results.
Here is how the loop works, step-by-step:
The Guess: The Manager looks at the task and generates a plan (a sequence of sub-goals) based on what it knows so far.
The Attempt: The Worker tries to execute this plan in the real world.
The Filter (The Magic Step):
- If the Worker succeeds? Great! We save this success story.
- If the Worker fails? Trash it. We don't learn from the failure; we just know that specific plan didn't work for this Worker right now.
- Analogy: Imagine the Manager is a chef writing a recipe. The Worker is the sous-chef. If the sous-chef burns the cake, the Manager doesn't just write a new recipe based on a textbook. Instead, the Manager looks at the successful cakes the sous-chef actually baked, realizes, "Oh, I asked for a 500-degree oven, but you can only handle 400," and updates their future recipes to match the sous-chef's actual oven.
The Refinement: The robot takes all those successful attempts it just made and uses them to retrain both the Manager and the Worker.
- The Worker gets better at doing the tasks.
- The Manager learns to write plans that are actually possible for the Worker to do. It learns how fast the Worker's "feet" can actually move before telling them to "run."
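The loop above can be sketched as toy code in the spirit of expert iteration / self-imitation. Everything here is an illustrative stand-in, not the paper's actual algorithm: the scalar "skill," the update rule, and all function names are invented for this example.

```python
# A toy practice loop: propose plans (Guess), try them (Attempt),
# keep only successes (Filter), and retrain both levels on them
# (Refinement). All quantities and names are illustrative stand-ins.

import random

random.seed(0)

worker_skill = 0.3     # hardest plan the Worker can currently execute
manager_ceiling = 1.0  # hardest plan the Manager will currently propose

def propose_plan():
    """The Guess: the Manager proposes a plan of some difficulty."""
    return {"difficulty": random.uniform(0.0, manager_ceiling)}

def try_plan(plan):
    """The Attempt: the Worker succeeds only within its current skill."""
    return plan["difficulty"] <= worker_skill

def retrain(successes):
    """The Refinement: both levels learn from successes only."""
    global worker_skill, manager_ceiling
    # Worker: practicing on successes makes it slightly more capable.
    worker_skill = min(1.0, worker_skill + 0.05 * len(successes))
    # Manager: pull proposals toward what the Worker actually achieved,
    # plus a small margin so it keeps stretching the Worker.
    if successes:
        achieved = max(p["difficulty"] for p in successes)
        manager_ceiling = 0.5 * manager_ceiling + 0.5 * (achieved + 0.2)

for iteration in range(5):
    attempts = [propose_plan() for _ in range(10)]
    successes = [p for p in attempts if try_plan(p)]  # the Filter step
    retrain(successes)
    print(f"iter {iteration}: kept {len(successes)}/10 plans, "
          f"worker skill = {worker_skill:.2f}")
```

The design choice that matters mirrors the Filter step: failures are never imitated; they simply produce no training data, so over iterations the Manager's proposals drift toward what the Worker can actually do.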
Why is this a big deal?
Most AI robots are like students who only study from a textbook and never take a practice exam. They know the theory but freeze when things get real.
HD-ExpIt is like a student who takes a practice test, sees where they messed up, studies the questions they got right, and then takes the test again. They get better every single time.
- No Middlemen: It doesn't need a translator or a complex bridge between the Manager and Worker. They just talk to each other through the results of their actions.
- Real-World Adaptation: It learns the robot's actual physical limits, not just what the data says they should be.
- The Result: In the paper's tests (using the CALVIN benchmark, which is like a very hard obstacle course for robots), this method allowed the robot to complete long chains of tasks (like "open drawer, take block, put in box, close drawer") much more reliably than prior methods.
The Bottom Line
The paper introduces a way for robot teams to learn by doing. By letting the robot try things, keeping only the successes, and using those successes to teach the "Manager" how to give better instructions, the whole system becomes smarter, more coordinated, and much better at following human language commands in the real world. It turns a rigid, textbook-trained robot into a flexible, learning partner.