Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition

This paper introduces General Policy Composition (GPC), a training-free method that improves diffusion- and flow-based robot policies by convexly combining the distributional scores of multiple pre-trained policies at test time. The authors show, both theoretically and empirically, that this composition yields superior performance and adaptability across diverse tasks.

Jiahang Cao, Yize Huang, Hanzhong Guo, Rui Zhang, Mu Nan, Weijian Mai, Jiaxu Wang, Hao Cheng, Jingkai Sun, Gang Han, Wen Zhao, Qiang Zhang, Yijie Guo, Qihao Zheng, Chunfeng Song, Xiao Li, Ping Luo, Andrew F. Luo

Published Wed, 11 Ma

Here is an explanation of the paper "Compose Your Policies!" using simple language and creative analogies.

The Big Idea: The "Super-Team" Strategy

Imagine you are trying to teach a robot to do a complex task, like stacking bowls or hanging a mug. Usually, you train one "brain" (a policy) to do this job. But sometimes, that single brain gets stuck, makes mistakes, or just isn't good enough.

Traditionally, to make the robot better, you would have to feed it more data and retrain it from scratch. This is like hiring a new teacher and spending months teaching them everything again. It's expensive, slow, and requires massive amounts of data.

This paper proposes a different, smarter idea: Instead of training a new brain, why not let two (or more) existing brains work together?

The authors call their method GPC (General Policy Composition). Think of it as a mixture of experts. If you have two experts—one who is great at seeing colors but bad at judging depth, and another who is great at judging depth but bad at colors—you don't fire them. You put them in a room together, let them debate, and combine their advice to make the perfect decision.

The Core Problem: The "Data Bottleneck"

Robotics is currently stuck in a "data bottleneck." To make a robot really smart, you need millions of hours of video showing humans doing tasks. Collecting this data is hard and expensive.

  • Old Way: "We need a better robot? Let's collect more data and train a bigger model!" (Expensive, slow).
  • New Way (GPC): "We have two good models already. Let's just mix their brains together to make a super-model instantly." (Free, fast, no new training needed).

How It Works: The "Blended Smoothie" Analogy

Imagine you are making a smoothie.

  • Model A is a strawberry smoothie. It tastes great, but it's a bit too sweet.
  • Model B is a blueberry smoothie. It's healthy, but a bit too tart.

If you drink just the strawberry one, you get a sugar crash. If you drink just the blueberry one, your mouth puckers.
GPC is the act of blending them together in the right ratio (say, 60% strawberry, 40% blueberry). The result is a perfectly balanced smoothie that tastes better than either ingredient alone.

In the robot's world:

  1. The Ingredients: The "flavors" are the scores, each model's mathematical guess about which direction to nudge the action next during denoising.
  2. The Blender: The paper proves mathematically that if you mix these guesses with a convex combination (a weighted average whose weights are non-negative and sum to 1), the errors partially cancel out. If Model A is wrong in one direction and Model B is wrong in another, the weighted average points closer to the truth.
  3. The Result: The robot takes a path that is smoother, safer, and more successful than if it had followed just one model.
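The blending step above can be sketched in a few lines of Python. The two score functions and the Euler-style denoising update are toy stand-ins (not the paper's actual models or sampler), chosen so the convex combination is easy to see:

```python
import numpy as np

# Hypothetical stand-ins for two pre-trained policies: each maps a noisy
# action sample (and a timestep, unused in this toy) to a predicted score.
def score_model_a(x, t):
    return -(x - 1.0)          # expert A pulls samples toward action 1.0

def score_model_b(x, t):
    return -(x - 0.8)          # expert B pulls samples toward action 0.8

def composed_score(x, t, w=0.6):
    # Convex combination: w * s_A + (1 - w) * s_B, with 0 <= w <= 1.
    return w * score_model_a(x, t) + (1.0 - w) * score_model_b(x, t)

# One simple Euler-style denoising loop driven by the blended score.
x = np.array([0.0])
for t in np.linspace(1.0, 0.0, 50):
    x = x + 0.1 * composed_score(x, t, w=0.6)
# x converges toward 0.92 = 0.6 * 1.0 + 0.4 * 0.8, a compromise
# between the two experts weighted by the mixing ratio.
```

With w = 0.6 the blended score's zero lies at 0.92, between the two experts' preferred actions, which is exactly the "blended smoothie" behavior described above.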

The Secret Sauce: "Test-Time Search"

Here is the tricky part: You don't always want a 50/50 mix.

  • Sometimes, for a specific task (like "Hang a Mug"), Model A is a genius and Model B is a novice. You want to listen to Model A 90% of the time.
  • Other times (like "Stack Bowls"), Model B is the expert.

The paper introduces a clever trick called Test-Time Search.
Instead of guessing the perfect mix ratio, the robot tries out a few different mixes right before it starts moving.

  • Try 1: "Let's try 50/50." (Robot simulates the move in its head).
  • Try 2: "Let's try 80/20." (Simulates again).
  • Decision: "Okay, the 80/20 mix looks like it will succeed. Let's go with that!"

This happens in seconds, requiring no new training, just a quick "what-if" calculation.
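The "try a few mixes, keep the best" loop can be sketched as a simple grid search. The verifier and the one-line sampler below are hypothetical placeholders: in the paper's setting the candidate actions would come from running the composed sampler, and the scorer would be a task-success estimate, not a distance to a known target:

```python
import numpy as np

# Hypothetical verifier: scores a candidate action (higher is better).
# Here we fake it with distance to a known-good target action.
TARGET = np.array([0.9])

def rollout_score(action):
    return -np.linalg.norm(action - TARGET)

def sample_action(w):
    # Stand-in for running the composed sampler with mixing weight w;
    # in this toy, each expert just proposes a fixed action.
    expert_a, expert_b = np.array([1.0]), np.array([0.5])
    return w * expert_a + (1.0 - w) * expert_b

def test_time_search(candidates=(0.2, 0.5, 0.8)):
    # Try several mixing ratios, keep the one the verifier likes best.
    return max(candidates, key=lambda w: rollout_score(sample_action(w)))

best = test_time_search()
# 0.8 * 1.0 + 0.2 * 0.5 = 0.9, so w = 0.8 lands exactly on the target.
```

The design choice mirrors the text: no gradients and no retraining, just a handful of cheap "what-if" evaluations before committing to a mix.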

Why This is a Big Deal (The Results)

The authors tested this on many different robots and tasks (from simulated benchmarks to real physical robots).

  • The Result: The "blended" robot consistently beat the single robots.
  • The Analogy: It's like a sports team where the players cover each other's weaknesses. If the striker loses the ball, the defender is there to win it back. The team wins more games than the best individual player could alone.
  • Real-World Impact: They showed this works even when mixing different types of robots (some that use cameras, some that use 3D point clouds) and different types of AI architectures. It's a "plug-and-play" upgrade.

Summary: The "No-Training" Upgrade

The Problem: Making robots smarter usually requires expensive data and months of training.
The Solution: GPC takes two or more existing, pre-trained robot brains and blends their advice together in real-time.
The Magic: By mathematically averaging their "opinions" and quickly finding the best mix ratio, the robot becomes smarter, more stable, and more successful without learning a single new thing.

It's the difference between hiring a new employee to fix a problem versus holding a quick meeting with your current team to solve it together. The paper shows that the meeting (composition) often yields a better result than hiring a new person (retraining).