The Big Problem: The "Dreaming" Robot
Imagine you have a robot that needs to push a box across a room to a specific spot. To help the robot, you ask a super-smart AI video generator (like a high-tech version of Sora or Runway) to "dream up" a video of the robot doing the task.
The AI generates a beautiful video, and it looks perfect. But there's a catch: the AI is a dreamer, not a physicist.
- It might make the box suddenly teleport from one side of the room to the other.
- It might make the robot's arm pass through a wall.
- It might make the box change shape or color in the middle of the video.
If you try to tell the real robot to copy these video moves exactly, the robot will crash, break, or get stuck because the video violates the laws of physics.
The Solution: GVP-WM (The "Reality Check" System)
The authors propose a new method called GVP-WM. Think of this system as a strict editor or a reality-check coach that sits between the "Dreaming AI" and the "Real Robot."
Here is how it works, step-by-step:
1. The Dreamer (Video Generator)
First, the system asks the video AI to generate a plan.
- Analogy: Imagine an architect drawing a sketch of a house. The sketch looks amazing, but the architect might have drawn a staircase that leads to the ceiling or a door that opens into thin air. The sketch is a vision, not a blueprint that can be built yet.
2. The Reality Engine (World Model)
The system has a second component called a World Model. This is a learned simulator that predicts how the real world actually behaves (gravity, friction, how objects collide).
- Analogy: This is like a structural engineer who knows that if you put a heavy beam on a weak pillar, the house will collapse. The engineer knows the "rules of physics."
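In code, a world model is just a function that takes the current state and a proposed action and predicts what state comes next. Here is a deliberately tiny sketch of that idea for a 1-D box-pushing scene; the names, numbers, and hard-coded rule are illustrative stand-ins for the paper's trained neural network, not its actual model.

```python
# Toy sketch of a world model as a one-step predictor.
# All names and dynamics here are illustrative, not from the paper.

MAX_PUSH = 0.1  # assumed physics: the box can slide at most 0.1 m per step

def world_model_step(box_pos: float, push: float) -> float:
    """Predict where the box ends up after one push.

    A real world model is a trained neural network; this toy version
    encodes a single physical rule: the box can only slide a little.
    """
    return box_pos + max(-MAX_PUSH, min(MAX_PUSH, push))

# The "dream" video says the box jumps from 0.0 to 1.0 in one frame.
predicted_next = world_model_step(box_pos=0.0, push=1.0)
print(predicted_next)  # 0.1 -- the model says a teleport is impossible
```

The key point is that the world model answers "what would really happen?" for any proposed move, which is exactly the query the planner makes in the next step.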
3. The Translation (Latent Collocation)
This is the magic part. GVP-WM doesn't just copy the video. Instead, it takes the "Dream Sketch" and rewrites it so it fits the "Engineer's Rules." The two words in the name matter: the rewriting happens in the world model's internal ("latent") representation of the scene rather than on raw pixels, and "collocation" means the whole trajectory is optimized at once, with every step nudged until it obeys the model's physics.
The Process:
- The system looks at the video plan.
- It asks the World Model: "If the robot tries to do this move, what actually happens?"
- If the video says "Teleport the box," the World Model says, "No, that's impossible. The box can only slide."
- The system then optimizes the plan. It keeps the goal of the video (move the box to the red spot) but changes the path to something physically possible.
Analogy: Imagine you are trying to follow a GPS route that says "Drive through the mountain." Your car (the robot) can't do that. The GVP-WM system is like a smart co-pilot who looks at the GPS, says, "Okay, we want to get to the other side, but we can't drive through the mountain. Let's find the tunnel instead." It keeps the destination but changes the route to be drivable.
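The "optimizes the plan" step above can be sketched as a small trajectory optimization: keep every point of the plan close to the dreamed video, but add a penalty whenever two consecutive points are farther apart than physics allows. This is a bare-bones illustration of the collocation idea under toy assumptions (a 1-D box position and a hand-set speed limit); the paper's actual method operates in a learned latent space with a neural world model.

```python
# Toy sketch of "collocation": jointly adjust every step of a dreamed
# plan so it stays close to the video while obeying a physics limit.
# Purely illustrative -- positions, limits, and weights are assumptions.

MAX_STEP = 0.25                 # assumed physics: box slides at most 0.25 per step
dream = [0.0, 0.0, 1.0, 1.0]    # the video "teleports" the box mid-plan

def cost(plan, weight=10.0):
    """Stay close to the dream, but penalize impossible jumps."""
    c = sum((p - d) ** 2 for p, d in zip(plan, dream))
    for a, b in zip(plan, plan[1:]):
        c += weight * max(0.0, abs(b - a) - MAX_STEP) ** 2
    return c

def grad(plan, eps=1e-5):
    """Finite-difference gradient of the cost over the whole trajectory."""
    base = cost(plan)
    g = []
    for i in range(len(plan)):
        bumped = plan[:i] + [plan[i] + eps] + plan[i + 1:]
        g.append((cost(bumped) - base) / eps)
    return g

plan = list(dream)              # start from the dreamed video...
for _ in range(2000):           # ...and nudge every step at once
    plan = [p - 0.01 * g for p, g in zip(plan, grad(plan))]
```

After optimization, the single impossible teleport is spread out into several small, drivable slides while the start and goal stay close to the video's: the destination is kept, only the route changes.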
Why is this special?
Most previous methods tried to do one of two things:
- Copy the video directly: This fails because the video is often impossible (like trying to drive through a mountain).
- Ignore the video and plan from scratch: This works, but it's slow and the robot might forget what the original goal looked like.
GVP-WM does both: It uses the video as a guide (to know what to do) but uses the World Model as a filter (to ensure how it's done is possible).
The Results
The researchers tested this on robots pushing objects and navigating mazes.
- The "Dream" Videos: Often had glitches like motion blur or objects disappearing.
- The Result: Even when the video was messy or physically impossible, GVP-WM successfully figured out a real, executable path for the robot. It was much better than systems that just tried to copy the video directly.
Summary in One Sentence
GVP-WM is a system that takes a "dreamy," often impossible video plan generated by AI and edits it in real-time to ensure it follows the laws of physics, turning a fantasy into a workable instruction for a real robot.