World Action Models are Zero-shot Policies

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi Wang, Ryan Julian, Danfei Xu, Yilun Du, Yevgen Chebotar, Scott Reed, Jan Kautz, Yuke Zhu, Linxi "Jim" Fan, Joel Jang

Published 2026-02-19


Imagine you are trying to teach a robot how to do chores.

The Old Way (The "VLA" Model):
Think of current robot brains (called Vision-Language-Action models) like a very smart librarian who has read every instruction manual ever written. If you say, "Move the red cup to the blue table," the librarian knows exactly what those words mean. They can find the cup and the table.

But here's the problem: The librarian has never actually moved a cup before. They only know the words for moving. If you ask them to "untie a shoelace" or "fold a shirt in a specific way," they freeze. They know what a shoe is, but they don't have a mental movie of how the laces move when you pull them. They are great at understanding language, but terrible at understanding physics and motion.

The New Way (DreamZero / The "World Action Model"):
The researchers at NVIDIA built a new kind of robot brain called DreamZero. Instead of just being a librarian, DreamZero is like a Hollywood Director who is also a stunt double.

Here is how it works, using a simple analogy:

1. The "Mental Movie" Trick

When you tell DreamZero, "Put the orange in the pumpkin," it doesn't just look for the orange and the pumpkin. It first imagines a short movie of the future.

It says to itself: "Okay, I see the orange. I see the pumpkin. Now, let me play a movie in my head of my hand grabbing the orange, lifting it, and dropping it into the pumpkin."

Once it has "filmed" this future movie in its mind, it looks at the video and says, "Okay, to make this movie happen, my arm needs to move like THIS."

This is the magic. By learning to predict what the world will look like next (the video), the robot automatically learns how to move (the action) to make that video real. It learns physics by watching the world evolve, just like we learn by watching how things fall or roll.

2. Learning from "Chaos" instead of "Repetition"

Traditional robots are like students who only learn by doing the exact same math problem 1,000 times. If you change the numbers slightly, they get confused.

DreamZero is different. It was trained on about 500 hours of "chaotic," real-world robot data, recorded while human operators steered robots through thousands of different, messy, non-repetitive tasks in kitchens, offices, and stores across 22 environments.

Analogy: Imagine learning to drive.
- Old Robot: Drives the exact same route on a closed track 1,000 times. If you put a cone in a new spot, it crashes.
- DreamZero: Drives through a busy city, dealing with traffic, rain, pedestrians, and weird road signs. It learns the rules of the road (physics) rather than just memorizing a route.

Because it learned from this "chaos," it can walk into a brand new room it has never seen before and still figure out how to pick up a strange object.

3. The "Magic Mirror" (Cross-Embodiment)

One of the coolest things DreamZero can do is learn from watching others, even if they look totally different.

The Scenario: You have a robot with two arms (AgiBot). You want it to learn a new trick.
The Old Way: You need a human to hold the robot's arms and physically guide it through the motion for hours.
The DreamZero Way: You just show the robot a 12-minute video of a human doing the task.
- Analogy: It's like watching a video of a gymnast doing a backflip. Even though your body is built differently from the gymnast's, you can still work out the physics of the flip. DreamZero watches the video, understands the "movie" of the motion, and then figures out how its own body can reproduce it.
- The paper also shows that with just 30 minutes of "play" data, DreamZero can be adapted to a completely different robot body while keeping its ability to handle tasks it was never trained on.

4. The Speed Problem (The "Flash" Upgrade)

There was one big catch: DreamZero is a 14-billion-parameter model that generates video. Usually, generating video is slow (like rendering a movie). Robots need to move fast (7 times a second).

The team built a "Turbo Mode" called DreamZero-Flash.

Analogy: Imagine you are painting a picture. Usually, you go over the whole canvas again and again, adding a little more detail with each pass. It takes forever.
The Fix: DreamZero-Flash realizes that for the robot's movement, it doesn't need a perfect, high-definition movie. It just needs a rough sketch of the future to know where to move next. It skips the "high-definition" steps for the video and focuses on the action. Combined with the other engineering optimizations, this makes the system 38 times faster, so it can think and move in real time.

Summary

DreamZero is a robot brain that learns by imagining the future.

Instead of memorizing instructions, it simulates movies of what will happen next.
Because it understands the "movie" of physics, it can handle new tasks and new environments without needing to be retrained.
It can learn from videos of humans or other robots, skipping the need for hours of physical training.
It is fast enough to control a real robot in real-time.

It's the difference between a robot that knows the dictionary definition of "open the door" and a robot that can visualize the door swinging open and knows exactly how hard to push the handle to make it happen.

1. Problem Statement

Current state-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization (understanding language and object identities) but struggle with physical generalization. Specifically, they fail to:

Adapt to novel physical motions or skills not present in their training data (e.g., untying a shoelace vs. picking up an object).
Generalize effectively to unseen environments without extensive, repetitive task-specific demonstrations.
Transfer knowledge across different robot embodiments (e.g., from a human to a robot) without action labels.

Existing VLAs rely on static image-text pretraining, which lacks the spatiotemporal priors (physics, dynamics, and motion continuity) required to predict how the world evolves in response to actions.

2. Methodology: DreamZero (World Action Model)

The authors introduce DreamZero, a 14B parameter World Action Model (WAM) built upon a pretrained image-to-video diffusion backbone (Wan2.1). Unlike VLAs that map observations directly to actions, DreamZero learns inverse dynamics by jointly predicting future visual states and actions.
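The difference between the two policy interfaces can be sketched as type signatures. This is only an illustrative sketch: the names `Observation`, `VLAPolicy`, `WorldActionModel`, and `ToyWAM`, as well as every array shape, are hypothetical and are not taken from the DreamZero code.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple

import numpy as np


@dataclass
class Observation:
    frames: np.ndarray   # (T, H, W, 3) current and past camera frames
    proprio: np.ndarray  # (D,) proprioceptive state, e.g. joint angles
    instruction: str     # natural-language command


class VLAPolicy(Protocol):
    """Direct mapping: observation -> action chunk."""

    def act(self, obs: Observation) -> np.ndarray: ...


class WorldActionModel(Protocol):
    """Joint prediction: observation -> (future frames, action chunk)."""

    def predict(self, obs: Observation) -> Tuple[np.ndarray, np.ndarray]: ...


class ToyWAM:
    """Placeholder that only demonstrates the output shapes (no learning)."""

    def __init__(self, horizon: int = 4, action_dim: int = 7) -> None:
        self.horizon = horizon
        self.action_dim = action_dim

    def predict(self, obs: Observation) -> Tuple[np.ndarray, np.ndarray]:
        # "Imagined" future: here just the last frame repeated `horizon` times.
        future = np.repeat(obs.frames[-1:], self.horizon, axis=0)
        # Toy choice: one action per imagined frame, so the two stay aligned.
        actions = np.zeros((self.horizon, self.action_dim))
        return future, actions


obs = Observation(np.zeros((2, 8, 8, 3)), np.zeros(7), "put the orange in the pumpkin")
future_frames, action_chunk = ToyWAM().predict(obs)
print(future_frames.shape, action_chunk.shape)
```

The point of the sketch is the return type: a World Action Model hands back an imagined future alongside the actions, whereas a VLA returns actions only.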

Core Architecture & Training

Joint Prediction: The model takes language instructions, visual context (current/past frames), and proprioceptive states as input. It autoregressively generates future video frames and action chunks simultaneously.
Training Objective: It uses Flow Matching to denoise joint video and action latents.
- Teacher Forcing: The model is trained to denoise the current noisy chunk conditioned on clean previous chunks.
- Data Diversity: Instead of repetitive demonstrations, DreamZero is trained on heterogeneous, non-repetitive data (500 hours of teleoperation across 22 diverse real-world environments).
Autoregressive Design: DreamZero uses an autoregressive architecture (rather than bidirectional) to preserve native frame rates and ensure precise alignment between video frames and robot actions. It utilizes KV-caching to support long contexts efficiently.
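The training objective described above can be sketched as a toy flow-matching loss over a joint video-and-action latent. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the common linear-interpolation convention (t = 0 is clean data, t = 1 is pure noise), uses flattened NumPy vectors instead of real latents, and the function name `flow_matching_loss` and the `model(x_t, prev_chunks, t)` call signature are hypothetical.

```python
import numpy as np


def flow_matching_loss(model, prev_chunks, video_latent, action_latent, rng):
    """Loss for a single chunk (toy, unbatched)."""
    # Clean joint target: video and action latents flattened into one vector.
    x_clean = np.concatenate([video_latent.ravel(), action_latent.ravel()])
    noise = rng.standard_normal(x_clean.shape)
    t = rng.uniform()  # assumed convention: t=0 is clean data, t=1 is pure noise

    # Linear interpolation path and its constant velocity.
    x_t = (1.0 - t) * x_clean + t * noise
    target_velocity = noise - x_clean

    # Teacher forcing: the context is the ground-truth previous chunks,
    # not the model's own earlier predictions.
    pred_velocity = model(x_t, prev_chunks, t)
    return float(np.mean((pred_velocity - target_velocity) ** 2))


rng = np.random.default_rng(0)
video = rng.standard_normal((2, 4, 4))   # toy video latent for the current chunk
action = rng.standard_normal((2, 7))     # toy action latent for the current chunk
history = [rng.standard_normal(46)]      # clean previous chunk(s), same flat size


def zero_model(x_t, prev, t):
    """Untrained stand-in that always predicts zero velocity."""
    return np.zeros_like(x_t)


print(flow_matching_loss(zero_model, history, video, action, rng))
```

A model that predicted the exact velocity would drive this loss to zero; the stand-in above does not, so the printed value is positive.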

Real-Time Execution Optimizations

Video diffusion models are typically too slow for real-time control (requiring ~5.7s per step). DreamZero achieves 7Hz closed-loop control through a three-tier optimization strategy:

System-Level:
- CFG Parallelism: Running conditional and unconditional forward passes on separate GPUs.
- DiT Caching: Reusing cached velocity vectors when consecutive predictions are similar, reducing effective diffusion steps from 16 to 4.
Implementation-Level:
- Torch Compile & CUDA Graphs: Eliminating CPU overhead and fusing operators.
- Quantization: Using NVFP4 (on Blackwell architecture) for weights/activations while keeping sensitive operations in FP8/FP16.
Model-Level (DreamZero-Flash):
- Decoupled Noise Schedules: During training, video timesteps are biased toward high-noise states (Beta distribution) while action timesteps remain uniform. This allows the model to predict clean actions from noisy visual contexts, enabling single-step inference without significant quality loss.
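The decoupled noise schedule can be illustrated by how training timesteps might be sampled. This is a sketch under assumptions: the summary states only that video timesteps follow a Beta distribution biased toward high noise and that action timesteps are uniform. The Beta parameters `(4.0, 1.0)`, the convention that t = 1 means pure noise, the shared-timestep baseline in the `decoupled=False` branch, and the function name `sample_timesteps` are all hypothetical.

```python
import numpy as np


def sample_timesteps(batch_size, rng, decoupled=True, video_beta=(4.0, 1.0)):
    """Return (t_video, t_action), each in [0, 1]; t=1 is pure noise (assumed)."""
    # Action timesteps stay uniform over the whole noise range.
    t_action = rng.uniform(0.0, 1.0, size=batch_size)
    if decoupled:
        # Beta(a, b) with a > b puts most mass near 1, i.e. very noisy video.
        a, b = video_beta
        t_video = rng.beta(a, b, size=batch_size)
    else:
        # Assumed coupled baseline: video and action share one timestep.
        t_video = t_action.copy()
    return t_video, t_action


rng = np.random.default_rng(0)
t_video, t_action = sample_timesteps(10_000, rng)
print("mean video t:", t_video.mean(), "mean action t:", t_action.mean())
```

With these illustrative parameters the model would mostly see heavily noised video paired with action latents at every noise level, which is the situation it faces at inference when the video is barely denoised.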

3. Key Contributions

New Architecture (WAM): Introduction of DreamZero, a 14B model that jointly predicts video and actions, inheriting rich physical priors from web-scale video data.
Superior Generalization: Demonstrated >2x improvement in zero-shot generalization to unseen tasks and environments compared to state-of-the-art VLAs (GR00T N1.6, $\pi_{0.5}$).
Data Efficiency: Proved that generalist policies can be learned from diverse, non-repetitive data rather than requiring thousands of repetitive demonstrations per task.
Cross-Embodiment Transfer:
- Video-Only Transfer: Using only 10–20 minutes of video data (from humans or other robots) without action labels improved unseen task performance by more than 42% in relative terms.
- Few-Shot Adaptation: Adapted a model trained on one robot (AgiBot G1) to a new robot (YAM) with only 30 minutes of play data, retaining zero-shot generalization.
Real-Time Performance: Achieved 7Hz inference speed (38x speedup over baseline) via system and model optimizations, enabling reactive closed-loop control.
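The speed figures can be cross-checked with a few lines of arithmetic. The sketch assumes that the ~5.7 s per step quoted for unoptimized video diffusion is the baseline against which the 38x speedup is measured; the summary does not state this explicitly.

```python
baseline_latency_s = 5.7   # quoted latency of the unoptimized model
speedup = 38.0             # quoted end-to-end speedup

optimized_latency_s = baseline_latency_s / speedup
control_rate_hz = 1.0 / optimized_latency_s

print(f"latency per step: {optimized_latency_s:.3f} s")
print(f"control rate:     {control_rate_hz:.2f} Hz")
```

Under that assumption the result is about 0.15 s per step, or roughly 6.7 Hz, which is in line with the reported 7Hz closed-loop rate.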

4. Experimental Results

The model was evaluated on AgiBot G1 (bimanual mobile manipulator) and Franka (single-arm) robots.

Zero-Shot Generalization (Unseen Tasks):
- On tasks absent from training (e.g., untying shoelaces, ironing, painting), DreamZero achieved 39.5% average task progress on AgiBot, compared to near-zero (<1%) for from-scratch VLAs and ~16% for pretrained VLAs.
- DreamZero successfully generalized to unseen objects and environments (different geographic locations) where baselines failed.
Post-Training Performance:
- After fine-tuning on specific tasks (shirt folding, fruit packing), DreamZero retained its environment generalization capabilities, outperforming baselines on fruit packing and matching them on others.
Cross-Embodiment Transfer:
- Robot-to-Robot: Transfer from YAM to AgiBot improved task progress from 38.3% to 55.4%.
- Human-to-Robot: Transfer from human egocentric video to AgiBot improved task progress to 54.3%.
Ablation Studies:
- Data Diversity: Training on diverse data yielded 50% task progress vs. 33% on repetitive data.
- Model Scale: The 14B model significantly outperformed the 5B model (50% vs. 21%), whereas scaling VLA size alone did not improve performance on diverse data.
- Architecture: Autoregressive (AR) models produced smoother motions and were 3-4x faster than bidirectional models due to KV caching.
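The cross-embodiment numbers above can be converted into absolute and relative gains. This is a small arithmetic sketch; it assumes the human-to-robot result starts from the same 38.3% baseline as the robot-to-robot result, which the summary does not state explicitly, and the helper name `gains` is hypothetical.

```python
def gains(before_pct, after_pct):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after_pct - before_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


baseline = 38.3  # task progress before transfer (assumed shared baseline)
robot_abs, robot_rel = gains(baseline, 55.4)   # YAM -> AgiBot
human_abs, human_rel = gains(baseline, 54.3)   # human egocentric video -> AgiBot

print(f"robot-to-robot: +{robot_abs:.1f} points, +{robot_rel:.1f}% relative")
print(f"human-to-robot: +{human_abs:.1f} points, +{human_rel:.1f}% relative")
```

Separating points from percent makes it clear that the gains of roughly 16 to 17 points correspond to relative improvements in the low-to-mid 40% range.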

5. Significance and Future Impact

Paradigm Shift: The paper challenges the VLA paradigm, suggesting that World Action Models (predicting future states) are more effective for physical robotics than direct observation-to-action mapping.
Scalability: It opens a pathway to leverage the massive scale of internet video data and human egocentric videos for robot learning without needing action labels, potentially solving the data scarcity bottleneck in robotics.
Efficiency: By decoupling video and action noise schedules, DreamZero-Flash demonstrates that high-quality physical control can be achieved with minimal diffusion steps, making large foundation models viable for real-time edge deployment.
Open Source: The authors have open-sourced model weights, inference code, and benchmarks (RoboArena, PolaRiS), facilitating reproducibility and further research.

In summary, DreamZero establishes a new benchmark for robot foundation models by demonstrating that learning physical dynamics through video prediction enables robust zero-shot generalization, efficient cross-embodiment transfer, and real-time control, surpassing current VLA capabilities.
