Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Imagine you are teaching a robot to fold a towel. You have two main ways to teach it:

The "Daydreamer" Approach: Before the robot moves its arm, it closes its "eyes" and spends a lot of time imagining every possible way the towel could move in the next few seconds. It simulates the future in its head, then decides what to do based on that daydream.
The "Experienced Chef" Approach: The robot doesn't waste time daydreaming about the future. Instead, it learned how things move by watching thousands of videos while it was in school. Now, when it sees the towel, it just uses that deep understanding to move its arm immediately.

Fast-WAM is a new robot brain that asks a simple question: "Do we actually need the robot to daydream at the moment of action, or is the real magic just in the learning phase?"

The authors of this paper found that the "Daydreamer" approach is slow: the robot spends so much time imagining the future that it moves sluggishly. The "Experienced Chef" approach (which they call Fast-WAM) is just as good at the job and about 4 times faster.

Here is the breakdown of their discovery using simple analogies:

1. The Old Way: "Imagine, Then Do"

Most current robot models work like a movie director who shoots a scene before acting it out.

The Process: The robot sees a task (e.g., "pick up the cup"). It first generates a video in its head showing the cup being picked up. Then, it watches that imaginary video and says, "Okay, based on that video, I will move my arm this way."
The Problem: Generating that imaginary video takes a lot of computing power and time. It's like trying to drive a car while you are still drawing the map of the road ahead. It's accurate, but it's slow.

2. The New Way: Fast-WAM (The "Mental Gym")

The authors realized that the "imagining" part might be unnecessary during the actual task.

The Training (The Gym): While the robot is learning (training), it does practice imagining the future. It watches videos and predicts what happens next. This is like a gymnast practicing flips in the gym to build muscle memory and understand physics. This step is crucial because it teaches the robot how the physical world works.
The Inference (The Game): When it's time to actually do the task (test time), Fast-WAM skips the daydreaming. It doesn't generate a video. Instead, it just uses the "muscle memory" and physics knowledge it built up in the gym to move immediately.
The Result: It's like a gymnast who doesn't need to visualize the routine before every jump; their body just knows what to do because of the training.

3. The Big Discovery

The researchers tested this by creating different versions of the robot:

Version A: The "Daydreamer" (Imagines future, then acts).
Version B: The "Fast-WAM" (Trains by imagining, but acts immediately).
Version C: The "No-Practice" version (Just acts; it was trained, but never practiced imagining the future).

The Shocking Result:

Version A and Version B were almost equally good at the tasks.
Version C (the one that never practiced imagining) did clearly worse, and on the real-world towel-folding test its success rate collapsed to about 10%.

The Conclusion: The secret sauce isn't the robot "thinking about the future" while it's working. The secret sauce is learning how the world works during training. Once the robot understands physics and cause-and-effect, it doesn't need to waste time simulating the future in real-time.

Why This Matters

Speed: Fast-WAM decides what to do in about 190 milliseconds, roughly 4 times faster than the version that imagines a video first. That is quick enough to react in real time, which makes it practical for real-world robots.
Efficiency: Skipping the imagined video saves a large amount of computation every time the robot acts.
Simplicity: It proves that we don't need complex, slow "future vision" systems to build smart robots. We just need to teach them well, and then let them act on instinct.

In a nutshell: Fast-WAM teaches robots to be experts in physics so they don't have to be slow calculators when it's time to move. It's the difference between a chess grandmaster who reads the whole board at a glance and someone who has to work through every line from scratch before each move. The grandmaster wins because of their training, not because they calculate more at the board.

1. Problem Statement

World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control. Unlike standard VLAs that map observations directly to actions, WAMs explicitly model how visual observations evolve under actions. However, most existing WAMs follow an "imagine-then-execute" paradigm:

Imagination: The model iteratively generates (denoises) future video frames at test time.
Execution: The model predicts actions conditioned on these imagined future frames.

The Core Issue: This paradigm incurs substantial test-time latency due to iterative video denoising, making real-time control difficult. Furthermore, it remains unclear whether the performance gains of WAMs stem from:

(A) Explicit Future Imagination: The benefit of having foresight via generated future frames during inference.
(B) Video Co-training: The benefit of learning physically meaningful world representations during the training phase via a video prediction objective.

Current systems entangle these two factors, so their results cannot show which component drives performance.

2. Methodology: Fast-WAM

The authors propose Fast-WAM, an architecture designed to decouple video modeling during training from explicit future generation during inference.

Core Architecture

Backbone: Built upon the Wan2.2-5B video Diffusion Transformer (DiT).
Mixture-of-Transformer (MoT): The model consists of a shared Video DiT and a specialized Action Expert DiT.
Token Structure:
- Clean First-Frame Tokens: Serve as the shared visual anchor (current observation).
- Noisy Future Video Tokens: Used only during training for video modeling.
- Action Tokens: Processed by the Action Expert.
Attention Mechanism: A structured attention mask ensures that:
- Action tokens can attend to the current frame and language instructions.
- Action tokens cannot attend to future video tokens (preventing information leakage).
- Video and action branches share attention to language embeddings.
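
The attention rules above can be sketched as a boolean mask. This is a minimal illustration, not the authors' implementation: the token ordering, the helper name `build_attention_mask`, and every rule other than "action tokens see the current frame and language but not the future video tokens" are assumptions made for the example.

```python
import numpy as np

def build_attention_mask(n_text, n_first, n_future, n_action):
    """Return (mask, offsets); mask[q, k] is True if query q may attend to key k."""
    # Assumed token layout: [text | clean first frame | noisy future video | action]
    sizes = {"text": n_text, "first": n_first, "future": n_future, "action": n_action}
    offsets, start = {}, 0
    for name, size in sizes.items():
        offsets[name] = slice(start, start + size)
        start += size
    mask = np.zeros((start, start), dtype=bool)

    def allow(query, key):
        mask[offsets[query], offsets[key]] = True

    # Assumption: language and the clean first frame never look at noisy tokens.
    allow("text", "text")
    allow("first", "text")
    allow("first", "first")
    # Future video tokens (training only) see language, the anchor frame, and each other.
    allow("future", "text")
    allow("future", "first")
    allow("future", "future")
    # Action tokens see language, the anchor frame, and each other,
    # but NOT the future video tokens (no information leakage).
    allow("action", "text")
    allow("action", "first")
    allow("action", "action")
    return mask, offsets

mask, offsets = build_attention_mask(n_text=2, n_first=3, n_future=4, n_action=2)
print(mask[offsets["action"], offsets["future"]].any())  # False
```

Under this assumed layout, nothing the action tokens attend to depends on the future video tokens, so those tokens can be dropped at inference without changing what the action branch sees.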

Training vs. Inference

Training (Joint Co-training): The model is trained with a joint flow matching objective. It simultaneously learns to predict action chunks ($a_{1:H}$) and future video latents ($z_{1:T}$). The loss function is $L = L_{act} + \lambda L_{vid}$. This forces the visual backbone to learn physically grounded representations.
Inference (Direct Policy): Fast-WAM skips the future video generation step entirely.
- It takes the current observation and instruction.
- It passes them through the video backbone in a single forward pass to extract latent world representations.
- It directly predicts the action chunk using the Action Expert.
- Result: No iterative denoising of future frames; latency is reduced to a single pass.
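
The joint training objective $L = L_{act} + \lambda L_{vid}$ can be sketched as follows. This is a toy NumPy illustration under stated assumptions: it uses a common linear-path (rectified-flow style) form of flow matching, which may differ from the parametrization in the paper, and `predict_velocity`, `act_model`, and `vid_model` are hypothetical stand-ins for the Action Expert and Video DiT.

```python
import numpy as np

def flow_matching_loss(predict_velocity, clean, rng):
    """MSE between predicted and target velocity on a linear data-to-noise path."""
    batch = clean.shape[0]
    # One time value per sample, broadcast over the remaining dimensions.
    t = rng.uniform(size=(batch,) + (1,) * (clean.ndim - 1))
    noise = rng.standard_normal(clean.shape)
    noisy = (1.0 - t) * clean + t * noise   # t=0 is data, t=1 is pure noise
    target = noise - clean                  # derivative of `noisy` w.r.t. t
    pred = predict_velocity(noisy, t)
    return float(np.mean((pred - target) ** 2))

def joint_loss(act_model, vid_model, actions, video_latents, lam, rng):
    """L = L_act + lam * L_vid, with both terms computed by flow matching."""
    l_act = flow_matching_loss(act_model, actions, rng)
    l_vid = flow_matching_loss(vid_model, video_latents, rng)
    return l_act + lam * l_vid

# Toy usage: models that always predict zero velocity (placeholders only).
rng = np.random.default_rng(0)
actions = rng.standard_normal((4, 8, 7))   # (batch, horizon H, action dim), assumed shapes
video = rng.standard_normal((4, 3, 16))    # (batch, frames T, latent dim), assumed shapes
zero_model = lambda x, t: np.zeros_like(x)
print(joint_loss(zero_model, zero_model, actions, video, lam=0.5, rng=rng))
```

At inference only the action branch is evaluated; the video term exists purely to shape the shared representation during training.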

Controlled Variants

To rigorously test their hypothesis, the authors compare Fast-WAM against three controlled variants built on the same framework:

Fast-WAM (Ours): Trains with video co-training; infers without future generation.
Fast-WAM-Joint: Trains with video co-training; infers by jointly denoising video and actions (Standard "Imagine-then-Execute").
Fast-WAM-IDM: Trains with video co-training; infers by generating future video first, then predicting actions (Causal "Imagine-then-Execute").
Fast-WAM w.o. Video Co-train: Same architecture as Fast-WAM but removes the video prediction loss during training (serves as the control for the training objective).
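
One way to see why these comparisons are controlled is to write each variant as a pair of switches and check which switch differs between any two of them. The sketch below is only an illustration; the field names and the labels `"none"`, `"joint"`, and `"video_first"` are invented for this example and are not taken from the paper.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Variant:
    name: str
    video_cotrain: bool    # is the video prediction loss used during training?
    test_time_video: str   # "none", "joint", or "video_first" (hypothetical labels)

VARIANTS = [
    Variant("Fast-WAM", True, "none"),
    Variant("Fast-WAM-Joint", True, "joint"),
    Variant("Fast-WAM-IDM", True, "video_first"),
    Variant("Fast-WAM w.o. Video Co-train", False, "none"),
]

def differing_factors(a, b):
    """List the switches (ignoring the name) on which two variants differ."""
    return [f.name for f in fields(a)
            if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)]

base = VARIANTS[0]
for other in VARIANTS[1:]:
    print(f"{base.name} vs {other.name}: {differing_factors(base, other)}")
```

Each variant differs from Fast-WAM in exactly one switch, which is what lets a performance gap be attributed to either test-time imagination or training-time video co-training.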

3. Key Contributions

Conceptual Insight: The paper identifies and isolates the source of WAM performance, arguing that the primary value lies in training-time video modeling rather than test-time future imagination.
Fast-WAM Architecture: A novel design that retains the representational benefits of world modeling while enabling real-time inference by removing the computationally expensive future synthesis step.
Empirical Evidence: Through controlled ablation studies, the authors demonstrate that removing the video co-training objective causes a massive performance drop, whereas removing test-time imagination (Fast-WAM) results in negligible performance loss compared to imagination-based variants.

4. Experimental Results

Benchmarks

Simulation: Evaluated on LIBERO (4 suites) and RoboTwin 2.0 (bimanual manipulation).
Real-World: Evaluated on a towel-folding task using the Galaxea R1 Lite platform.

Performance Findings

Competitive Accuracy: Fast-WAM achieves results competitive with the state of the art without embodied pretraining.
- RoboTwin: 91.8% success rate (vs. 92.2% for pretrained LingBot-VA).
- LIBERO: 97.6% average success rate (outperforming OpenVLA and $\pi_{0.5}$).
The "Co-training" vs. "Imagination" Trade-off:
- Fast-WAM (No Imagination) $\approx$ Fast-WAM-Joint/IDM (With Imagination): the performance gap between them is minimal.
- Fast-WAM w.o. Video Co-train suffers a significant drop (e.g., RoboTwin drops from 91.8% to 83.8%; LIBERO drops from 97.6% to 93.5%).
- Conclusion: The video prediction objective during training is the critical factor; explicit future generation at test time is largely unnecessary.
Real-World Efficiency:
- Latency: Fast-WAM runs at 190 ms (real-time).
- Comparison: It is roughly 4x faster than Fast-WAM-IDM (810 ms; 810 / 190 ≈ 4.3) and significantly faster than the Joint variant.
- Task Quality: On the towel-folding task, removing video co-training caused success rates to plummet to ~10% and completion times to skyrocket, confirming that the learned world representation is vital for complex manipulation.
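
The headline numbers above can be checked with a few lines of arithmetic. The figures are the ones reported in this summary; the "inferences per second" value is an assumption-laden upper bound that treats inferences as running back to back with no other overhead.

```python
# Latencies reported above, in milliseconds.
FAST_WAM_MS = 190.0
FAST_WAM_IDM_MS = 810.0

speedup = FAST_WAM_IDM_MS / FAST_WAM_MS
# Assumption: inferences run back to back, so this is an upper bound.
inferences_per_second = 1000.0 / FAST_WAM_MS

# Success-rate drops (percentage points) when video co-training is removed.
robotwin_drop = 91.8 - 83.8
libero_drop = 97.6 - 93.5

print(f"speedup over Fast-WAM-IDM: {speedup:.2f}x")               # 4.26x
print(f"inferences per second:     {inferences_per_second:.2f}")  # 5.26
print(f"RoboTwin drop:             {robotwin_drop:.1f} points")   # 8.0 points
print(f"LIBERO drop:               {libero_drop:.1f} points")     # 4.1 points
```

Because each inference yields a whole action chunk rather than a single action, the effective control rate can be higher than the inference rate.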

5. Significance

Efficiency: Fast-WAM provides evidence that high-performance embodied agents do not need to "dream" about the future at runtime. Dropping iterative video generation at inference removes the main latency bottleneck and enables real-time control.
Paradigm Shift: The findings suggest that the community should focus on improving world representation learning during training rather than optimizing complex inference-time generation pipelines.
Practical Deployment: By achieving near-SOTA performance with 190 ms latency and no need for expensive embodied pretraining, Fast-WAM offers a viable path for deploying general-purpose robots in dynamic, real-world environments.

In summary, the paper argues that World Action Models are valuable because they learn better representations during training, not because they imagine the future during execution. Fast-WAM leverages this insight to create a faster, equally effective alternative to existing WAMs.