Inference-time Physics Alignment of Video Generative Models with Latent World Models

Imagine you have a magical video camera that can dream up any scene you describe. You tell it, "Show me a glass falling off a table and shattering," and it creates a beautiful, high-definition video. But there's a catch: in the video, the glass might float upward before falling, or it might pass through the table like a ghost. The visuals are stunning, but the physics are broken.

This is the current problem with the world's best AI video generators. They are great at making things look pretty, but they don't always understand how the real world works (like gravity, collisions, or how liquids flow).

This paper introduces a clever fix called WMReward. Instead of trying to re-teach the AI how physics works from scratch (which is hard and slow), the authors give the AI a smart supervisor that checks its work while it's dreaming.

Here is how it works, broken down with simple analogies:

1. The Problem: The Dreamer vs. The Realist

Think of the video generator (like Sora or MAGI-1) as a creative dreamer. It has a huge library of images and videos it has seen, and it tries to mix them together to create something new. Sometimes, it gets so creative that it forgets the rules of reality.

2. The Solution: The "Physics Coach" (Latent World Model)

The authors brought in a Physics Coach. This coach is a special AI called a "Latent World Model" (specifically, one named VJEPA-2).

How the Coach thinks: Unlike the video generator, which worries about every single pixel (like the exact color of a shirt), the Coach looks at the "big picture" of how things move. It understands that if you push a ball, it rolls; if you drop a cup, it breaks. It ignores the fancy details and focuses on the logic of motion.
The "Surprise" Score: The Coach watches the video the dreamer is making. If the video follows the laws of physics, the Coach says, "That makes sense!" (Low surprise). If the video shows a car driving on the ceiling, the Coach says, "Wait, that's impossible!" (High surprise).
The Reward: The system gives the video a "score" based on how much the Coach was surprised. Less surprise = Better physics = Higher score.

3. The Strategy: Trying Many Paths

The video generator doesn't just make one video and stop. It's like a painter trying to get the perfect brushstroke. The authors use two main tricks to find the best video:

The "Best of N" (BoN) Strategy: Imagine the AI paints 16 different versions of the same scene. Then, the Physics Coach looks at all 16 and picks the one that makes the most physical sense. It's like ordering 16 pizzas and only eating the one that tastes right.
The "Guidance" Strategy: Imagine the AI is painting, and the Coach is standing right next to it, whispering, "No, gravity pulls down, not up!" The AI adjusts its brushstrokes while it paints to follow the Coach's advice.

The paper combines these two: The AI paints 16 versions, but while it's painting them, the Coach is whispering advice to make them all better. Then, it picks the absolute best one.

4. The Results: Winning the Physics Olympics

The team tested this on a very hard test called PhysicsIQ, which is like a physics exam for AI.

Before: The best AI models got about 55% on the test. They were good, but they still made silly physics mistakes.
After: With the new "Physics Coach" system, the AI jumped to 62.6%. They didn't just beat the previous record; they crushed it, taking first place in a recent competition.

Why This Matters

This is a big deal because it doesn't require retraining the massive, expensive video models. It's like giving a student a cheat sheet (the reward model) during the exam instead of forcing them to go back to school for another year.

In summary:
The paper shows that if you pair a creative video AI with a logical "Physics Coach" that checks its work while the video is being generated, you can make the videos far less likely to break the laws of nature. The result is videos that look not just beautiful, but real.

1. Problem Statement

State-of-the-art video generative models (e.g., Sora, MAGI-1, vLDM) produce visually compelling content but frequently violate fundamental physics principles (e.g., gravity, object permanence, fluid dynamics). This limits their utility in robotics, autonomous driving, and reliable world modeling.

While prior research attributes this failure to insufficient physics understanding during pre-training, the authors argue that suboptimal inference strategies are also a major cause. Existing inference-time methods often rely on prompt rewriting or Vision-Language Models (VLMs) for motion planning, which are either limited in scope or perform near chance levels when used as reward signals for physics plausibility. The paper posits that the manifold of physically plausible videos exists within the generative model's latent space but that better alignment strategies are required to reach it.

2. Methodology: WMReward

The authors propose WMReward, an inference-time alignment framework that treats improving physics plausibility as a search problem guided by a Latent World Model (LWM).

Core Concept

Instead of retraining the generative model, the method uses a pre-trained LWM (specifically VJEPA-2) as a reward function to steer the sampling process of a video generative model. The LWM is chosen because it learns to predict future states in a compressed latent space, focusing on dynamics and structure rather than superficial pixel details.

The Reward Signal (Surprise Score)

The reward is derived from the "surprise" or prediction error of the LWM:

Sliding Window: A generated video is split into context frames ($C$) and future frames ($M$).
Prediction: The LWM encoder processes the context frames, and the predictor attempts to forecast the latent representation of the future frames.
Comparison: The predicted latent representation is compared against the actual latent representation of the generated future frames (encoded by the same LWM encoder).
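Reward Calculation: The surprise is the cosine distance (one minus the cosine similarity) between the predicted latents $\hat{z}^{fut}_k$ and the actual latents $z^{fut}_k$ of the generated future frames, averaged over the set of sliding windows $K$. The reward $r(x)$ is the negative of this surprise, so that lower surprise yields a higher reward:
$r(x) = -\frac{1}{|K|} \sum_{k \in K} (1 - \cos(\hat{z}^{fut}_k, z^{fut}_k))$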
- Low Surprise (High Reward): The LWM accurately predicted the video's future, implying the video follows physical laws.
- High Surprise (Low Reward): The LWM was "surprised," implying the video contains physically implausible dynamics.
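
The four steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: `toy_encode` and `toy_predict` are hypothetical stand-ins for the VJEPA-2 encoder and predictor (identity features and constant-velocity extrapolation), the window sizes and stride are arbitrary, and the reward is taken to be the negated mean cosine distance so that lower surprise gives a higher reward.

```python
import numpy as np

def sliding_windows(num_frames, context, future, stride):
    """Return (context_slice, future_slice) pairs that fit inside the video."""
    windows = []
    start = 0
    while start + context + future <= num_frames:
        ctx = slice(start, start + context)
        fut = slice(start + context, start + context + future)
        windows.append((ctx, fut))
        start += stride
    return windows

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two arrays, flattened to vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def surprise_reward(frames, encode, predict, context=4, future=2, stride=2):
    """Negated mean cosine distance between predicted and encoded future latents."""
    windows = sliding_windows(len(frames), context, future, stride)
    if not windows:
        raise ValueError("video is too short for the chosen window sizes")
    distances = []
    for ctx, fut in windows:
        z_ctx = encode(frames[ctx])        # latents of the context frames
        z_pred = predict(z_ctx, future)    # forecast of the future latents
        z_true = encode(frames[fut])       # latents of the generated future
        distances.append(1.0 - cosine_similarity(z_pred, z_true))
    return -float(np.mean(distances))

# --- Hypothetical stand-ins for the latent world model (assumptions) ---
def toy_encode(frames):
    """Pretend each frame is already a latent feature vector."""
    return np.asarray(frames, dtype=float)

def toy_predict(z_ctx, future):
    """Constant-velocity extrapolation from the last two context latents."""
    velocity = z_ctx[-1] - z_ctx[-2]
    steps = np.arange(1, future + 1)[:, None]
    return z_ctx[-1] + steps * velocity

# A "video" of 12 frames whose features move at constant velocity,
# and a copy corrupted by large random jumps.
smooth = np.array([1.0, 2.0]) + np.arange(12)[:, None] * np.array([0.5, 0.1])
jumpy = smooth + np.random.default_rng(0).normal(scale=5.0, size=smooth.shape)

r_smooth = surprise_reward(smooth, toy_encode, toy_predict)
r_jumpy = surprise_reward(jumpy, toy_encode, toy_predict)
print("smooth reward:", r_smooth)
print("jumpy reward:", r_jumpy)
```

In this toy setup the smooth sequence is predicted almost exactly, so its reward sits near zero, while the erratic sequence receives a lower (more negative) reward.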

Sampling Schemes

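The authors explore three strategies to sample from the reward-weighted distribution $p^*(x) \propto w(x)p(x)$, where $p(x)$ is the generative model's distribution and $w(x)$ is a weight that grows with the reward $r(x)$: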

Guidance ($\nabla$): Uses the gradient of the reward to steer the denoising trajectory (gradient-based).
Best-of-N (BoN): Generates $N$ independent samples and selects the one with the highest reward (gradient-free).
$\nabla$ + BoN: A hybrid approach where $N$ samples are generated using gradient guidance, and the best one is selected. This combines the benefits of steering the distribution and filtering for the highest quality.
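
The three schemes can be compared on a toy problem. The sketch below is an assumption-heavy illustration rather than the paper's sampler: the base model is a 2-D Gaussian sampled with Langevin-style steps (standing in for a denoising trajectory), the reward is a hypothetical quadratic with an analytic gradient, and the weight is assumed to be $w(x) = \exp(\lambda r(x))$, so that $\nabla \log p^*(x) = \nabla \log p(x) + \lambda \nabla r(x)$. The names `TARGET`, `guidance_scale`, `steps`, and `step_size` are all made up for the example.

```python
import numpy as np

TARGET = np.array([2.0, 0.0])  # hypothetical centre of the "plausible" region

def reward(x):
    """Toy differentiable reward: higher when x is closer to TARGET."""
    return -float(np.sum((x - TARGET) ** 2))

def reward_grad(x):
    """Analytic gradient of the toy reward."""
    return -2.0 * (x - TARGET)

def base_score(x):
    """Score (gradient of the log-density) of the toy base model N(0, I)."""
    return -x

def sample(rng, guidance_scale=0.0, steps=50, step_size=0.05):
    """Langevin-style sampler; guidance adds the scaled reward gradient to the drift."""
    x = rng.normal(size=2)
    for _ in range(steps):
        drift = base_score(x) + guidance_scale * reward_grad(x)
        noise = rng.normal(size=2)
        x = x + step_size * drift + np.sqrt(2.0 * step_size) * noise
    return x

def best_of_n(rng, n, guidance_scale=0.0):
    """Draw n candidates (optionally guided) and keep the highest-reward one."""
    candidates = [sample(rng, guidance_scale) for _ in range(n)]
    rewards = [reward(c) for c in candidates]
    best = int(np.argmax(rewards))
    return candidates[best], rewards[best]

rng = np.random.default_rng(0)
_, r_single = best_of_n(rng, n=1)                      # plain sampling
_, r_bon = best_of_n(rng, n=16)                        # Best-of-N
_, r_guided = best_of_n(rng, n=1, guidance_scale=1.0)  # guidance only
_, r_hybrid = best_of_n(rng, n=16, guidance_scale=1.0) # guidance + BoN
print(r_single, r_bon, r_guided, r_hybrid)
```

Guidance shifts the whole sampling distribution toward higher reward, while Best-of-N only filters what the sampler already produces; the hybrid applies the filter to the shifted distribution.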

3. Key Contributions

Inference-Time Alignment for Physics: The paper reframes physics plausibility as an inference-time alignment problem, demonstrating that pre-trained models can be steered toward physical correctness without retraining.
WMReward Design: Introduces a novel reward signal based on the "surprise" of a latent world model (VJEPA-2), showing that latent-space prediction error is a superior proxy for physics plausibility compared to pixel-reconstruction errors (VideoMAE) or VLM-based binary judgments.
Scaling Laws: Demonstrates that physics plausibility scales effectively with increased inference compute (search space size). The performance improves steadily as the number of particles ($N$) increases, particularly when combined with guidance.
State-of-the-Art Performance: Achieves new records on the PhysicsIQ benchmark, outperforming previous state-of-the-art models by a significant margin.
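
The scaling claim has a simple statistical core: the best of $N$ candidates can only improve as $N$ grows. The following sketch illustrates that selection effect with synthetic numbers only. The reward values are random draws from an arbitrary Gaussian, not results from the paper, and `best_of_n_curve` is a hypothetical helper.

```python
import numpy as np

def best_of_n_curve(rewards, ns):
    """Mean best reward when only the first n candidates of each trial are kept."""
    return {n: float(rewards[:, :n].max(axis=1).mean()) for n in ns}

rng = np.random.default_rng(0)
# Synthetic rewards (an assumption): 1000 trials x 32 candidates per trial.
rewards = rng.normal(loc=-0.3, scale=0.1, size=(1000, 32))

curve = best_of_n_curve(rewards, ns=[1, 2, 4, 8, 16, 32])
for n, value in curve.items():
    print(f"N={n:2d}  mean best reward = {value:.3f}")
```

Because each larger candidate set contains the smaller one, the curve is non-decreasing by construction. How quickly it flattens depends on the reward distribution, which is where guidance can help by shifting that distribution before selection.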

4. Experimental Results

The method was evaluated on three video generation settings: Image-to-Video (I2V), Video-to-Video (V2V), and Text-to-Video (T2V), using models like MAGI-1, vLDM, and Sora2.

PhysicsIQ Benchmark:
- V2V Generation: Achieved a score of 62.64% (using MAGI-1 with $\nabla$ + BoN), surpassing the previous best by 7.42 percentage points.
- I2V Generation: Surpassed Sora2 by 4.13%.
- Comparison: WMReward significantly outperformed alternative reward signals like VideoMAE and Qwen-VL (which performed near chance levels).
VideoPhy (T2V): Showed substantial improvements in Physics Consistency (PC) for both vLDM (+6.9%) and MAGI-1 (+8.1%).
Human Preference Study:
- Annotators preferred WMReward-generated videos over baselines in 54.9% (vLDM) and 54.5% (MAGI-1) of cases for Physics Plausibility.
- Notably, visual quality and prompt alignment were maintained or slightly improved, indicating that physics alignment does not degrade aesthetic quality.
Generalization: The method works across diverse architectures (autoregressive vs. diffusion) and conditioning types (text, image, video).

5. Significance and Impact

Viability of Latent World Models: The work shows that latent world models, originally designed for prediction, can serve as effective, differentiable reward functions for generative tasks.
Compute-Performance Trade-off: The results indicate that allocating more inference-time compute (via search and guidance) improves physics plausibility, offering a path to improve model reliability without massive retraining costs.
Foundation for Future Research: The paper suggests that while current LWMs have limitations (e.g., understanding material properties like friction), they provide a scalable foundation for building more reliable world models. It highlights that the "physics gap" in generative AI can be bridged through better inference strategies and alignment with predictive models.

In conclusion, WMReward narrows the gap between generative video models and physical reality by using the predictive power of a latent world model as a guide during sampling, achieving state-of-the-art results in physics plausibility on the benchmarks reported.