Imagine you have a magical robot painter that can create movies just by listening to your voice. You ask it, "Show me a ball bouncing," and it paints a video. But sometimes, the robot gets confused: the ball might bounce up into the sky, or it might pass right through the floor like a ghost.
The problem is, the robot's videos often look beautiful. The colors are right, the lighting is perfect, and the ball looks real. So, how do you tell if the robot actually understands physics (the rules of how the real world works) or if it's just guessing based on how things look?
This is the puzzle the paper "LikePhys" solves. Here is the explanation in simple terms:
1. The Problem: The "Beautiful Lie"
Current AI video makers are like talented forgers. They can paint a picture that looks exactly like a real scene, even if the physics inside it is impossible.
- The Old Way: To check if they were lying, humans (or other AI judges) would watch the video and say, "Hmm, that ball bounced weirdly." But this is slow, subjective, and the judges often get distracted by how pretty the video looks rather than by its physics.
- The Goal: We need a way to test the robot's brain, not just its eyes. We need to know if it understands gravity, or if it's just mimicking the look of gravity.
2. The Solution: The "Gut Feeling" Test (LikePhys)
The authors created a method called LikePhys. Instead of asking the robot to make a video and then judging it, they ask the robot to guess which of two videos is "real."
Here is the analogy:
Imagine you are a music critic. You have two songs playing:
- Song A: A perfect, harmonious melody.
- Song B: The same melody, but someone randomly changed a few notes to sound terrible.
You don't need to compose a new song to know which one is better. You just need to listen and say, "Song A feels right; Song B feels wrong."
LikePhys does this for video:
- The Setup: They use a computer simulator to create pairs of videos.
- Video A (Valid): A ball bounces normally.
- Video B (Invalid): The same ball, but it bounces up into the sky or passes through the floor.
- Crucially: Both videos look identical in every way (same colors, same lighting, same camera angle). The only difference is the physics.
- The Test: They feed these videos into the AI model. The model doesn't generate anything; instead, a little noise is added to each video and the model tries to predict that noise, the same denoising step it performs internally whenever it generates.
- The Score: If the model has learned physics, it will "feel" that the Valid video is more familiar and natural. It will assign a higher "likelihood" (a confidence score) to the real one and a lower score to the fake one.
- If the model prefers the fake video, it fails the test.
- If it prefers the real video, it passes.
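To make the pass/fail test concrete, here is a minimal sketch in Python. It is not the paper's implementation: the function names `denoising_error` and `prefers_valid` are illustrative, the toy noise level is a single fixed `sigma`, and the real method scores videos with the diffusion model's full likelihood estimate across many noise levels. The core idea survives the simplification, though: add noise, ask the model to predict it, and treat a lower prediction error as a higher "likelihood."

```python
import numpy as np

rng = np.random.default_rng(0)

def denoising_error(video, denoiser, n_trials=8, sigma=0.5):
    # Add noise to the video several times and measure how well the
    # model predicts the noise that was added. A lower average error
    # is a proxy for higher likelihood: the video "feels" familiar.
    errors = []
    for _ in range(n_trials):
        noise = rng.normal(size=video.shape)
        noisy = video + sigma * noise
        predicted = denoiser(noisy, sigma)
        errors.append(np.mean((predicted - noise) ** 2))
    return float(np.mean(errors))

def prefers_valid(valid_video, invalid_video, denoiser):
    # The pass/fail test: does the model assign a lower denoising
    # error (i.e. a higher likelihood) to the physically valid clip?
    return (denoising_error(valid_video, denoiser)
            < denoising_error(invalid_video, denoiser))
```

A model that has truly internalized what the valid video "should" look like will predict the added noise almost perfectly for it, and badly for the physics-violating twin, so `prefers_valid` returns `True`.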
3. The Results: Who is the Smartest?
The researchers tested 12 different AI video models using this "Gut Feeling" test.
- The Findings: Older models were terrible at this; they often couldn't tell the difference between a real bounce and a ghost-bounce. They were just copying the look of a bounce.
- The Good News: Newer, bigger models (like Hunyuan T2V and Wan2.1) are getting much better. They are starting to actually learn the rules of the universe.
- The Weakness: Even the smartest models still struggle with fluids (like water flowing in a river) and chaos. They are great at solid objects (like blocks and balls) but get confused when things splash or swirl.
4. Why This Matters
Think of AI video models as World Simulators.
- If you want an AI to help a robot learn to walk, or a self-driving car to navigate a storm, the AI needs to know that if a car hits a wall, it stops. It can't just look pretty; it has to be physically true.
- LikePhys is a new ruler that measures how "real" the AI's understanding of the world is, without needing a human to sit there and watch every video.
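The "ruler" boils down to counting: run the pass/fail comparison from section 2 over many simulated video pairs and report the fraction the model got right. A minimal sketch (`physics_score` is an illustrative name, not the paper's metric; the paper aggregates its scores per physics domain):

```python
def physics_score(pair_results):
    # Fraction of video pairs where the model preferred the physically
    # valid clip: 1.0 = always right, 0.5 = coin-flip, 0.0 = always fooled.
    return sum(pair_results) / len(pair_results)
```

Because each pair test is automatic, this score can be computed over thousands of videos with no human watching any of them.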
Summary Metaphor
Imagine you are teaching a child to play with blocks.
- Old Method: You build a tower, knock it over, and ask the child, "Did that look right?" The child might say "Yes" just because the colors were nice.
- LikePhys Method: You show the child two towers. One falls down naturally. The other floats in the air. You ask the child, "Which one feels like it belongs in our world?"
- If the child points to the floating one, they don't understand gravity. If they point to the falling one, they are learning the rules of the world.
LikePhys is simply a way to ask the AI, "Which one feels real?" and trust its answer to tell us how smart it really is.