VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

The paper introduces VisPhyWorld, an execution-based framework, and VisPhyBench, a companion benchmark, for evaluating the physical reasoning of Multimodal Large Language Models by requiring them to generate executable simulator code from visual observations. The results reveal that while current models excel at semantic understanding, they struggle to infer physical parameters accurately and to simulate consistent dynamics.

Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen

Published 2026-02-20

Imagine you are trying to teach a robot how to understand the physical world. You show it a video of a red ball rolling down a ramp and hitting a stack of blocks. The blocks tumble, and the ball bounces.

Now, you ask the robot: "What happened?"

Most current AI models (Multimodal Large Language Models) are like very talented actors. They can look at the video and give you a perfect script: "The ball rolled down, hit the blocks, and they fell over because of gravity." They sound smart, and they get the "story" right.

But here's the catch: They might just be reciting a script they memorized. They haven't actually understood the physics. If you asked them to predict exactly how the blocks would fall in a slightly different scenario, they might guess wrong because they are just guessing based on patterns, not because they know the laws of physics.

The New Idea: "Show Me the Code"

The paper introduces a new way to test these robots, called VisPhyWorld. Instead of asking the robot to just talk about what happened, the researchers say:

"Don't just tell me what happened. Write the computer code that simulates it. Then, run that code and show me the video."

Think of it like this:

  • Old Way (VQA): You ask a student, "If I drop an egg, will it break?" The student says, "Yes, because eggs are fragile." (Correct answer, but maybe they just memorized the fact).
  • New Way (VisPhyWorld): You ask the student, "Build a virtual egg and a virtual floor in a computer program, drop the egg, and show me the simulation."

If the student's code is bad, the virtual egg might float in the air, pass through the floor like a ghost, or bounce like a rubber ball. The code reveals the truth. You can't fake physics in a running program.
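To make this concrete, here is a minimal toy sketch (not the paper's actual simulator) of what "simulate a dropped egg" might look like as code, using simple Euler integration in plain Python. Delete the gravity line and the egg hangs in mid-air; delete the floor check and it falls through the ground. Either way, the bug is sitting right there in the source.

```python
# Minimal, hypothetical sketch of a dropped-object simulation
# (plain Euler integration; not the simulator used in the paper).

def simulate_drop(height, steps=100, dt=0.01, gravity=-9.81):
    """Drop an object from `height` and record its y-position over time."""
    y, vy = height, 0.0
    trajectory = []
    for _ in range(steps):
        vy += gravity * dt   # forget this line and the egg "floats"
        y += vy * dt
        if y <= 0.0:         # forget this check and it phases through the floor
            break            # an egg breaks rather than bouncing
        trajectory.append(y)
    return trajectory

positions = simulate_drop(height=1.0)
# The recorded positions decrease steadily until impact, as they should.
```

Because the program actually runs, any missing physical rule shows up immediately in the output trajectory, which is exactly the kind of honesty the framework relies on.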

The "VisPhyBench" Test

The researchers built a giant test suite called VisPhyBench. It's like a driving test for AI, but instead of driving a car, the AI has to drive a physics simulation.

  • They show the AI two frames of a video: the starting frame and one from shortly after.
  • The AI has to write code to recreate the scene and predict what happens next.
  • They run the code. If the video looks realistic and follows the laws of physics (gravity, collisions, friction), the AI passes. If the objects glitch through each other or move strangely, the AI fails.
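The loop described above can be sketched roughly as follows. Everything here is a hypothetical stand-in for illustration: the sandbox, the scoring function, and the "model output" strings are not VisPhyBench's real interface.

```python
# Hypothetical sketch of an execution-based evaluation loop.
# All names here are illustrative stand-ins, not VisPhyBench's real API.

def run_in_sandbox(code):
    """Toy sandbox: exec the generated code and read the `frames` it defines."""
    scope = {}
    exec(code, scope)
    return scope["frames"]

def physics_score(predicted, reference):
    """Toy scorer: fraction of frames where positions roughly match."""
    matches = sum(abs(p - r) < 0.05 for p, r in zip(predicted, reference))
    return matches / max(len(reference), 1)

def evaluate(generated_code, reference_frames):
    # 1. Execute the model-generated simulator code.
    try:
        predicted_frames = run_in_sandbox(generated_code)
    except Exception:
        return 0.0  # code that crashes scores zero
    # 2. Compare the rendered rollout against the observed video.
    return physics_score(predicted_frames, reference_frames)

# A "model output" that correctly applies gravity to a falling ball...
good_code = "frames = [1.0 - 0.5 * 9.81 * (0.05 * t) ** 2 for t in range(10)]"
# ...scores perfectly against a ground-truth rollout of the same scene.
reference = [1.0 - 0.5 * 9.81 * (0.05 * t) ** 2 for t in range(10)]
print(evaluate(good_code, reference))  # → 1.0
```

A model output that forgot gravity (e.g. `frames = [1.0 for t in range(10)]`) would diverge from the reference almost immediately and score far lower, which is the whole point of grading by execution rather than by description.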

What Did They Find?

The results were a bit of a "reality check" for the smartest AI models today:

  1. They are great at "Describing," but bad at "Doing."
    The AI models were excellent at describing the scene in words. They could tell you, "That's a red ball, and it's moving fast." But when asked to write the code to simulate that movement, they often failed. They couldn't figure out the exact speed, the angle of the bounce, or how heavy the objects were.

  2. The "Magic Engine" Problem.
    The researchers found that the AI struggled even more when they didn't give it a "physics engine" (a tool that handles the math of gravity and collisions). Without that tool, the AI tried to "guess" the motion, and the results looked like a cartoon where objects float or phase through walls. It turns out, the AI doesn't actually "know" physics; it just knows what physics looks like.

  3. The "Code" is the Truth.
    The biggest win of this paper is that code is honest. You can look at the code the AI wrote and say, "Ah, I see why it failed. It forgot to add gravity to the ball." With a normal video generation model, you just see a weird video and have no idea why it went wrong. With VisPhyWorld, the mistake is visible in the code itself.

The Big Picture

This paper suggests that to make AI models truly "smart" about the real world, we need to stop merely asking them to predict what a video looks like (which can be faked with patterns) and start asking them to build the world (which requires understanding the rules).

It's the difference between a tourist who takes a photo of a waterfall and says, "Wow, that's loud," and an engineer who builds a dam and actually understands how the water pressure works. VisPhyWorld forces the AI to be the engineer.
