VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

The paper introduces VisPhyWorld, an execution-based framework, and VisPhyBench, a companion benchmark, for evaluating the physical reasoning of Multimodal Large Language Models by requiring them to generate executable simulator code from visual observations. The results reveal that while current models excel at semantic understanding, they struggle to infer physical parameters accurately and to simulate consistent dynamics.

Jiarong Liang, Max Ku, Ka-Hei Hui, Ping Nie, Wenhu Chen

Published 2026-02-20

Imagine you are trying to teach a robot how to understand the physical world. You show it a video of a red ball rolling down a ramp and hitting a stack of blocks. The blocks tumble, and the ball bounces.

Now, you ask the robot: "What happened?"

Most current AI models (Multimodal Large Language Models) are like very talented actors. They can look at the video and give you a perfect script: "The ball rolled down, hit the blocks, and they fell over because of gravity." They sound smart, and they get the "story" right.

But here's the catch: They might just be reciting a script they memorized. They haven't actually understood the physics. If you asked them to predict exactly how the blocks would fall in a slightly different scenario, they might guess wrong because they are just guessing based on patterns, not because they know the laws of physics.

The New Idea: "Show Me the Code"

The paper introduces a new way to test these robots, called VisPhyWorld. Instead of asking the robot to just talk about what happened, the researchers say:

"Don't just tell me what happened. Write the computer code that simulates it. Then, run that code and show me the video."

Think of it like this:

  • Old Way (VQA): You ask a student, "If I drop an egg, will it break?" The student says, "Yes, because eggs are fragile." (Correct answer, but maybe they just memorized the fact).
  • New Way (VisPhyWorld): You ask the student, "Build a virtual egg and a virtual floor in a computer program, drop the egg, and show me the simulation."

If the student's code is bad, the virtual egg might float in the air, pass through the floor like a ghost, or bounce like a rubber ball. The code reveals the truth. You can't fake physics in a running program.
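To make this concrete, here is a minimal toy sketch (not the paper's actual simulator) of what "simulate a dropped egg" might look like as code, using simple Euler integration in plain Python. Delete the gravity line and the egg hangs in mid-air; delete the floor check and it falls through the ground. Either way, the bug is sitting right there in the source.

```python
# Minimal, hypothetical sketch of a dropped-object simulation
# (plain Euler integration; not the simulator used in the paper).

def simulate_drop(height, steps=100, dt=0.01, gravity=-9.81):
    """Drop an object from `height` and record its y-position over time."""
    y, vy = height, 0.0
    trajectory = []
    for _ in range(steps):
        vy += gravity * dt   # forget this line and the egg "floats"
        y += vy * dt
        if y <= 0.0:         # forget this check and it phases through the floor
            break            # an egg breaks rather than bouncing
        trajectory.append(y)
    return trajectory

positions = simulate_drop(height=1.0)
# The recorded positions decrease steadily until impact, as they should.
```

Because the program actually runs, any missing physical rule shows up immediately in the output trajectory, which is exactly the kind of honesty the framework relies on.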

The "VisPhyBench" Test

The researchers built a giant test suite called VisPhyBench. It's like a driving test for AI, but instead of driving a car, the AI has to drive a physics simulation.

  • They show the AI two frames of a video: the starting frame and one from shortly after.
  • The AI has to write code to recreate the scene and predict what happens next.
  • They run the code. If the video looks realistic and follows the laws of physics (gravity, collisions, friction), the AI passes. If the objects glitch through each other or move strangely, the AI fails.
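The loop described above can be sketched roughly as follows. Everything here is a hypothetical stand-in for illustration: the sandbox, the scoring function, and the "model output" strings are not VisPhyBench's real interface.

```python
# Hypothetical sketch of an execution-based evaluation loop.
# All names here are illustrative stand-ins, not VisPhyBench's real API.

def run_in_sandbox(code):
    """Toy sandbox: exec the generated code and read the `frames` it defines."""
    scope = {}
    exec(code, scope)
    return scope["frames"]

def physics_score(predicted, reference):
    """Toy scorer: fraction of frames where positions roughly match."""
    matches = sum(abs(p - r) < 0.05 for p, r in zip(predicted, reference))
    return matches / max(len(reference), 1)

def evaluate(generated_code, reference_frames):
    # 1. Execute the model-generated simulator code.
    try:
        predicted_frames = run_in_sandbox(generated_code)
    except Exception:
        return 0.0  # code that crashes scores zero
    # 2. Compare the rendered rollout against the observed video.
    return physics_score(predicted_frames, reference_frames)

# A "model output" that correctly applies gravity to a falling ball...
good_code = "frames = [1.0 - 0.5 * 9.81 * (0.05 * t) ** 2 for t in range(10)]"
# ...scores perfectly against a ground-truth rollout of the same scene.
reference = [1.0 - 0.5 * 9.81 * (0.05 * t) ** 2 for t in range(10)]
print(evaluate(good_code, reference))  # → 1.0
```

A model output that forgot gravity (e.g. `frames = [1.0 for t in range(10)]`) would diverge from the reference almost immediately and score far lower, which is the whole point of grading by execution rather than by description.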

What Did They Find?

The results were a bit of a "reality check" for the smartest AI models today:

  1. They are great at "Describing," but bad at "Doing."
    The AI models were excellent at describing the scene in words. They could tell you, "That's a red ball, and it's moving fast." But when asked to write the code to simulate that movement, they often failed. They couldn't figure out the exact speed, the angle of the bounce, or how heavy the objects were.

  2. The "Magic Engine" Problem.
    The researchers found that the AI struggled even more when they didn't give it a "physics engine" (a tool that handles the math of gravity and collisions). Without that tool, the AI tried to "guess" the motion, and the results looked like a cartoon where objects float or phase through walls. It turns out, the AI doesn't actually "know" physics; it just knows what physics looks like.

  3. The "Code" is the Truth.
    The biggest win of this paper is that code is honest. You can look at the code the AI wrote and say, "Ah, I see why it failed. It forgot to add gravity to the ball." With a normal video generation model, you just see a weird video and have no idea why it went wrong. With VisPhyWorld, the mistake is visible in the code itself.

The Big Picture

This paper suggests that to make AI models truly "smart" about the real world, we need to stop merely asking them to predict what a video looks like (which can be faked with patterns) and start asking them to build the world (which requires understanding the rules).

It's the difference between a tourist who takes a photo of a waterfall and says, "Wow, that's loud," and an engineer who builds a dam and actually understands how the water pressure works. VisPhyWorld forces the AI to be the engineer.
