Inferring Dynamic Physical Properties from Video Foundation Models

This paper introduces new synthetic and real-world video datasets for predicting dynamic physical properties such as elasticity, viscosity, and friction, and evaluates several inference methods: classical computer vision, prompt-based adaptation of video foundation models, and multi-modal large language models (MLLMs). Pre-trained generative and self-supervised video models perform comparably to each other, approach an oracle baseline, and currently outperform the MLLMs.

Original authors: Guanqi Zhan, Xianzheng Ma, Weidi Xie, Andrew Zisserman

Published 2026-04-14

This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are watching a video of a rubber ball bouncing, honey dripping, or a toy car sliding across a table. You don't need a ruler or a stopwatch to know that the ball is bouncy, the honey is thick, or the table is slippery. Your brain intuitively "feels" the physics just by watching the motion.

This paper asks a big question: Can computers learn to do the same thing?

The researchers from Oxford University and Shanghai Jiao Tong University wanted to see if modern AI models (the "smart" computers that can generate videos or answer questions) actually understand the rules of physics hidden inside a video, or if they are just guessing based on what things look like.

Here is a simple breakdown of what they did and what they found.

1. The Three "Physics Tests"

To test the AI, they created a new dataset called PhysVid. They didn't just throw random videos at the AI; they designed three specific "exams" based on how things move over time (the sketch after this list shows the quantity each exam boils down to):

  • The Bouncy Ball (Elasticity): They watched balls drop and bounce. A super-bouncy ball (like a superball) bounces high; a dull ball (like a lump of clay) barely bounces. The AI had to guess how "bouncy" the ball was just from the heights of successive bounces.
  • The Sticky Liquid (Viscosity): They watched liquids pour onto a plate. Water spreads out fast; honey spreads out slowly. The AI had to guess how "thick" or "sticky" the liquid was based on how quickly it spread.
  • The Sliding Toy (Friction): They watched objects slide across different surfaces. A toy sliding on ice goes far; the same toy sliding on a rug stops quickly. The AI had to guess how "slippery" the surface was based on how fast the object slowed down.
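
To make the three exams concrete, here is a minimal sketch of the quantity each one boils down to. The function names and the simplifying assumptions (clean per-frame height, radius, and speed measurements; constant deceleration for friction) are my own illustration, not the paper's exact estimators:

```python
import math

def restitution(bounce_heights):
    """Elasticity: successive bounce apexes obey h_next = e**2 * h, so the
    coefficient of restitution is e = sqrt(h_next / h). e near 1 is a
    superball; e near 0 is a lump of clay."""
    ratios = [math.sqrt(h2 / h1)
              for h1, h2 in zip(bounce_heights, bounce_heights[1:])]
    return sum(ratios) / len(ratios)

def spreading_rate(radii, fps):
    """Viscosity proxy: how fast the liquid's visible radius grows on the
    plate. Water spreads quickly (low viscosity); honey slowly (high)."""
    seconds = (len(radii) - 1) / fps
    return (radii[-1] - radii[0]) / seconds

def friction_coefficient(speeds, fps, g=9.81):
    """Friction: a sliding object decelerates at a = mu * g, so mu = a / g.
    Assumes speeds in metres per second and roughly constant deceleration."""
    decel = (speeds[0] - speeds[-1]) * fps / (len(speeds) - 1)
    return decel / g
```

For example, a ball that rebounds to 0.64 m after a 1 m drop has e = sqrt(0.64) = 0.8.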

2. The Three AI "Students"

They tested three different types of AI models to see who could pass the exam (a sketch of how a frozen video model can be "asked" a physics question follows this list):

  • The "Generative" Artist (DynamiCrafter): This AI is trained to make videos. It knows how things should move because it has tried to create millions of realistic videos.
    • Analogy: Like a movie director who knows how a ball should bounce because they've directed thousands of action scenes.
  • The "Self-Supervised" Observer (V-JEPA-2): This AI is trained by watching videos and trying to guess what happens next. It learns the "grammar" of motion without being told the rules.
    • Analogy: Like a baby watching the world, learning that if you drop a cup, it falls, without anyone explaining gravity.
  • The "Multilingual" Chatbot (MLLMs like Gemini, GPT-4o): These are the famous chatbots that can see images and talk. The researchers asked them questions like, "How bouncy is this ball?" using different ways of phrasing the question (prompts).
    • Analogy: Like a very smart librarian who has read every book on physics but has never actually seen a ball bounce in real life.
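
For the first two "students", the usual way to pose a physics question to a frozen video model (and roughly what "prompt-based adaptation" means here) is to read out its internal features with a tiny trained head, leaving the big model untouched. Here is a minimal sketch with scikit-learn; `extract_video_features` is a hypothetical stand-in for whatever embedding the chosen backbone (DynamiCrafter, V-JEPA-2, ...) exposes:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def extract_video_features(video):
    """Hypothetical stand-in for the frozen backbone: in reality this would
    return pooled spatio-temporal features for one clip; here it returns a
    random vector so the sketch runs end to end."""
    return rng.normal(size=512)

# Dummy "clips" and labels, e.g. ground-truth elasticity values.
train_videos, test_videos = list(range(100)), list(range(20))
train_labels = rng.uniform(0.0, 1.0, size=100)

# The backbone stays frozen; only this tiny linear readout is trained.
X_train = np.stack([extract_video_features(v) for v in train_videos])
probe = Ridge(alpha=1.0).fit(X_train, train_labels)

X_test = np.stack([extract_video_features(v) for v in test_videos])
predictions = probe.predict(X_test)   # predicted elasticity per clip
```

The point of this design is that only the small readout ever sees labelled physics clips; any real physical "knowledge" has to come from the frozen backbone's features.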

3. The "Oracle" (The Cheat Sheet)

Before testing the AI, the researchers built a reference "Oracle" system. This isn't an AI; it's a set of hand-crafted math rules and classical computer-vision tools that measure the physics directly from the video (like tracking the exact height of each bounce), about as accurately as the footage allows. A toy version is sketched after the analogy below.

  • Analogy: This is the teacher with the answer key and a laser ruler. It sets the "perfect score" that the AI students are trying to reach.
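
Here is a toy version of what such an oracle might do for the bouncing ball, assuming some tracker already gives the ball's height in every frame. The synthetic trajectory and peak-finding recipe are my illustration, not the paper's pipeline:

```python
import numpy as np
from scipy.signal import find_peaks

def bounce_apex_heights(heights):
    """Given the ball's height above the table in each frame (from any
    object tracker), return the apex height of each bounce."""
    peaks, _ = find_peaks(np.asarray(heights))
    return [heights[p] for p in peaks]

# Synthetic height signal whose apexes decay by exactly 0.64x per bounce,
# i.e. a ball with coefficient of restitution e = sqrt(0.64) = 0.8.
k = 4 * np.log(1 / 0.64) / np.pi
t = np.linspace(0, 3, 600)
h = np.abs(np.cos(4 * t)) * np.exp(-k * t)

apexes = bounce_apex_heights(h)
e_estimates = np.sqrt(np.array(apexes[1:]) / np.array(apexes[:-1]))
print(e_estimates)   # each entry should be close to 0.8
```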

4. The Results: Who Passed?

  • The Oracle: Got an A+ (obviously). It could measure the physics perfectly.
  • The Generative & Self-Supervised Models: These did surprisingly well! They got mostly As and Bs.
    • They were great at predicting how bouncy a ball was or how thick a liquid was.
    • They struggled a bit with "friction" (sliding objects) because that requires understanding complex angles and how the camera moves, which is harder to guess.
    • Key Takeaway: These models have actually "learned" some physical intuition just by watching videos. They aren't just guessing; they understand the flow of time.
  • The Chatbots (MLLMs): These did the worst. Even with special instructions (prompts) telling them how to look at the video, they often failed.
    • Analogy: The chatbots were like the librarian who knows the theory of friction but gets confused when looking at a real video. They often focused on what the object was (e.g., "It's a red ball") rather than how it moved.
    • However, when the researchers gave them "few-shot" examples (showing them a few solved problems first, as in the prompt sketch below), they improved, but still couldn't beat the other models.
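
For a sense of what "few-shot" means in practice, here is a sketch of how such a prompt might be assembled. The wording, the 0-to-1 scale, and the `<video:...>` placeholders are illustrative; the paper's actual prompts and the way each API interleaves frames will differ:

```python
def build_few_shot_prompt(examples,
                          question="On a scale of 0 to 1, how elastic is the ball in this video?"):
    """Assemble a few-shot physics prompt for an MLLM: a handful of solved
    examples first, then the unanswered query clip. The <video:...> tags
    mark where the actual frames would be interleaved."""
    parts = []
    for i, (clip_id, answer) in enumerate(examples, start=1):
        parts.append(f"Example {i}: <video:{clip_id}>\n"
                     f"Q: {question}\nA: {answer}")
    parts.append(f"Test clip: <video:query>\nQ: {question}\nA:")
    return "\n\n".join(parts)

prompt = build_few_shot_prompt([("bouncy_ball_01", 0.85),
                                ("clay_lump_02", 0.10)])
print(prompt)
```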

5. The Big Picture

The paper concludes that while AI is getting very good at understanding what is in a video (identifying objects, people, and scenes), it is still learning to understand how things move physically.

  • The Good News: Video generation models (like the ones that make deepfakes or AI movies) have accidentally learned a lot about physics. They know that water spreads and balls bounce because they've practiced making those videos.
  • The Bad News: The "smart" chatbots that can talk and see are currently worse at physics than the video generators. They are great at language, but their "eyes" aren't quite tuned to the laws of physics yet.

In summary: If you want a robot to know how slippery a floor is so it doesn't fall, you should probably ask the video-generation AI, not the chatbot. The video AI has "seen" the world move; the chatbot has only "read" about it.
