MOSIV: Multi-Object System Identification from Videos

The paper introduces MOSIV, a novel framework that leverages differentiable simulation and geometry-aligned objectives to identify continuous, per-object material parameters from videos of complex multi-object interactions, outperforming existing methods on a new synthetic benchmark.

Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen, Shizheng Wen, Hao Zhang, Lu Qi, Ming-Hsuan Yang, Laszlo A. Jeni, Min Xu, Yizhou Zhao

Published 2026-03-09

Imagine you walk into a chaotic kitchen where a bowl of jelly, a bag of sand, and a rubber ball are bouncing off each other, sliding across the table, and squishing together.

If you were to record this with a video camera, could you figure out exactly how "squishy" the jelly is, how "gritty" the sand is, and how "bouncy" the rubber ball is? And more importantly, could you use that knowledge to predict exactly what would happen if you threw a third object into the mix?

This is the problem the paper MOSIV solves.

Here is the breakdown of what they did, using simple analogies:

1. The Problem: The "Guessing Game" of Physics

Previous methods for understanding physics from video were like playing a multiple-choice quiz.

  • The Old Way: Imagine a robot watching the jelly. It has a small library of "material cards" in its head: Card A: Jelly, Card B: Water, Card C: Clay. The robot looks at the video and guesses, "Hmm, it looks like Card A."
  • The Flaw: Real life isn't a multiple-choice quiz. The jelly might be slightly stiffer than the one on Card A, or the sand might be damp and behave like no card at all. When objects crash into each other (like the jelly hitting the sand), these guesses get muddled: the robot might decide the sand is actually jelly just because they are touching, producing a simulation that looks wrong and falls apart within seconds.

2. The Solution: MOSIV (The "Digital Twin" Maker)

The authors created a new system called MOSIV. Instead of guessing which card a material is, MOSIV acts like a master chef who tastes the food and measures the exact ingredients.

  • Step 1: The 4D Snapshot (The Camera)
    MOSIV watches the video from many angles (like having 11 cameras around the table). It builds a super-detailed, moving 3D model of every object. Think of this as creating a "digital twin" of the scene that captures exactly how the jelly wobbles and the sand shifts.

  • Step 2: The Physics Engine (The Simulator)
    Inside the computer, MOSIV runs a physics simulator. But instead of just guessing the material, it treats the physical properties (like stiffness, friction, and squishiness) as knobs it can turn.
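
    To make the "knobs" concrete, here is a minimal sketch (our illustration, not the paper's actual data structure) of how per-object material parameters could be stored as continuous values instead of a pick from a fixed menu of material cards:

    ```python
    from dataclasses import dataclass

    # Hypothetical container for one object's "knobs": continuous physical
    # parameters rather than a discrete material category.
    @dataclass
    class MaterialKnobs:
        stiffness: float   # how much the object resists squishing
        friction: float    # how much it grips surfaces it slides on
        density: float     # mass per unit volume

    # Each object in the scene gets its own independent set of knobs,
    # so touching objects can never "share" or confuse their materials.
    scene = {
        "jelly": MaterialKnobs(stiffness=0.02, friction=0.3, density=1.1),
        "sand":  MaterialKnobs(stiffness=0.50, friction=0.8, density=1.6),
        "ball":  MaterialKnobs(stiffness=5.00, friction=0.4, density=0.9),
    }

    print(scene["jelly"].stiffness < scene["ball"].stiffness)
    ```

    Because every knob is a real number, "slightly stiffer jelly" is simply a slightly larger `stiffness` value, something no fixed library of material cards can express.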

  • Step 3: The "Tuning" Process (The Magic)
    This is the core innovation. MOSIV runs the simulation and compares the result to the real video.

    • Simulation says: "The jelly should bounce here."
    • Video says: "No, it squished there."
    • MOSIV's reaction: "Okay, I need to turn the 'stiffness' knob down a tiny bit and the 'friction' knob up a tiny bit."
      It does this over and over, adjusting the "knobs" for each object individually, until the simulation matches the video as closely as possible.
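
    The tuning loop above is, at heart, gradient descent on simulation parameters. Here is a toy sketch of the idea (ours, not the paper's code): the whole "simulator" is collapsed into one function with a single `bounciness` knob, and we estimate the gradient by finite differences, whereas a differentiable simulator like the one MOSIV uses computes it exactly.

    ```python
    def simulate(bounciness: float, drop_height: float = 1.0) -> float:
        """Toy stand-in for a physics simulator: rebound height of a dropped ball."""
        return drop_height * bounciness

    def loss(bounciness: float, observed: float) -> float:
        """Squared difference between the simulated and observed rebound heights."""
        diff = simulate(bounciness) - observed
        return diff * diff

    def tune(observed: float, guess: float = 0.1, lr: float = 0.5, steps: int = 200) -> float:
        """Turn the knob a tiny bit at a time until simulation matches observation."""
        eps = 1e-5
        knob = guess
        for _ in range(steps):
            # Finite-difference estimate of d(loss)/d(knob).  A differentiable
            # simulator would return this gradient directly and exactly.
            grad = (loss(knob + eps, observed) - loss(knob - eps, observed)) / (2 * eps)
            knob -= lr * grad
        return knob

    recovered = tune(observed=0.64)  # the "video" says the ball rebounds to 0.64 m
    print(round(recovered, 3))
    ```

    MOSIV does the same thing at vastly larger scale: many knobs per object, many objects per scene, and a full simulator in place of `simulate` (all names here are hypothetical).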

3. Why is this special?

The paper highlights two main superpowers:

  • No More "Material Confusion": Because MOSIV looks at each object separately (even when they are touching), it doesn't get confused. It knows the sand is sand and the jelly is jelly, even when they are mashed together. It learns the exact recipe for that specific piece of sand and that specific blob of jelly.
  • Crystal Ball Prediction: Once MOSIV has figured out the exact "knobs" for the objects, it can predict the future.
    • Example: After watching a video of a rubber ball hitting a wall, MOSIV could predict what would happen if you threw a heavier ball, or if the wall were stickier, even though it never saw that specific scenario. In effect, it builds a custom physics engine for that specific scene.

4. The "Kitchen Test" (The Experiment)

To prove it works, the researchers built a virtual kitchen with 45 different scenarios involving 10 different shapes (like apples, pawns, bananas) and 5 different materials (elastic, plastic, liquid, sand, snow).

They compared MOSIV against the old "multiple-choice" methods.

  • The Old Methods: The simulations looked blurry, the objects melted into each other, and the predictions drifted off course after a few seconds.
  • MOSIV: The simulations were sharp, the objects kept their shape, and the predictions stayed accurate for a long time. It was like comparing a blurry, low-resolution photo to a 4K movie.

The Bottom Line

MOSIV is a new tool that lets computers learn the "secret recipe" of physical objects just by watching them move. Instead of guessing what something is made of, it measures the exact physics of every single item in a chaotic scene. This means we can eventually build robots that can handle messy, real-world tasks (like a robot chef cooking with sticky dough and slippery vegetables) or create video games where the physics feel incredibly real and unpredictable.