MOSIV: Multi-Object System Identification from Videos

The paper introduces MOSIV, a novel framework that leverages differentiable simulation and geometry-aligned objectives to identify continuous, per-object material parameters from videos of complex multi-object interactions, outperforming existing methods on a new synthetic benchmark.

Chunjiang Liu, Xiaoyuan Wang, Qingran Lin, Albert Xiao, Haoyu Chen, Shizheng Wen, Hao Zhang, Lu Qi, Ming-Hsuan Yang, Laszlo A. Jeni, Min Xu, Yizhou Zhao

Published 2026-03-09

Imagine you walk into a chaotic kitchen where a bowl of jelly, a bag of sand, and a rubber ball are bouncing off each other, sliding across the table, and squishing together.

If you were to record this with a video camera, could you figure out exactly how "squishy" the jelly is, how "gritty" the sand is, and how "bouncy" the rubber ball is? And more importantly, could you use that knowledge to predict exactly what would happen if you threw a third object into the mix?

This is the problem the paper MOSIV solves.

Here is the breakdown of what they did, using simple analogies:

1. The Problem: The "Guessing Game" of Physics

Previous methods for understanding physics from video were like playing a multiple-choice quiz.

  • The Old Way: Imagine a robot watching the jelly. It has a small library of "material cards" in its head: Card A: Jelly, Card B: Water, Card C: Clay. The robot looks at the video and guesses, "Hmm, it looks like Card A."
  • The Flaw: Real life isn't a multiple-choice quiz. The jelly might be slightly stiffer than the one on Card A, or the sand might be damp and behave like no card at all. When objects crash into each other (like the jelly hitting the sand), these guesses get muddled: the robot might decide the sand is actually jelly just because they are touching, producing a simulation that looks wrong and falls apart within seconds.

2. The Solution: MOSIV (The "Digital Twin" Maker)

The authors created a new system called MOSIV. Instead of guessing which card a material is, MOSIV acts like a master chef who tastes the food and measures the exact ingredients.

  • Step 1: The 4D Snapshot (The Camera)
    MOSIV watches the video from many angles (like having 11 cameras around the table). It builds a super-detailed, moving 3D model of every object. Think of this as creating a "digital twin" of the scene that captures exactly how the jelly wobbles and the sand shifts.

  • Step 2: The Physics Engine (The Simulator)
    Inside the computer, MOSIV runs a physics simulator. But instead of just guessing the material, it treats the physical properties (like stiffness, friction, and squishiness) as knobs it can turn.
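
    To make the "knobs" concrete, here is a minimal sketch (our illustration, not the paper's actual data structure) of how per-object material parameters could be stored as continuous values instead of a pick from a fixed menu of material cards:

    ```python
    from dataclasses import dataclass

    # Hypothetical container for one object's "knobs": continuous physical
    # parameters rather than a discrete material category.
    @dataclass
    class MaterialKnobs:
        stiffness: float   # how much the object resists squishing
        friction: float    # how much it grips surfaces it slides on
        density: float     # mass per unit volume

    # Each object in the scene gets its own independent set of knobs,
    # so touching objects can never "share" or confuse their materials.
    scene = {
        "jelly": MaterialKnobs(stiffness=0.02, friction=0.3, density=1.1),
        "sand":  MaterialKnobs(stiffness=0.50, friction=0.8, density=1.6),
        "ball":  MaterialKnobs(stiffness=5.00, friction=0.4, density=0.9),
    }

    print(scene["jelly"].stiffness < scene["ball"].stiffness)
    ```

    Because every knob is a real number, "slightly stiffer jelly" is simply a slightly larger `stiffness` value, something no fixed library of material cards can express.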

  • Step 3: The "Tuning" Process (The Magic)
    This is the core innovation. MOSIV runs the simulation and compares the result to the real video.

    • Simulation says: "The jelly should bounce here."
    • Video says: "No, it squished there."
    • MOSIV's reaction: "Okay, I need to turn the 'stiffness' knob down a tiny bit and the 'friction' knob up a tiny bit."
      It does this over and over, adjusting the "knobs" for each object individually, until the simulation matches the video as closely as possible.
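
    The tuning loop above is, at heart, gradient descent on simulation parameters. Here is a toy sketch of the idea (ours, not the paper's code): the whole "simulator" is collapsed into one function with a single `bounciness` knob, and we estimate the gradient by finite differences, whereas a differentiable simulator like the one MOSIV uses computes it exactly.

    ```python
    def simulate(bounciness: float, drop_height: float = 1.0) -> float:
        """Toy stand-in for a physics simulator: rebound height of a dropped ball."""
        return drop_height * bounciness

    def loss(bounciness: float, observed: float) -> float:
        """Squared difference between the simulated and observed rebound heights."""
        diff = simulate(bounciness) - observed
        return diff * diff

    def tune(observed: float, guess: float = 0.1, lr: float = 0.5, steps: int = 200) -> float:
        """Turn the knob a tiny bit at a time until simulation matches observation."""
        eps = 1e-5
        knob = guess
        for _ in range(steps):
            # Finite-difference estimate of d(loss)/d(knob).  A differentiable
            # simulator would return this gradient directly and exactly.
            grad = (loss(knob + eps, observed) - loss(knob - eps, observed)) / (2 * eps)
            knob -= lr * grad
        return knob

    recovered = tune(observed=0.64)  # the "video" says the ball rebounds to 0.64 m
    print(round(recovered, 3))
    ```

    MOSIV does the same thing at vastly larger scale: many knobs per object, many objects per scene, and a full simulator in place of `simulate` (all names here are hypothetical).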

3. Why is this special?

The paper highlights two main superpowers:

  • No More "Material Confusion": Because MOSIV looks at each object separately (even when they are touching), it doesn't get confused. It knows the sand is sand and the jelly is jelly, even when they are mashed together. It learns the exact recipe for that specific piece of sand and that specific blob of jelly.
  • Crystal Ball Prediction: Once MOSIV has figured out the exact "knobs" for the objects, it can predict the future.
    • Example: After watching a video of a rubber ball hitting a wall, MOSIV could predict what would happen if you threw a heavier ball, or if the wall were stickier, even though it never saw that specific scenario. In effect, it builds a custom physics engine for that specific scene.

4. The "Kitchen Test" (The Experiment)

To prove it works, the researchers built a virtual kitchen with 45 different scenarios involving 10 different shapes (like apples, pawns, bananas) and 5 different materials (elastic, plastic, liquid, sand, snow).

They compared MOSIV against the old "multiple-choice" methods.

  • The Old Methods: The simulations looked blurry, the objects melted into each other, and the predictions drifted off course after a few seconds.
  • MOSIV: The simulations were sharp, the objects kept their shape, and the predictions stayed accurate for a long time. It was like comparing a blurry, low-resolution photo to a 4K movie.

The Bottom Line

MOSIV is a new tool that lets computers learn the "secret recipe" of physical objects just by watching them move. Instead of guessing what something is made of, it measures the exact physics of every single item in a chaotic scene. This means we can eventually build robots that can handle messy, real-world tasks (like a robot chef cooking with sticky dough and slippery vegetables) or create video games where the physics feel incredibly real and unpredictable.