Imagine you are watching a home video of your friend doing parkour in their living room. They jump over a coffee table, slide across the floor, and sit on a sofa.
Now, imagine you want to take that video and turn it into a video game or a robot training simulation. You want a digital character to do exactly what your friend did, interacting with the digital furniture in a way that obeys the laws of physics (no falling through the floor, no floating in mid-air).
The problem? Most current computer programs are terrible at this. They try to build a 3D model of the room by scanning every single pixel, which results in a messy, noisy, "glitchy" digital room. If you try to run a robot in that messy room, it trips over invisible bumps, gets stuck in "ghost walls," or falls through the floor because the digital geometry is too imperfect.
Enter CRISP.
The authors of this paper (from Carnegie Mellon University) built a new system called CRISP (Contact-Guided Real2Sim). Think of CRISP as a smart architect that looks at your messy video and builds a clean, simplified, and sturdy digital playground for a robot to run in.
Here is how it works, broken down into three simple steps:
1. The "Lego" Approach (Planar Primitives)
Instead of trying to recreate the room with millions of tiny, jagged triangles (which produces a noisy, glitchy surface), CRISP looks at the room and says, "Okay, that's a flat floor, that's a flat wall, and that's a flat table top."
It breaks the complex scene down into about 50 simple, flat, box-like shapes (like giant Lego bricks).
- The Analogy: Imagine trying to build a model of a house. One way is to sculpt every brick and window individually out of clay (messy and heavy). The other way is to use pre-made, smooth blocks to represent the walls and floor. CRISP uses the blocks. This makes the digital world "clean" and easy for a physics engine to understand, so the robot doesn't trip over digital dust.
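To make the "Lego brick" idea concrete, here is a toy sketch (not the paper's actual algorithm, and all names are made up): take a noisy patch of scanned points that roughly lie on a horizontal surface, and replace it with a single flat, axis-aligned box primitive.

```python
# Toy sketch: collapse a noisy point-cloud patch into one flat,
# axis-aligned "Lego brick" (a thin box primitive).

def fit_box_primitive(points, thickness=0.05):
    """Fit a thin axis-aligned box to roughly planar points.

    points: list of (x, y, z) tuples assumed to lie near a
    horizontal surface (a floor or a table top).
    Returns (min_corner, max_corner) of a box whose top face sits
    at the average height of the points, smoothing out the noise.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    zs = [p[2] for p in points]
    top = sum(zs) / len(zs)  # average height = the one flat surface
    min_corner = (min(xs), min(ys), top - thickness)
    max_corner = (max(xs), max(ys), top)
    return min_corner, max_corner

# A noisy "table top": scanned heights jitter around 0.75 m.
noisy_patch = [(0.0, 0.0, 0.74), (1.0, 0.0, 0.76),
               (0.0, 0.5, 0.75), (1.0, 0.5, 0.75)]
lo, hi = fit_box_primitive(noisy_patch)
```

The physics engine now sees one perfectly flat box instead of thousands of jittery triangles, which is exactly why the robot stops tripping over "digital dust."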
2. The "Mind Reader" (Contact-Guided Completion)
In your video, your friend might sit on a chair, blocking the view of the chair's seat. A normal computer program would say, "I can't see the seat, so I'll leave a hole there." If a robot tries to sit on that hole, it will fall through.
CRISP uses a "mind reader" (an AI that understands human behavior) to guess what's hidden.
- The Analogy: If you see a person sitting down, you know there is a chair underneath them, even if you can't see it. CRISP uses this logic. It sees the person's posture and says, "Ah, they are sitting, so there must be a flat surface right there." It fills in the missing parts of the room so the robot has a solid place to stand or sit.
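The contact logic can be sketched in a few lines. This is a hypothetical toy version (the threshold, sizes, and function names are illustrative assumptions, not CRISP's actual model): if the body's pelvis is low relative to standing height, assume a sitting pose and place a flat seat primitive directly under it.

```python
# Toy sketch: infer a hidden support surface from a body pose.
# "If the person is sitting, there must be a seat under them."

def infer_seat_from_pose(pelvis_pos, sitting_threshold=0.55):
    """If the pelvis is low (below a standing-height threshold, in
    meters), assume a sitting pose and return a hypothetical flat
    seat primitive under the pelvis; otherwise return None."""
    px, py, pz = pelvis_pos
    if pz < sitting_threshold:
        # Made-up seat primitive: a 0.4 m square slab at pelvis height.
        return {"center": (px, py, pz), "size": (0.4, 0.4, 0.05)}
    return None

# Pelvis at 0.45 m -> sitting -> fill the hole with a seat.
seat = infer_seat_from_pose((1.2, 0.3, 0.45))
standing = infer_seat_from_pose((1.2, 0.3, 0.95))
```

The real system reasons about many contact types (feet, hands, pelvis), but the principle is the same: the human's posture tells you where solid geometry must exist, even when the camera never saw it.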
3. The "Stunt Double" (Reinforcement Learning)
Once CRISP has built the clean room and reconstructed the human motion, it doesn't stop there. It hires a digital stunt double (a simulated robot) to try to copy the video.
- The Analogy: Think of this like a dance instructor. The instructor (the AI) watches the video, then tries to teach a robot to dance. If the robot keeps tripping over a "ghost wall" in the simulation, the instructor knows the room model is wrong. The instructor tweaks the room model until the robot can dance perfectly without falling. This process ensures that the final 3D model is physically real.
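The feedback loop can be illustrated with a toy sketch. This is not reinforcement learning itself, just the scene-correction idea behind it, with a made-up update rule: when the stunt double's contacts keep disagreeing with a surface's estimated height, nudge the surface toward the height the motion actually uses.

```python
# Toy sketch: let the "stunt double" correct the room. If the
# tracked motion keeps making contact at a different height than
# the scanned surface, move the surface toward the motion.

def refine_surface_height(surface_z, contact_heights, lr=0.5, iters=20):
    """Iteratively pull the estimated surface height toward the
    average height at which the reference motion makes contact
    (a hypothetical update rule, not the paper's objective)."""
    target = sum(contact_heights) / len(contact_heights)
    for _ in range(iters):
        surface_z += lr * (target - surface_z)  # shrink the mismatch
    return surface_z

# The scan put the seat at 0.30 m, but the stunt double's contacts
# happen around 0.45 m; refinement converges toward the motion.
z = refine_surface_height(0.30, [0.44, 0.45, 0.46])
```

In the real system the "instructor" is a learned policy and a physics simulator rather than a one-line average, but the loop is the same: physical failure in simulation is the signal that the geometry is wrong.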
Why is this a big deal?
- It's 8x Better at Not Failing: Previous methods failed to simulate the motion correctly about 55% of the time (the robot would crash or glitch). CRISP only fails about 7% of the time.
- It's Super Fast: Because CRISP uses simple "Lego blocks" instead of millions of tiny triangles, the computer can run the simulation 43% faster. This means robots can learn new skills much quicker.
- It Works on "Wild" Videos: You don't need a special studio camera. You can use a shaky video from your phone, a video from the internet, or even a video generated by AI (like Sora), and CRISP can turn it into a working simulation.
In a nutshell:
CRISP takes a messy, real-world video and turns it into a clean, physics-perfect video game level. It does this by simplifying the room into flat blocks, guessing what's hidden behind people, and having a robot "test drive" the scene to make sure everything is solid. This opens the door for robots to learn from our daily lives and for us to create realistic AR/VR experiences instantly.