Imagine you have a messy, cluttered video of a real living room. You want to turn this video into a perfect, playable 3D video game level where a robot can walk around, pick up a backpack, and sit on a chair without falling through the floor.
The problem is that current AI tools are like two different specialists who don't talk to each other:
- The Photographer: Great at making things look real, but the objects are just hollow shells or floating ghosts.
- The Architect: Great at building stable structures, but it only knows how to build from a library of pre-made, generic furniture.
SimRecon is a new framework that acts as a master conductor, connecting these two worlds. It takes a messy video and builds a "simulation-ready" 3D world. It does this through a three-step pipeline: Perception → Generation → Simulation.
Here is how it works, using some creative analogies:
1. The Problem: The "Bad Angle" and the "Floating Chair"
If you try to build a 3D model of a backpack sitting on a chair just by looking at a photo, you might only see the front. The AI might guess the back is flat, or it might make the backpack look like it's melting.
- The Visual Problem: If the AI doesn't see the whole object, it generates a weird, deformed version.
- The Physics Problem: If you just drop the generated backpack into a game, it might float in mid-air or pass right through the chair because the AI didn't understand how things sit on top of each other.
2. The Solution: Two "Bridge" Modules
SimRecon builds two special bridges to fix these gaps.
Bridge #1: The "Smart Drone" (Active Viewpoint Optimization)
The Challenge: How do you get the perfect photo of a messy object to teach the AI how to build it?
The Old Way: The AI just picks a random photo from the video or looks at the object from a standard angle. If the object is hidden behind a lamp, the AI gets a bad photo and builds a broken backpack.
The SimRecon Way: Imagine a tiny, intelligent drone hovering around the object in the 3D space. Instead of just taking a picture, this drone asks: "Where should I stand to see the most hidden parts of this object?"
It mathematically calculates the best possible angle to maximize the information it gets. It finds a view that reveals the hidden back of the backpack, the side of the chair, etc. It then uses this "perfect photo" to instruct the 3D generator.
- Result: The generated backpack is complete, detailed, and looks exactly like the real one, not a distorted guess.
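The "smart drone" idea can be sketched as a scoring loop: sample candidate camera positions, estimate how much of the object's surface each one reveals, and pick the view that uncovers the most parts the original video never showed. The sketch below is illustrative, not SimRecon's actual method: the toy `visible` test uses back-face culling (a point counts as seen if its outward normal faces the camera), whereas a real system would ray-cast against the full scene to handle occluders like the lamp.

```python
def visible(cam, surface):
    """A surface point counts as visible from cam if its outward normal
    faces the camera -- a back-face-culling approximation; a real system
    would ray-cast against the whole scene to handle occluders."""
    seen = set()
    for i, (point, normal) in enumerate(surface):
        view = tuple(c - p for c, p in zip(cam, point))
        if sum(n * v for n, v in zip(normal, view)) > 0:
            seen.add(i)
    return seen

def best_new_view(candidates, surface, already_seen):
    """Score each candidate camera by how many *new* surface points it
    reveals, and return the highest-scoring one."""
    return max(candidates,
               key=lambda cam: len(visible(cam, surface) - already_seen))

# Toy object: four surface points with outward normals (front, back, sides).
surface = [
    ((0, 0, 1), (0, 0, 1)),    # front
    ((0, 0, -1), (0, 0, -1)),  # back
    ((1, 0, 0), (1, 0, 0)),    # right
    ((-1, 0, 0), (-1, 0, 0)),  # left
]

# The video only ever filmed the object from the front.
already_seen = visible((0, 0, 5), surface)

# Candidate viewpoints: the front again, side-on, and a rear-right diagonal.
candidates = [(0, 0, 5), (5, 0, 0), (3, 0, -4)]
best = best_new_view(candidates, surface, already_seen)
print(best)  # the rear-right diagonal wins: it reveals both back and side
```

Note the key property: the front view scores zero (nothing new), the side view scores one, and the diagonal rear view scores two, so the drone "stands" where the most hidden geometry becomes visible.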
Bridge #2: The "Master Builder's Blueprint" (Scene Graph Synthesizer)
The Challenge: Once you have perfect 3D models of a chair, a table, and a backpack, how do you put them together so they don't float or crash into each other?
The Old Way: You might try to drop them all into the game world and hope they land right, or use a "search" algorithm that tries millions of random positions until things stop crashing. This is slow and often results in weird physics (like a chair leaning at a 45-degree angle).
The SimRecon Way: Before building anything, SimRecon acts like a detective drawing a relationship map (a Scene Graph).
- It looks at the scene and asks: "What is holding what?"
- It learns: "The backpack is supported by the chair." "The picture is attached to the wall." "The chair is on the floor."
- It builds this map piece-by-piece, checking for conflicts (e.g., "Wait, if the table is on the chair, and the chair is on the floor, is that stable?").
- Result: When it finally builds the scene in the simulator, it follows this blueprint. It places the floor first, then the chair on the floor, then the backpack on the chair. It uses real physics to let the backpack "settle" naturally onto the chair, just like in real life.
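The "blueprint" step above is essentially a topological sort of a support graph: every supporter must be placed before the objects resting on it, and a cycle in the graph means the scene is physically impossible. A minimal sketch using Python's standard-library `graphlib` (the relation names are taken from the example above; SimRecon's actual data structures will differ):

```python
from graphlib import TopologicalSorter, CycleError

# Support relations extracted from the scene: child -> what holds it up.
supports = {
    "chair": "floor",
    "table": "floor",
    "backpack": "chair",
    "picture": "wall",
}

def placement_order(supports):
    """Place every supporter before the objects resting on it. A cycle
    (A supports B and B supports A) has no stable build order, which is
    exactly the kind of conflict the piece-by-piece check catches."""
    ts = TopologicalSorter()
    for child, parent in supports.items():
        ts.add(child, parent)  # parent must be placed before child
    try:
        return list(ts.static_order())
    except CycleError:
        raise ValueError("conflicting support relations: no stable build order")

order = placement_order(supports)
print(order)  # roots (floor, wall) come first, the backpack last
```

This mirrors the assembly described above: floor first, then the chair on the floor, then the backpack on the chair, with the physics engine handling the final "settle" at each step.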
3. The Final Result: From Video to Video Game
The whole process flows like this:
- Perception: The system watches your messy video and identifies the objects (a chair, a table, a backpack).
- Generation: The "Smart Drone" finds the best angles to generate perfect 3D models of those objects.
- Simulation: The "Master Builder" reads the relationship map and assembles the objects in a physics engine, ensuring everything sits, leans, or hangs exactly where it should.
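The three stages above compose into a simple pipeline. The sketch below is purely illustrative: the stage names, the `SceneObject` record, and the stub return values are assumptions made for the example, not SimRecon's real interfaces.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SceneObject:
    name: str
    supported_by: Optional[str] = None  # filled in by Perception
    mesh: Optional[str] = None          # filled in by Generation

def perception(video_frames) -> List[SceneObject]:
    """Stage 1 (stubbed): detect objects and their support relations."""
    return [SceneObject("floor"),
            SceneObject("chair", supported_by="floor"),
            SceneObject("backpack", supported_by="chair")]

def generation(objects: List[SceneObject]) -> List[SceneObject]:
    """Stage 2 (stubbed): generate a 3D mesh per object from its best view."""
    for obj in objects:
        obj.mesh = f"{obj.name}.glb"  # hypothetical asset filename
    return objects

def simulation(objects: List[SceneObject]) -> List[str]:
    """Stage 3: place supporters first, then the objects resting on them."""
    placed, order, pending = set(), [], list(objects)
    while pending:
        progress = False
        for obj in list(pending):
            if obj.supported_by is None or obj.supported_by in placed:
                order.append(obj.name)
                placed.add(obj.name)
                pending.remove(obj)
                progress = True
        if not progress:
            raise ValueError("unsupported or cyclic support relation")
    return order

scene = simulation(generation(perception(video_frames=[])))
print(scene)  # floor first, then chair, then backpack
```

The point of the sketch is the data flow: each stage enriches the same object records, and the final stage consumes the support relations the first stage produced.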
Why This Matters
Previously, turning a real video into a game level was like trying to build a house by gluing together photos of bricks. The result looked okay from a distance, but if you tried to walk through the door, you'd fall through the floor.
SimRecon changes the game. It creates a world that is not only visually faithful (it looks real) but also physically plausible (it acts real). This means robots can be trained in these AI-generated worlds and then sent to the real world with a much higher chance of success, because the "training ground" actually makes sense physically.
In short: SimRecon is the ultimate translator that turns a chaotic real-world video into a clean, stable, and playable 3D universe.