Towards Geometric and Textural Consistency 3D Scene Generation via Single Image-guided Model Generation and Layout Optimization

Imagine you are an architect who has been handed a single, flat photograph of a messy living room. Your job is to build a perfect, 3D virtual replica of that room so you can walk around inside it.

The problem? In the photo, the coffee table is hiding half of the sofa, and a lamp is blocking the view of a bookshelf. If you just try to guess what's behind those objects, you might build a sofa that's missing a leg or a bookshelf that's floating in mid-air.

This paper presents a new "architect" (an AI system) that solves this problem using a three-step "Divide and Conquer" strategy. Here is how it works, explained simply:

Step 1: The "Puzzle Solver" (Instance Segmentation & Inpainting)

First, the AI looks at the photo and says, "Okay, I see a chair, a table, and a lamp." It cuts them out of the picture like puzzle pieces.

But here's the catch: some pieces are broken because other objects are covering them (occlusion).

The Analogy: Imagine trying to draw a picture of a person, but someone is standing in front of them, hiding their left arm. If you just trace what you see, the person will look like they have no left arm.
The Fix: Before building the 3D model, the AI uses a "smart painter" (an advanced AI tool) to fill in the missing parts. It guesses what the hidden arm looks like based on the rest of the body and the context. Now, it has a complete, perfect 2D drawing of every single object, with no holes.

Step 2: The "Sculptor's Choice" (3D Generation & Selection)

Now that the AI has perfect drawings of every object, it starts sculpting them in 3D.

The Analogy: Imagine you need a statue of a bear. You don't just make one; you make five different versions of the bear. One might be slightly too round, another too thin, and one might have the perfect pose.
The Fix: The AI generates multiple 3D candidates for each object. Then it compares each candidate against the shape it recovered from the original photo and asks, "Which of these five bears looks most like the one in the picture?" It keeps the closest match and discards the rest, so the 3D object is the best available fit for the photo rather than a merely plausible one.

Step 3: The "Furniture Arranger" (Layout Optimization)

Now the AI has a pile of perfect 3D objects (a sofa, a table, a lamp), but they are all floating in a void. It needs to put them back together exactly as they were in the photo.

The Analogy: Imagine you have a 3D model of a room, but the furniture is floating in the air. You need to slide the table forward, rotate the chair, and scale the lamp up or down so it fits perfectly.
The Fix: The AI uses a "double-check" system.
1. 3D Check: It looks at the 3D shapes and tries to match them to the depth (distance) of the original photo.
2. 2D Check: It takes a "snapshot" of its 3D arrangement from the camera's viewpoint and compares it to the original 2D photo. If the lamp lands in the wrong spot, or the table looks too big, it tweaks the placement.
- It keeps adjusting the position, rotation, and size of every object until the 3D scene, seen from the photo's viewpoint, lines up with the original image.

Why is this a big deal?

Previous methods tried to build the whole room at once, like trying to bake a cake by throwing all the ingredients into a bowl and hoping it turns out right. This often resulted in "glitchy" rooms where objects merged into each other or looked flat.

This new method is like baking a cake layer by layer:

1. Fix the ingredients (repair the hidden parts of objects).
2. Bake the perfect layers (generate the best 3D models).
3. Stack them perfectly (optimize the layout).

The Result: A 3D scene that is both geometrically accurate (the shapes are right) and texturally faithful (the colors and details match), even when the original photo was full of overlapping objects. It turns a flat, confusing picture into a navigable, realistic 3D world.

1. Problem Statement

The paper addresses the significant challenge of generating high-quality, coherent 3D scenes from a single RGB image. While recent advancements in 3D generation have excelled at creating individual objects, they struggle with multi-object scenarios due to:

Geometric Ambiguity: Single-view inputs often lead to incomplete geometries and inconsistent textures, especially for occluded regions.
Multi-Object Entanglement: Current methods often treat occluded objects as a single entity or fail to separate distinct instances, leading to loss of detail and structural errors.
Layout Inconsistency: Existing compositional synthesis methods frequently suffer from incorrect depth estimation, resulting in abnormal object placement, orientation, and scale relative to the input image.
Lack of Explicit Geometry: Many approaches rely on implicit representations (like NeRFs) or lack precise control over the spatial arrangement of generated assets.

2. Methodology

The authors propose a three-stage framework that decomposes the complex scene generation task into collaborative subtasks: Instance Segmentation & Generation, Point Cloud Extraction & Model Selection, and Layout Optimization.

Stage 1: Instance Segmentation and Generation

Detection & Segmentation: The pipeline first uses object detection (Grounding DINO) and segmentation (SAM) to identify foreground objects, generating bounding boxes, semantic labels, and binary masks.
Inpainting: To handle occlusions, the system employs a Vision-Language Model (GPT-4o) to inpaint the segmented instance images. This repairs missing details caused by mutual occlusion, ensuring the input to the 3D generator is structurally complete.
3D Asset Generation: Using the inpainted images, a generative model (Trellis) produces multiple candidate 3D assets (meshes and point clouds) for each object, ensuring high-fidelity geometry and texture.
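As a rough illustration of what happens between segmentation and inpainting, the sketch below isolates one object from an image using its binary mask and crops it to the mask's bounding box. It is a minimal NumPy sketch, not the authors' code: the function name `crop_instance`, the white background fill, and the toy image are assumptions, and the calls to Grounding DINO, SAM, GPT-4o, and Trellis are deliberately left out.

```python
import numpy as np

def crop_instance(image, mask, background=255):
    """Blank out everything outside the mask, then crop to the mask's bounding box.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean array.
    Hypothetical helper for illustration only.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty")
    y0, y1 = int(ys.min()), int(ys.max()) + 1
    x0, x1 = int(xs.min()), int(xs.max()) + 1
    # Pixels outside the mask are replaced by a flat background color.
    isolated = np.where(mask[..., None], image, background).astype(image.dtype)
    return isolated[y0:y1, x0:x1], (x0, y0, x1, y1)

# Toy example: a 3x3 red object with one pixel hidden by an occluder.
image = np.zeros((6, 8, 3), dtype=np.uint8)
image[2:5, 3:6] = (200, 50, 50)
mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:6] = True
mask[3, 4] = False  # hole left by the occluding object

crop, box = crop_instance(image, mask)
print(crop.shape, box)
```

The hole left at the occluded pixel is exactly the kind of gap the inpainting step is meant to fill before the crop is handed to the 3D generator.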

Stage 2: Point Cloud Extraction & Model Selection

Pseudo-Stereo Depth Estimation: To recover scene geometry without ground-truth depth, the method constructs a "pseudo-stereo" pair by pairing the original image with a copy of itself. It uses DUSt3R (a pre-trained stereo 3D reconstruction model) to estimate camera parameters, a depth map, and a global scene point cloud.
Instance Point Clouds: The global point cloud is segmented using the masks from Stage 1 to extract independent point clouds ( $PC_i$ ) for each object instance.
Model Selection: Since Stage 1 generates $K$ candidate models per object, a selection strategy is employed. The system calculates the Chamfer Distance between the candidate model point clouds and the extracted instance point clouds ( $PC_i$ ). The model with the minimum distance is selected as the optimal asset ( $M_i$ ) for the scene.
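The extraction and selection steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the paper's implementation: it assumes the scene geometry is available as a per-pixel point map of shape (H, W, 3), that the candidates are already expressed in the same frame and scale as the instance point cloud, and that the Chamfer Distance is the symmetric mean of nearest-neighbor Euclidean distances (other variants use squared distances). All function names are hypothetical.

```python
import numpy as np

def extract_instance_points(pointmap, mask):
    """Select the 3D points whose pixels fall inside an instance mask.

    pointmap: (H, W, 3) per-pixel 3D points; mask: (H, W) boolean array.
    """
    return pointmap[mask]

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def select_best_candidate(candidates, instance_points):
    """Return the index of the candidate closest to the instance point cloud."""
    distances = [chamfer_distance(c, instance_points) for c in candidates]
    return int(np.argmin(distances)), distances

# Toy scene: a 4x4 point map on the plane z = 1, with a 2x2 instance mask.
ys, xs = np.mgrid[0:4, 0:4]
pointmap = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
pc_i = extract_instance_points(pointmap, mask)

# Three made-up candidates: shifted, nearly identical, and badly scaled.
candidates = [pc_i + 0.5, pc_i + 0.01, pc_i * 3.0]
best, dists = select_best_candidate(candidates, pc_i)
print(best, dists)
```

The brute-force pairwise distance matrix is fine for small point sets; for dense clouds a nearest-neighbor structure such as a k-d tree would be the usual substitute.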

Stage 3: Layout Optimization

Parameterization: Each selected 3D asset is parameterized by learnable translation ( $T$ ), rotation ( $R$ ), and scaling ( $S$ ) parameters.
Joint Optimization: The system optimizes these parameters to align the 3D assets with the original image layout. The loss function combines two constraints:
1. 3D Chamfer Distance ( $L_{3D}$ ): Minimizes the distance between the generated model point cloud and the extracted instance point cloud in 3D space.
2. 2D Projection Chamfer Distance ( $L_{2D}$ ): Projects both the model and the instance point clouds onto the 2D image plane using the estimated camera parameters and minimizes the distance between them.
Result: This dual-space optimization ensures that the final scene maintains both geometric accuracy in 3D and visual consistency with the 2D input image.
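To make the joint loss concrete, here is a minimal NumPy sketch that evaluates a combined 3D and 2D Chamfer loss for one object under a given scale, rotation, and translation. Several details are assumptions rather than facts from the paper: rotation is restricted to a single yaw angle about the vertical axis, the camera is a pinhole with intrinsics `K`, the two terms are combined with hypothetical weights `w3d` and `w2d`, and the sketch only evaluates the loss instead of running an optimizer.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance; works for 2D or 3D point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def yaw_matrix(theta):
    """Rotation about the y axis (assumed here to be the vertical axis)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def transform(points, scale, theta, translation):
    """Apply p' = scale * R p + t to every point."""
    return scale * points @ yaw_matrix(theta).T + translation

def project(points, K):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def layout_loss(model_pts, instance_pts, scale, theta, translation, K,
                w3d=1.0, w2d=1.0):
    """Weighted sum of the 3D and projected 2D Chamfer terms (weights are assumptions)."""
    placed = transform(model_pts, scale, theta, translation)
    l3d = chamfer(placed, instance_pts)
    l2d = chamfer(project(placed, K), project(instance_pts, K))
    return w3d * l3d + w2d * l2d

# Toy data: the "observed" instance is the model under known true parameters.
rng = np.random.default_rng(0)
model = rng.uniform(-0.5, 0.5, size=(64, 3))
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.5, 0.0, 5.0])
instance = transform(model, 2.0, 0.3, t_true)

loss_init = layout_loss(model, instance, 1.0, 0.0, np.array([0.0, 0.0, 4.0]), K)
loss_true = layout_loss(model, instance, 2.0, 0.3, t_true, K)
print(loss_init, loss_true)
```

Because the 2D term is measured in pixels and the 3D term in scene units, their magnitudes can differ by orders of magnitude, so the relative weighting is a real design choice in any implementation of this idea.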

3. Key Contributions

Modular Three-Stage Framework: A novel pipeline that successfully decouples object generation from scene assembly, allowing for the extraction of multiple independent 3D assets with explicit geometry and high-quality textures from a single image.
Asset Generation-Selection Strategy: An integrated approach combining image inpainting (to fix occlusions) and a Chamfer-distance-based model selection mechanism. This ensures the generated 3D assets best match the reference image instances, overcoming reconstruction failures caused by occlusion.
3D-2D Joint Layout Optimization: A technique that leverages both 3D point cloud matching and 2D projection constraints. This effectively resolves the ambiguity of monocular depth estimation, ensuring precise spatial alignment and multi-view consistency.

4. Experimental Results

The method was evaluated on a custom dataset containing multi-object scenes with varying degrees of occlusion (real photos, VLM-generated images, and 3D-FRONT synthetic scenes).

Quantitative Performance: The proposed method outperformed state-of-the-art baselines (MIDI, Zhou et al., Gen3DSR, CAST) across all metrics:
- CLIP-Score: Highest scores for both geometry (0.8389) and color/texture (0.8990), indicating strong correlation with the reference image.
- Chamfer Distance: Lowest error in both 3D space (0.0127) and 2D projection (4.9264), demonstrating superior spatial accuracy.
- F-Score: Highest reconstruction accuracy (76.60 in 3D, 44.12 in 2D).
Qualitative Analysis: Visual comparisons showed that while other methods suffered from shape distortion, missing textures, or incorrect object placement, the proposed method maintained structural integrity and correct layout.
Ablation Studies:
- Removing Inpainting led to redundant geometries and pose errors.
- Removing Model Selection introduced incompatible assets.
- Removing either the 3D or 2D loss resulted in unstable convergence and misalignment, proving the necessity of the joint optimization strategy.
User Study: In a study with 40 volunteers, the proposed method was preferred in 55% of cases, outperforming existing approaches including CAST.
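For readers unfamiliar with the F-Score reported above, the sketch below shows the definition commonly used for point clouds: the harmonic mean of precision and recall, where a point counts as matched if its nearest neighbor in the other set lies within a distance threshold. The threshold `tau` and the toy points are assumptions; the summary does not state which threshold the paper uses, and the reported values appear to be percentages.

```python
import numpy as np

def f_score(pred, gt, tau):
    """F-score between predicted and ground-truth point sets at threshold tau."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()  # predicted points near the ground truth
    recall = (d.min(axis=0) < tau).mean()     # ground-truth points near the prediction
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

# Toy example: three of four points coincide, one prediction is far off.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
pred = gt.copy()
pred[3] = [5.0, 5.0, 0.0]

score = f_score(pred, gt, tau=0.1)
print(score)  # 0.75
```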

5. Significance and Future Work

Significance: This work bridges the gap between single-object 3D generation and complex scene synthesis. By explicitly handling occlusions through inpainting and enforcing geometric consistency through dual-space optimization, it enables the creation of immersive, photorealistic 3D environments from a single photo. This has high potential for applications in XR (Extended Reality), embodied intelligence, autonomous navigation, and digital content creation.
Limitations & Future Directions:
- Severe Occlusion: Performance degrades when object overlap (IoU) exceeds 25%.
- Background Handling: The current pipeline treats the background as non-interactive; future work aims to decouple background depth for complex scene modeling.
- Texture Refinement: While geometric consistency is high, future iterations will focus on optimizing texture mapping and material properties to address exposure issues.
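The 25% overlap figure in the limitations can be checked with a simple intersection-over-union computation. The sketch below assumes the overlap is measured between axis-aligned bounding boxes given as (x0, y0, x1, y1); the summary does not say whether the paper measures IoU on boxes or on masks, so treat this as one plausible reading. The function name and the example boxes are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))  # overlap width
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))  # overlap height
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

sofa = (0, 0, 4, 4)
table_near = (2, 0, 6, 4)   # heavy overlap with the sofa
table_far = (3, 0, 7, 4)    # mild overlap with the sofa

heavy = box_iou(sofa, table_near)
mild = box_iou(sofa, table_far)
print(heavy > 0.25, mild > 0.25)  # True False
```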

In conclusion, the paper presents a robust solution for single-image 3D scene generation that prioritizes both the fidelity of individual objects and the coherence of the overall scene layout, setting a new standard for compositional 3D synthesis.
