Towards Geometric and Textural Consistency 3D Scene Generation via Single Image-guided Model Generation and Layout Optimization

This paper proposes a novel three-stage framework that generates geometrically and texturally consistent 3D scenes from a single RGB image by combining instance segmentation and inpainting for asset recovery, pseudo-stereo geometry for camera estimation, and layout optimization via Chamfer distance minimization to ensure precise alignment with the input.

Xiang Tang, Ruotong Li, Xiaopeng Fan

Published 2026-02-18

Imagine you are an architect who has been handed a single, flat photograph of a messy living room. Your job is to build a perfect, 3D virtual replica of that room so you can walk around inside it.

The problem? In the photo, the coffee table is hiding half of the sofa, and a lamp is blocking the view of a bookshelf. If you just try to guess what's behind those objects, you might build a sofa that's missing a leg or a bookshelf that's floating in mid-air.

This paper presents a new "architect" (an AI system) that solves this problem using a three-step "Divide and Conquer" strategy. Here is how it works, explained simply:

Step 1: The "Puzzle Solver" (Instance Segmentation & Inpainting)

First, the AI looks at the photo and says, "Okay, I see a chair, a table, and a lamp." It cuts them out of the picture like puzzle pieces.

But here's the catch: some pieces are broken because other objects are covering them (occlusion).

  • The Analogy: Imagine trying to draw a picture of a person, but someone is standing in front of them, hiding their left arm. If you just trace what you see, the person will look like they have no left arm.
  • The Fix: Before building the 3D model, the AI uses a "smart painter" (an inpainting model) to fill in the missing parts. It guesses what the hidden arm looks like based on the rest of the body and the context. Now it has a complete 2D picture of every object, with the hidden regions filled in.
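To make the "puzzle piece" idea concrete, here is a minimal NumPy sketch of the cut-out step only. It assumes the segmentation masks already exist as boolean arrays; the function name `cut_out_object` and the `occluders` argument are hypothetical, and the paper's actual segmentation and inpainting models are not reproduced here. The sketch returns the object on a transparent background plus a mask of the pixels, inside the object's visible bounding box, that are covered by other objects and that an inpainting model would be asked to fill.

```python
import numpy as np

def cut_out_object(image, mask, occluders):
    """Crop one object and flag the covered pixels near it.

    image:     (H, W, 3) uint8 photo
    mask:      (H, W) bool, True where the object is visible
    occluders: (H, W) bool, True where other objects are visible
    All three inputs are assumed to come from an earlier segmentation step.
    """
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1

    crop = image[y0:y1, x0:x1]
    visible = mask[y0:y1, x0:x1].astype(bool)

    # Keep colour only where the object is visible; alpha marks the same pixels.
    rgb = np.where(visible[..., None], crop, 0).astype(np.uint8)
    alpha = visible.astype(np.uint8) * 255
    rgba = np.dstack([rgb, alpha])

    # Candidate holes: pixels in the crop covered by another object.
    holes = occluders[y0:y1, x0:x1].astype(bool) & ~visible
    return rgba, holes
```

Note that the crop uses the bounding box of the visible pixels, so a hidden part that extends beyond that box would need a larger crop; this is a simplification of the sketch.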

Step 2: The "Sculptor's Choice" (3D Generation & Selection)

Now that the AI has perfect drawings of every object, it starts sculpting them in 3D.

  • The Analogy: Imagine you need a statue of a bear. You don't just make one; you make five different versions of the bear. One might be slightly too round, another too thin, and one might have the perfect pose.
  • The Fix: The AI generates multiple 3D candidates for each object. Then, it looks back at the original photo and asks, "Which of these five bears looks most like the one in the picture?" It picks the best match and discards the rest. This way the 3D object isn't just "okay," but the closest available match to the photo.
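The "pick the best bear" step can be sketched as a simple scoring loop. The summary does not say which similarity measure the paper uses, so this example assumes silhouette overlap (intersection over union between the object's mask in the photo and each candidate's rendered mask) purely for illustration; `mask_iou` and `pick_best_candidate` are hypothetical names, and the candidate masks are assumed to have been rendered already.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks of the same shape."""
    a = a.astype(bool)
    b = b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(a, b).sum() / union)

def pick_best_candidate(target_mask, candidate_masks):
    """Return the index of the candidate whose silhouette best matches the target."""
    scores = [mask_iou(target_mask, m) for m in candidate_masks]
    best = int(np.argmax(scores))
    return best, scores
```

A real system would likely also compare appearance (colour and texture), not just outlines; the loop structure stays the same with a different scoring function.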

Step 3: The "Furniture Arranger" (Layout Optimization)

Now the AI has a pile of perfect 3D objects (a sofa, a table, a lamp), but they are all floating in a void. It needs to put them back together exactly as they were in the photo.

  • The Analogy: Imagine you have a 3D model of a room, but the furniture is floating in the air. You need to slide the table forward, rotate the chair, and scale the lamp up or down so it fits perfectly.
  • The Fix: The AI uses a "double-check" system.
    1. 3D Check: It compares the 3D shapes against depth (distance) estimated from the original photo.
    2. 2D Check: It takes a "snapshot" of its 3D arrangement from the estimated camera position and compares it to the original 2D photo. If the lamp lands in the wrong spot, or the table looks too big, it tweaks the placement.
    • It keeps adjusting the position, rotation, and size of every object until the mismatch (which the paper measures with a Chamfer distance) is as small as it can get, so the 3D scene lines up with the photo when seen from the same viewpoint.
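The paper's summary names Chamfer distance as the quantity being minimized during layout optimization. The sketch below shows one common definition (mean squared nearest-neighbour distance, summed over both directions) together with a deliberately simplified fitting loop. Everything beyond the Chamfer score is an assumption made for brevity: the loop fits only a uniform scale and a translation (no rotation), and it uses an ICP-style alternation (match each point to its nearest neighbour, then solve for scale and shift in closed form) as a stand-in for whatever optimizer the authors use. The function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distance between every point in a and every point in b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff * diff, axis=-1)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance using squared nearest-neighbour distances."""
    d = pairwise_sq_dists(a, b)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def fit_scale_translation(source, target, iters=10):
    """Estimate scale and shift so that scale * source + shift lands on target."""
    scale = 1.0
    shift = np.zeros(source.shape[1])
    src_mean = source.mean(axis=0)
    src_c = source - src_mean
    for _ in range(iters):
        moved = scale * source + shift
        # Match every moved source point to its nearest target point.
        nearest = target[pairwise_sq_dists(moved, target).argmin(axis=1)]
        # Closed-form least-squares scale and shift for those matches.
        tgt_mean = nearest.mean(axis=0)
        tgt_c = nearest - tgt_mean
        scale = float(np.sum(src_c * tgt_c) / np.sum(src_c * src_c))
        shift = tgt_mean - scale * src_mean
    return scale, shift

# Toy example: a 3x3x3 grid of points and a scaled, shifted copy of it.
axis = np.array([-1.0, 0.0, 1.0])
source = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
target = 1.2 * source + np.array([0.3, -0.2, 0.1])

before = chamfer_distance(source, target)
scale, shift = fit_scale_translation(source, target)
after = chamfer_distance(scale * source + shift, target)
print(before, after, scale, shift)
```

Because the functions work on points of any dimension, the same score could be applied to 2D points (for the "snapshot" check) as well as 3D points (for the depth check). Nearest-neighbour matching like this only works when the starting guess is reasonably close, which is one reason real pipelines pair it with a good initial camera and depth estimate.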

Why is this a big deal?

Previous methods tried to build the whole room at once, like trying to bake a cake by throwing all the ingredients into a bowl and hoping it turns out right. This often resulted in "glitchy" rooms where objects merged into each other or looked flat.

This new method is like baking a cake layer by layer:

  1. Fix the ingredients (repair the hidden parts of objects).
  2. Bake the perfect layers (generate the best 3D models).
  3. Stack them perfectly (optimize the layout).

The Result: A 3D scene that is both geometrically consistent (the shapes and positions match the photo) and texturally consistent (the colors and details match), even when the original photo was full of overlapping objects. It turns a flat, confusing picture into a navigable, realistic 3D world.
