ZeroScene: A Zero-Shot Framework for 3D Scene Generation from a Single Image and Controllable Texture Editing

ZeroScene is a novel zero-shot framework that leverages large vision models to reconstruct coherent 3D scenes from a single image and enables controllable, multi-view consistent texture editing through joint optimization of 3D/2D losses and mask-guided diffusion strategies.

Xiang Tang, Ruotong Li, Xiaopeng Fan

Published 2026-02-18

Imagine you have a single photograph of a messy room: a teddy bear sitting on a wooden table, a cup nearby, and a poster on the wall. Now, imagine you want to turn that flat, 2D photo into a fully playable 3D video game level where you can walk around, pick up the bear, and even change the cup's design to look like a golden trophy.

That is exactly what ZeroScene does, but it does it with a trick called "Zero-Shot" learning. Instead of being trained on thousands of example scenes for this specific task, it borrows the knowledge already baked into large pretrained vision models, looks at your picture, and figures everything out on the fly.

Here is how the system works, broken down into simple steps with some creative analogies:

1. The "Cut and Paste" Detective (Scene Reconstruction)

When you show ZeroScene a photo, it doesn't just see a flat image. It acts like a super-smart detective that can see through the clutter.

  • The Problem: In a photo, objects often hide behind each other (occlusion). If a teddy bear is half-hidden by a table, a normal computer might think the bear is broken or missing a leg.
  • The ZeroScene Solution: It first "cuts out" every object it sees (the bear, the cup, the table). Then, it uses a powerful AI "imagination engine" (a Large Vision Model) to fill in the missing parts. It's like looking at a torn puzzle piece and instantly knowing what the rest of the picture looks like, so it can "heal" the image of the bear before turning it into 3D.
  • The Background: Most 3D tools ignore the background (the walls and floor) because they are hard to model. ZeroScene treats the background as a solid foundation. It removes the objects, figures out the shape of the room, and then puts the objects back in their exact correct spots.
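The "cut out, then heal" idea can be sketched in a few lines of NumPy. This is a toy stand-in, not the paper's pipeline: the real system uses learned segmentation and a large vision model for inpainting, while here a hypothetical `heal` function simply fills the occluded pixels with the mean color of the visible ones.

```python
import numpy as np

def cut_out(image, mask):
    """Extract an object's visible pixels using its segmentation mask."""
    obj = np.zeros_like(image)
    obj[mask] = image[mask]
    return obj

def heal(obj, visible_mask, occluded_mask):
    """Toy stand-in for a learned inpainter: fill occluded pixels
    with the mean color of the object's visible region."""
    healed = obj.copy()
    mean_color = obj[visible_mask].mean(axis=0)
    healed[occluded_mask] = mean_color
    return healed

# Toy 4x4 RGB image: a red "bear" whose lower row is hidden by a table.
image = np.zeros((4, 4, 3))
image[:2, :2] = [1.0, 0.0, 0.0]                      # visible part of the bear
visible = np.zeros((4, 4), dtype=bool); visible[:2, :2] = True
occluded = np.zeros((4, 4), dtype=bool); occluded[2, :2] = True

bear = cut_out(image, visible)
bear_full = heal(bear, visible, occluded)            # occluded leg now "healed"
```

A real amodal-completion model would hallucinate plausible structure and texture, not just an average color, but the control flow (segment, isolate, fill the hidden region) is the same.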

2. The "3D Jigsaw Puzzle" (Layout Optimization)

Once the AI has created 3D models of the bear, the cup, and the table, they might be floating in space or sized incorrectly. They need to be arranged just like they were in the original photo.

  • The Analogy: Imagine you have a bunch of 3D printed toys and you want to arrange them on a table to match a photo. You might guess the position, but it's hard to get the depth right.
  • The ZeroScene Solution: It uses a "double-check" system. It projects the 3D models back onto a 2D screen and compares them to the original photo. If the bear looks too big or too far away, the system nudges it until the projected scene lines up with the photo pixel for pixel. It's like tuning a radio until the static disappears and the music is crystal clear.
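The "project, compare, nudge" loop above is just gradient descent on a reprojection loss. Here is a minimal sketch under toy assumptions: one object with a known physical size, a pinhole camera, and finite-difference gradients instead of the paper's differentiable rendering. All names (`project`, `fit_layout`) are hypothetical.

```python
import numpy as np

def project(pos, size, f=1.0):
    """Pinhole camera: 3D position -> 2D screen point plus apparent size."""
    x, y, z = pos
    return np.array([f * x / z, f * y / z, f * size / z])

def fit_layout(observed, size, steps=5000, lr=0.5):
    """Nudge a 3D position until its projection matches what the photo
    shows, using finite-difference gradients on the 2D error."""
    pos = np.array([0.5, 0.5, 2.0])                  # rough initial guess
    for _ in range(steps):
        grad = np.zeros(3)
        for i in range(3):
            d = np.zeros(3); d[i] = 1e-4
            hi = np.sum((project(pos + d, size) - observed) ** 2)
            lo = np.sum((project(pos - d, size) - observed) ** 2)
            grad[i] = (hi - lo) / 2e-4
        pos -= lr * grad
    return pos

true_pos, size = np.array([0.3, -0.2, 3.0]), 0.5
observed = project(true_pos, size)                   # what the photo "sees"
est = fit_layout(observed, size)                     # nudged toward true_pos
```

The apparent-size term is what pins down depth: a bear that is too far away projects too small, so the loss pushes it closer, exactly the ambiguity the analogy describes.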

3. The "Magic Paintbrush" (Texture Editing)

Now that you have a 3D scene, what if you want to change the bear's fur to look like a shiny metal robot, or turn the wooden table into a glass surface?

  • The Problem: If you just paint a 3D object, the texture often looks weird when you walk around it (like a sticker peeling off). It might look great from the front but blurry from the side.
  • The ZeroScene Solution: It uses a "Mask-Guided Progressive" strategy. Think of this as painting a mural on a curved wall. Instead of painting the whole wall at once, you paint a small section, then move to the next, making sure the new paint blends perfectly with the old paint.
  • The Result: You can type "Make the cup look like a pink heart," and ZeroScene repaints the cup view after view, blending each new angle with the ones already painted, so the heart pattern wraps around the cup smoothly without any seams or glitches.
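The mural analogy can be made concrete with a toy 1D "texture" painted one view at a time. This is only a sketch of the bookkeeping: a boolean mask tracks which texels are already painted, each view fills only its gaps, and a crude 50/50 blend stands in for the diffusion model's mask-guided inpainting at the seams.

```python
import numpy as np

def progressive_paint(n_texels, views, paint_view):
    """Paint a texture view by view. A mask records painted texels;
    each new view fills its unpainted texels and blends 50/50 where it
    overlaps earlier paint (a toy stand-in for mask-guided diffusion)."""
    texture = np.zeros(n_texels)
    painted = np.zeros(n_texels, dtype=bool)
    for lo, hi in views:
        fresh = paint_view(lo, hi)                   # new paint for this view
        seg = slice(lo, hi)
        keep = painted[seg]                          # texels with old paint
        texture[seg] = np.where(keep,
                                0.5 * texture[seg] + 0.5 * fresh,  # blend seam
                                fresh)                             # fill gap
        painted[seg] = True
    return texture

# Two hypothetical camera views that overlap on texels 4-5; each view
# "paints" a constant shade, standing in for a generated image patch.
views = [(0, 6), (4, 10)]
tex = progressive_paint(10, views, lambda lo, hi: 1.0 if lo == 0 else 2.0)
# Texels 4-5 end up halfway between the two views' shades: no hard seam.
```

A real system blends in the diffusion model's latent space and reprojects each painted view onto the 3D surface, but the mask-then-blend ordering is the core of the "progressive" idea.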

4. The "Realism Booster" (PBR Materials)

Finally, to make the scene look like a movie and not a cartoon, ZeroScene adds Physically Based Rendering (PBR).

  • The Analogy: Regular 3D models are like paper cutouts; they look flat even when they spin. PBR is like giving the objects a "skin" that reacts to light. It calculates how shiny the metal is, how rough the wood feels, and how light bounces off the glass.
  • The Result: When you shine a virtual light on your generated scene, the metal cup reflects the light realistically, and the wooden table casts a soft shadow, just like in the real world.
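To make the "skin that reacts to light" idea concrete, here is a deliberately simplified shading sketch in the spirit of the standard metal/roughness PBR workflow: Lambertian diffuse plus a Blinn-Phong highlight whose tightness comes from roughness. It is not the renderer's actual BRDF (which would use something like Cook-Torrance); the function name and parameters are illustrative.

```python
import numpy as np

def shade(normal, light, view, base_color, roughness, metallic):
    """Toy PBR-flavored shading: diffuse + roughness-controlled highlight.
    Metals (metallic near 1) tint the highlight with base_color and lose
    their diffuse term, mimicking the metal/roughness material model."""
    n, l, v = (x / np.linalg.norm(x) for x in (normal, light, view))
    h = (l + v) / np.linalg.norm(l + v)              # half vector
    diffuse = base_color * max(np.dot(n, l), 0.0) * (1.0 - metallic)
    shininess = 2.0 / max(roughness ** 2, 1e-4)      # rough -> broad highlight
    spec_tint = base_color if metallic > 0.5 else np.ones(3)
    specular = spec_tint * max(np.dot(n, h), 0.0) ** shininess
    return diffuse + specular

n = np.array([0.0, 0.0, 1.0])                        # surface facing the camera
gold = np.array([1.0, 0.8, 0.3])
# Light straight on: a shiny metal cup shows a bright, gold-tinted highlight.
peak = shade(n, np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]), gold, 0.1, 1.0)
# Light at 45 degrees: the tight highlight almost vanishes, so the metal
# looks dark off-peak, while a rough surface would stay softly lit.
off = shade(n, np.array([1.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]), gold, 0.1, 1.0)
```

This is why PBR parameters matter: the same object reads as chrome, brushed metal, or matte wood just by changing `roughness` and `metallic`, with no repainting of the texture itself.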

Why Does This Matter?

ZeroScene is like a universal translator between the 2D world (photos) and the 3D world (games, VR, robots).

  • For Gamers: You could take a photo of your living room and instantly turn it into a level for a video game.
  • For Robots: Robots need to practice in virtual worlds before entering real ones. ZeroScene can turn a photo of a messy kitchen into a perfect training simulation for a robot chef.
  • For Designers: You can take a sketch of a furniture arrangement and instantly see it as a realistic 3D room with editable textures.

In short, ZeroScene takes a single snapshot, understands the hidden 3D world inside it, fixes the missing pieces, and lets you remix the textures, all without needing a team of 3D artists or a massive database of training data. It turns a flat picture into a living, breathing 3D world.
