SeeThrough3D: Occlusion Aware 3D Control in Text-to-Image Generation

Imagine you are an architect trying to design a room using a magical AI painter. You give the AI a list of instructions: "Put a sofa here, a lamp there, and a cat on the rug."

In the past, if you asked the AI to draw this, it might put the lamp inside the sofa, or make the cat float in mid-air because it didn't understand that objects take up space and can block each other. It was like playing with 2D paper cutouts on a flat table; the AI didn't "get" that a sofa is a 3D block that hides things behind it.

SeeThrough3D is a new method that teaches the AI to treat a scene as 3D, and in particular to handle occlusion (when one object hides another). Here is how it works, broken down into simple concepts:

1. The "Ghost Box" Blueprint (OSCR)

The core idea is a new way of giving instructions to the AI, called OSCR (Occlusion-Aware 3D Scene Representation).

The Old Way: Imagine giving the AI a flat map with depth numbers. It's like trying to explain a sandwich by telling someone how thick the bread is, but not showing them where the cheese is. The AI gets confused about what is in front and what is behind.
The SeeThrough3D Way: Instead of a flat map, we give the AI a 3D blueprint made of "ghost boxes."
- Imagine you are building a scene with translucent (see-through) cardboard boxes.
- You place a box for the "sofa" and a box for the "cat."
- Because the boxes are see-through, the AI can see the cat through the sofa box, but it knows the sofa is physically in front.
- The Color Trick: To help the AI know which way the sofa is facing, the front of the box is painted orange, the left side blue, and the top green. This is like giving the AI a compass so it knows exactly how to rotate the object.

2. The "Name Tag" System (Attention Masking)

Sometimes, when you have a crowded room with a dog, a chair, and a table, the AI might get confused and paint the dog's face on the chair.

The Solution: The researchers added a "Name Tag" system.
Imagine every ghost box has a tiny invisible string attached to it. This string is tied to the specific word in your text prompt (e.g., the "dog" box is tied to the word "dog").
Even if the boxes overlap heavily, the AI looks at the string and says, "Ah, this part of the image belongs to the word 'dog,' and that part belongs to 'chair'." This prevents the AI from mixing up attributes (like giving the chair a tail).

3. The "Camera Operator"

Most AI art tools let you type a prompt, but they decide where the camera is.

SeeThrough3D lets you be the camera operator. Because the "ghost boxes" are placed in a virtual 3D space, you choose where the camera sits: a low angle looking up, for example, or a close view from the side. The AI then draws the scene from that viewpoint, keeping the perspective consistent with the boxes.

4. Training the AI: The "Virtual Sandbox"

You might wonder, "How did they teach the AI this?" They didn't just show it millions of photos.

They built a Virtual Sandbox (using the 3D software Blender).
They wrote a program that places 3D objects (chairs, cars, animals) in a scene and picks camera angles so that the objects overlap and block one another.
They took photos of these messy, overlapping scenes and taught the AI: "This is what a 'dog behind a chair' looks like."
Even though the training data was built from 3D models, the AI learned how objects hide one another, so it can now draw realistic photos of real-world objects doing the same thing.

Why is this a big deal?

Think of it like the difference between stacking flat cards and building with LEGOs.

Old methods were like stacking cards; if you put a card on top of another, the bottom one disappears completely.
SeeThrough3D is like LEGOs. You can build complex structures where parts are hidden, but the AI knows exactly how the pieces fit together in 3D space.

In summary: SeeThrough3D gives the AI a "see-through" 3D map with color-coded directions and name tags. This allows it to draw complex, crowded scenes where objects realistically hide behind one another, all while letting you control exactly where the camera is looking. It turns the AI from a flat painter into a 3D director.

1. Problem Statement

Current text-to-image (T2I) generation methods have made significant strides in 2D spatial control (e.g., bounding boxes, segmentation maps). However, they struggle with inherently 3D scene properties, specifically:

Precise 3D Layout: Controlling object size, orientation, and placement in 3D space.
Camera Viewpoint: Generating images from specific camera angles.
Occlusion Reasoning: The ability to synthesize partially hidden objects with depth-consistent geometry and scale.

Existing 3D-aware methods often rely on depth maps derived from 3D bounding boxes or 2D object layers. These approaches fail to model complex inter-object occlusions accurately, leading to geometric inconsistencies, incorrect object placements, or the collapse of 3D structure into flat planes. There is a critical lack of methods that can reason about what is hidden behind what while maintaining a coherent 3D scene.

2. Methodology

The authors propose SeeThrough3D, a framework that conditions a pre-trained flow-based T2I model (FLUX) on a novel scene representation to achieve occlusion-aware 3D control.

A. Occlusion-Aware 3D Scene Representation (OSCR)

The core innovation is the OSCR, a visual representation that encodes 3D layout and camera viewpoint in a single image:

Translucent 3D Boxes: Objects are represented as 3D bounding boxes rendered as translucent volumes. This transparency allows occluded objects to remain partially visible, providing the model with explicit cues about hidden regions and depth ordering.
Color-Coded Orientation: The faces of each 3D box are color-coded according to a predefined mapping (e.g., orange for front, blue for left). This encodes the 3D orientation of the object directly in the image space.
Camera Control: The entire OSCR layout is rendered from a specific camera viewpoint $C$, embedding the desired camera pose directly into the condition image.
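
To make the three ingredients above concrete, here is a minimal sketch of how one object's box could be turned into drawable OSCR-style faces. It is not the authors' renderer. The orange front and blue left faces follow the mapping given above, and green for the top follows the plain-language description earlier in this document; the remaining face colors, the alpha value, the focal length, the z-up world frame, and all function names are assumptions made for illustration.

```python
import math

# Front/left/top colors follow the text; the rest are assumed placeholders.
FACE_COLORS = {
    "front": (255, 165, 0), "left": (0, 0, 255), "top": (0, 128, 0),
    "back": (128, 128, 128), "right": (128, 128, 128), "bottom": (128, 128, 128),
}
ALPHA = 0.4  # assumed translucency, so occluded boxes stay partly visible

# Corner i uses bit 0 for +x (front), bit 1 for +y (left), bit 2 for +z (top).
FACES = {
    "front": (1, 3, 7, 5), "back": (0, 2, 6, 4),
    "left": (2, 3, 7, 6), "right": (0, 1, 5, 4),
    "top": (4, 5, 7, 6), "bottom": (0, 1, 3, 2),
}

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _norm(a):
    length = math.sqrt(_dot(a, a))
    return tuple(x / length for x in a)

def box_corners(center, size, yaw_deg):
    """Eight world-space corners of a box rotated by yaw about the z (up) axis."""
    c, s = math.cos(math.radians(yaw_deg)), math.sin(math.radians(yaw_deg))
    corners = []
    for i in range(8):
        x = (0.5 if i & 1 else -0.5) * size[0]
        y = (0.5 if i & 2 else -0.5) * size[1]
        z = (0.5 if i & 4 else -0.5) * size[2]
        corners.append((center[0] + x * c - y * s,
                        center[1] + x * s + y * c,
                        center[2] + z))
    return corners

def project(points, eye, target, focal=500.0, image_size=(512, 512)):
    """Pinhole projection. Assumes all points are in front of the camera and
    the camera is not looking straight up or down."""
    forward = _norm(_sub(target, eye))
    right = _norm(_cross(forward, (0.0, 0.0, 1.0)))
    up = _cross(right, forward)
    out = []
    for p in points:
        d = _sub(p, eye)
        depth = _dot(d, forward)
        px = image_size[0] / 2 + focal * _dot(d, right) / depth
        py = image_size[1] / 2 - focal * _dot(d, up) / depth
        out.append((px, py, depth))
    return out

def oscr_faces(center, size, yaw_deg, eye, target):
    """Return (face name, 2D polygon, RGBA, mean depth), sorted far to near,
    so drawing in order and alpha-blending gives a translucent box."""
    pts = project(box_corners(center, size, yaw_deg), eye, target)
    faces = []
    for name, idx in FACES.items():
        polygon = [(pts[i][0], pts[i][1]) for i in idx]
        depth = sum(pts[i][2] for i in idx) / 4
        faces.append((name, polygon, FACE_COLORS[name] + (ALPHA,), depth))
    faces.sort(key=lambda face: -face[3])
    return faces
```

Calling `oscr_faces((0, 0, 0), (1, 1, 1), 0, (5, 0, 0), (0, 0, 0))` places a unit box at the origin with its front toward the camera, so the orange front face comes last in the draw order (nearest to the camera). Changing `eye` changes the viewpoint without touching the layout, which is the sense in which the camera pose is embedded in the condition image.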

B. Model Architecture & Conditioning

Base Model: The method builds upon FLUX, a state-of-the-art Diffusion Transformer (DiT) based T2I model.
Token Integration: The rendered OSCR image is encoded via a VAE to produce OSCR tokens. These are concatenated with text prompt tokens and noisy image tokens.
LoRA Adaptation: To adapt the model to OSCR without destroying its pre-trained text-to-image priors, the authors train LoRA (Low-Rank Adaptation) adapters only on the projection matrices associated with the OSCR tokens.
Attention Masking (Object Binding): A key challenge is ensuring that specific text descriptions (e.g., "dog") bind to specific 3D boxes in the OSCR. The authors introduce masked self-attention:
- OSCR tokens within a specific bounding box are masked to attend only to the corresponding noun tokens in the text prompt.
- This prevents "attribute mixing" (e.g., a dog taking on the color of a car) and ensures precise object placement.
- In overlapping regions, OSCR tokens attend to the noun tokens of every box that covers them; the model is then relied on to separate the objects' features and preserve occlusion boundaries.
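
The binding rule described in the bullets above can be sketched as a boolean mask between OSCR tokens and text tokens. This is an illustrative simplification, not the authors' code: boxes are treated as axis-aligned rectangles on the token grid (the real regions would be projected 3D boxes), the function and argument names are hypothetical, and the behavior of OSCR tokens that fall outside every box (here: they attend to no text token) is an assumption.

```python
def build_binding_mask(grid_hw, boxes, noun_tokens, num_text_tokens):
    """Return mask[q][k]: True if OSCR token q may attend to text token k.

    grid_hw:      (rows, cols) of the OSCR token grid
    boxes:        object name -> (row0, col0, row1, col1), half-open, in token units
    noun_tokens:  object name -> list of text-token indices for that object's noun
    """
    rows, cols = grid_hw
    mask = [[False] * num_text_tokens for _ in range(rows * cols)]
    for name, (r0, c0, r1, c1) in boxes.items():
        for r in range(max(r0, 0), min(r1, rows)):
            for c in range(max(c0, 0), min(c1, cols)):
                for t in noun_tokens[name]:
                    # Tokens covered by several boxes accumulate several nouns.
                    mask[r * cols + c][t] = True
    return mask


# Prompt tokens (hypothetical tokenization): ["a", "dog", "behind", "a", "chair"]
example_mask = build_binding_mask(
    grid_hw=(4, 4),
    boxes={"dog": (0, 0, 3, 3), "chair": (1, 1, 4, 4)},
    noun_tokens={"dog": [1], "chair": [4]},
    num_text_tokens=5,
)
# example_mask[0]  -> only "dog" is allowed   (top-left token, dog box only)
# example_mask[15] -> only "chair" is allowed (bottom-right token, chair box only)
# example_mask[5]  -> both "dog" and "chair"  (token inside the overlap)
```

In an attention layer, a mask like this would typically be applied by setting the disallowed query-key logits to negative infinity before the softmax.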

C. Dataset Construction

Since real-world datasets with precise 3D annotations and heavy occlusions are scarce, the authors created a synthetic dataset:

Procedural Generation: 3D assets are placed in a virtual environment (Blender) with controlled camera viewpoints to induce strong inter-object occlusions.
Filtering: Scenes are filtered with visibility-ratio constraints so that occlusion is heavy but every object remains at least partly visible.
Realistic Augmentation: To prevent overfitting to synthetic backgrounds, the authors use a depth-to-image pipeline (FLUX.1-Depth-dev) to generate realistic backgrounds. They apply CLIP-based filtering to ensure the generated backgrounds adhere to the original layout and object identities.
Scale: The dataset comprises 25k rendered images and 25k realistic augmentations.
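
The filtering step above can be illustrated with a small visibility-ratio check. This is a sketch under stated assumptions, not the authors' pipeline: each object is given as the set of pixels it would cover if nothing were in front of it, objects are ordered nearest first (a whole-object depth ordering), and the thresholds `min_visible` and `max_visible` are hypothetical values chosen only for the example.

```python
def rect(r0, c0, r1, c1):
    """Helper: set of (row, col) pixels in a half-open rectangle."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}

def visibility_ratios(masks_near_to_far):
    """masks_near_to_far: list of (name, full unoccluded pixel set), nearest first.
    Returns name -> visible pixels / full pixels."""
    covered = set()
    ratios = {}
    for name, full in masks_near_to_far:
        visible = full - covered          # pixels not hidden by nearer objects
        ratios[name] = len(visible) / len(full) if full else 0.0
        covered |= full
    return ratios

def keep_scene(ratios, min_visible=0.3, max_visible=0.8):
    """Keep a scene only if its most-occluded object is hidden enough to count
    as heavy occlusion (ratio <= max_visible) yet still visible enough to be
    recognizable (ratio >= min_visible). Both thresholds are assumptions."""
    if not ratios:
        return False
    return min_visible <= min(ratios.values()) <= max_visible


scene = [("chair", rect(0, 0, 4, 4)), ("dog", rect(2, 2, 6, 6))]
scene_ratios = visibility_ratios(scene)   # the dog loses a 2x2 corner to the chair
```

Here the dog keeps 12 of its 16 pixels, so its ratio is 0.75 and the scene passes with the example thresholds; a scene with no overlap at all would be rejected.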

3. Key Contributions

OSCR Representation: A novel, efficient 3D scene representation using translucent, color-coded boxes that explicitly encodes occlusion and orientation, enabling the model to "see through" occlusions.
Attention-Based Object Binding: A mechanism using attention masking to semantically bind text descriptions to specific 3D spatial regions, solving the problem of attribute mixing in multi-object scenes.
Occlusion-Aware Training: A synthetic data pipeline specifically designed to train models on heavy occlusion scenarios, a gap in previous literature.
Generalization: The method generalizes to unseen object categories, complex layouts (many objects), and personalized objects (via reference images) without retraining the base model.

4. Results

The authors evaluated SeeThrough3D against state-of-the-art baselines (LooseControl, Build-A-Scene, LaRender, VODiff) on a new benchmark, 3DOcBench.

Quantitative Performance:
- Depth Ordering: Achieved the highest score (1.46), well ahead of the baselines (e.g., 0.82 for LooseControl), indicating better occlusion consistency.
- Objectness Score: Highest score (22.86), showing better adherence to the layout.
- Angular Error: Lowest error (47.92), demonstrating precise orientation control compared to baselines which often suffer from 180° flips.
- Image Quality (KID): Best score (5.43; lower is better), indicating high-fidelity generation.
Qualitative Results:
- Successfully generates complex scenes with 6+ objects and heavy overlaps.
- Preserves the "prior" of the base model, allowing for realistic transparent objects, text rendering, and natural interactions (e.g., a dog riding a bike).
- Personalization: Can generate personalized objects (e.g., a specific dog or chair) adhering to 3D layouts.
User Study: Participants preferred SeeThrough3D over baselines in 95%+ of cases across realism, layout adherence, and prompt alignment.
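
For the angular error reported above, the exact definition used by 3DOcBench is not given in this summary. A common choice, assumed here, is the smallest absolute difference between predicted and target yaw with wrap-around at 360°. The sketch below (function names are hypothetical) shows why a front/back flip is so costly under that definition: it contributes the maximum possible error of 180°.

```python
def angular_error_deg(pred_deg, gt_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = abs(pred_deg - gt_deg) % 360.0
    return min(diff, 360.0 - diff)

def mean_angular_error(preds_deg, gts_deg):
    """Average angular error over paired predictions and targets."""
    errors = [angular_error_deg(p, g) for p, g in zip(preds_deg, gts_deg)]
    return sum(errors) / len(errors)
```

For instance, `angular_error_deg(350, 10)` is 20 rather than 340, and a prediction of 200° against a target of 20° counts as a full 180° flip.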

5. Significance

SeeThrough3D addresses a fundamental limitation in generative AI: the inability to reason about 3D occlusions. By introducing a representation that makes hidden geometry visible to the model, it enables:

Design & Architecture: Precise control over scene composition for visualization.
Gaming & Metaverse: Rapid prototyping of 3D scenes with consistent geometry and occlusion.
Future Research: Establishes a new paradigm for 3D-aware generation that moves beyond simple depth maps to explicit occlusion reasoning, paving the way for more controllable and physically consistent synthetic data generation.

The method demonstrates that with the right representation (OSCR) and binding mechanism (attention masking), a pre-trained text-to-image model such as FLUX can be effectively guided to perform complex 3D reasoning tasks.