SceneTransporter: Optimal Transport-Guided Compositional Latent Diffusion for Single-Image Structured 3D Scene Generation

Imagine you are an architect trying to build a detailed 3D model of a bustling city street, but you only have a single, flat photograph to work from.

The Problem: The "Blob" Builder
Current AI tools are like enthusiastic but clumsy construction crews. When they look at your photo, they can generate a 3D shape that looks like the street. However, they treat the whole scene as one giant, melted blob of clay.

If you want to move a specific tree, the AI can't do it because the tree is fused with the sidewalk and the building next to it.
If you want to change the color of a car, the AI might accidentally paint the whole street red because it can't tell where the car ends and the road begins.
The AI also gets confused, creating "ghost" trees that appear twice (redundancy) or splitting a single house into three separate, floating pieces (mispartition).

The Insight: The "Confused Librarian"
The authors of the SceneTransporter paper realized that the AI's brain (its internal "assignment mechanism") was missing a rulebook. It knew what objects were in the picture, but it didn't know which pixel belonged to which object. It was like a librarian who had all the books but kept shuffling them randomly between shelves, making it impossible to find a specific story.

They discovered that the AI was trying to do too much at once without a clear plan, leading to a messy, tangled 3D world.

The Solution: The "Traffic Controller"
To fix this, the team introduced a new system called Optimal Transport (OT). Think of this as a highly efficient Traffic Controller or a Logistics Manager.

Here is how it works, using a simple analogy:

The Patches (The Cargo): Imagine the input photo is cut into thousands of tiny puzzle pieces (patches). Each piece is a piece of cargo that needs to be delivered.
The Parts (The Warehouses): The AI is trying to build different 3D objects (a house, a car, a tree). Think of these as different warehouses that need to receive cargo.
The Old Way: Previously, the AI was like a chaotic delivery service where every warehouse grabbed whatever cargo it wanted. The "House" warehouse might grab a piece of the "Tree," and the "Tree" warehouse might grab a piece of the "Road." This caused the mess.
The SceneTransporter Way: The new system uses Optimal Transport to calculate the perfect delivery route.
- One-to-One Rule: It enforces a strict rule: "One puzzle piece goes to exactly one warehouse." No sharing, no double-dipping. This ensures the tree stays a tree and the house stays a house.
- The Edge Guard: The system also looks at the "edges" in the photo (like the sharp line between a car and the sky). It acts like a border guard, ensuring that cargo from the "sky" side of the line never gets delivered to the "car" warehouse. This keeps the boundaries crisp.

The Result: A Clean, Editable World
By using this "Traffic Controller" inside the AI's brain, the system generates a 3D scene where every object is distinct and separate.

No More Melting: The tree doesn't melt into the building.
No More Ghosts: You don't get two trees where there should be one.
Editability: Because the AI knows exactly which 3D part belongs to which object, you can move, resize, or recolor individual items in the scene just like you would in a video game.

In Summary
SceneTransporter is like giving the AI a pair of scissors and a glue stick. Instead of melting the whole photo into a 3D blob, it carefully cuts out every object and glues them together in a way that respects their individual boundaries. This turns a messy, uneditable 3D model into a clean, structured, and fully interactive digital world, all from a single picture.

1. Problem Statement

The paper addresses the challenge of generating structured 3D scenes from a single input image. While existing methods can generate individual 3D objects or unstructured monolithic meshes, they struggle to organize a scene into distinct, coherent instances in open-world environments.

The authors identify two critical failure modes in current "compositional" 3D generators (which attempt to generate scenes as a collection of part-level latent tokens):

Structural Mispartition: Semantic instances (e.g., a single chair or house) fail to form disjoint parts; instead, their geometry is scattered across multiple latent tokens.
Geometric Redundancy: Multiple latent tokens compete to describe the same geometric area, leading to overlapping or "entangled" geometry.

The core insight revealed by the authors is that these failures stem from a lack of structural constraints in the model's internal assignment mechanism. Current models rely on implicit learning to associate image patches with 3D parts, which is insufficient for complex, open-world scenes.

2. Methodology: SceneTransporter

The authors propose SceneTransporter, an end-to-end framework that reframes structured 3D scene generation as a global correlation assignment problem. The solution is integrated directly into the denoising loop of a compositional Diffusion Transformer (DiT).

A. Debiased Clustering Probe

Before proposing the solution, the authors developed a diagnostic tool to understand the latent structure:

They used Canonical Correlation Analysis (CCA) to identify shared subspaces (nuisance factors like global style or ground planes) between part-level latents.
By projecting tokens onto the orthogonal complement of this shared subspace, they isolated object-specific variations.
Finding: While the necessary information for correct instance grouping exists within the latent space, the model fails to explicitly organize it. This necessitates an explicit structural constraint mechanism.
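
The probe described above can be illustrated with a small NumPy sketch. It is a simplified illustration under stated assumptions, not the authors' code: the function names (`shared_directions`, `project_out`) are hypothetical, the two "views" are synthetic stand-ins for the latents of two parts, and pairing tokens across parts by row index is an assumption made only so that CCA has paired samples to work with.

```python
import numpy as np

def inv_sqrt(cov):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(cov)
    return v @ np.diag(w ** -0.5) @ v.T

def shared_directions(x, y, k, eps=1e-6):
    """Top-k canonical directions of x (n, d) with respect to y (n, d)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n, d = x.shape
    cxx = x.T @ x / n + eps * np.eye(d)           # regularized covariances
    cyy = y.T @ y / n + eps * np.eye(y.shape[1])
    cxy = x.T @ y / n
    wx, wy = inv_sqrt(cxx), inv_sqrt(cyy)         # whitening transforms
    u, s, _ = np.linalg.svd(wx @ cxy @ wy)        # s holds canonical correlations
    return wx @ u[:, :k], s[:k]

def project_out(tokens, directions):
    """Project tokens (n, d) onto the orthogonal complement of span(directions)."""
    q, _ = np.linalg.qr(directions)               # orthonormal basis, shape (d, k)
    return tokens - (tokens @ q) @ q.T

# Synthetic stand-in: both views share one nuisance factor z (an assumption).
rng = np.random.default_rng(0)
n, d = 2000, 8
z = rng.normal(size=(n, 1))
nuisance = np.eye(d)[0]                           # hypothetical shared direction
x = z * nuisance + 0.3 * rng.normal(size=(n, d))
y = z * nuisance + 0.3 * rng.normal(size=(n, d))

dirs, corrs = shared_directions(x, y, k=1)
x_debiased = project_out(x, dirs)                 # shared component removed
```

After the projection, `x_debiased` has no component along the recovered shared direction, so any clustering run on it is driven by the remaining, view-specific variation.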

B. Optimal Transport (OT) Formulation

SceneTransporter formulates the routing of visual evidence (image patches) to 3D part tokens as an Entropic Optimal Transport (OT) problem.

Objective: Minimize the cost of transporting mass from $L$ image patch features to $N$ part-level latent tokens.
Constraints:
- Exclusivity: Enforces exclusive routing (described as one-to-one), where each patch contributes to only one part while a part may still receive many patches (preventing feature entanglement).
- Coverage: Ensures every part receives a sufficient "budget" of information (preventing parts from being starved).
Solver: The OT plan is computed efficiently using Sinkhorn iterations within the denoising steps.
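
The formulation above can be sketched as a small Sinkhorn loop. This is a minimal NumPy sketch rather than the authors' implementation: the function name `sinkhorn`, the uniform marginals `a` and `b`, the regularization strength `eps`, and the random cost matrix are all assumptions chosen for illustration. The row marginal `a` fixes how much mass each of the $L$ patches sends out, and the column marginal `b` fixes the budget each of the $N$ parts must receive.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.5, n_iters=100):
    """Entropic OT plan for an (L, N) cost with marginals a (L,) and b (N,)."""
    kernel = np.exp(-cost / eps)           # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (kernel @ v)               # rescale rows toward marginal a
        v = b / (kernel.T @ u)             # rescale columns toward marginal b
    return u[:, None] * kernel * v[None, :]

# Toy problem: 6 patches routed to 3 parts (sizes and costs are illustrative).
rng = np.random.default_rng(0)
num_patches, num_parts = 6, 3
cost = rng.random((num_patches, num_parts))
a = np.full(num_patches, 1.0 / num_patches)   # every patch carries equal mass
b = np.full(num_parts, 1.0 / num_parts)       # every part gets an equal budget

plan = sinkhorn(cost, a, b)
assignment = plan.argmax(axis=1)              # most likely part for each patch
```

A soft plan like this spreads each patch's mass over several parts. Lowering `eps` sharpens the plan toward a near-exclusive assignment, typically at the cost of slower convergence.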

C. Key Components

The framework introduces two novel mechanisms to enforce the OT constraints:

OT Plan–Gated Cross-Attention:
- The computed OT transport plan is used to generate a gating signal.
- This signal multiplicatively gates the Keys (K) and Values (V) in the cross-attention mechanism.
- Effect: It enforces a "hard" routing where image patches are strictly assigned to specific 3D parts, preventing leakage and ensuring that attention is focused exclusively on the assigned region.
Edge-Regularized Assignment Cost:
- To prevent the OT solver from merging distinct objects that are spatially adjacent (e.g., a fence touching a post), the cost matrix is regularized using an image edge map.
- The cost function penalizes assignments that cross salient image edges.
- Effect: This encourages region-wise consistency within objects while enforcing sharp boundaries between different instances, even without explicit segmentation masks.
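
Both mechanisms above can be sketched on a toy example. This is an illustrative sketch under stated assumptions, not the paper's formulation: the exact cost and gate definitions are not given in this summary, so `edge_regularized_cost` uses edge-aware smoothing of the cost over neighboring patches as one plausible stand-in, `hard_gate` uses a per-patch argmax, and the plan is a simple softmin stand-in instead of a real Sinkhorn output. All function names and the parameters `lam` and `beta` are hypothetical.

```python
import numpy as np

def edge_regularized_cost(cost, adjacency, edge_strength, lam=0.5, beta=5.0):
    """Smooth an (L, N) cost over neighboring patches, but not across edges.

    adjacency:     (L, L) 0/1 matrix of spatially adjacent patches
    edge_strength: (L, L) edge magnitude in [0, 1] between each patch pair
    """
    affinity = adjacency * np.exp(-beta * edge_strength)  # weak link across edges
    affinity = affinity + np.eye(cost.shape[0])           # keep the patch's own cost
    affinity = affinity / affinity.sum(axis=1, keepdims=True)
    return (1.0 - lam) * cost + lam * (affinity @ cost)

def hard_gate(plan):
    """One-hot (L, N) gate: each patch is routed to its highest-mass part."""
    gate = np.zeros_like(plan)
    gate[np.arange(plan.shape[0]), plan.argmax(axis=1)] = 1.0
    return gate

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_cross_attention(queries, keys, values, gate):
    """queries: (N, T, D) per-part tokens; keys, values: (L, D); gate: (L, N)."""
    outputs = []
    for part in range(queries.shape[0]):
        g = gate[:, part][:, None]                        # (L, 1) gate for this part
        k, v = g * keys, g * values                       # multiplicative gating
        logits = queries[part] @ k.T / np.sqrt(keys.shape[1])
        outputs.append(softmax(logits) @ v)
    return np.stack(outputs)

# Toy scene: 4 patches in a row, 2 parts, one strong edge between patches 1 and 2.
base_cost = np.array([[0.0, 1.0], [0.6, 0.4], [0.4, 0.6], [1.0, 0.0]])
adjacency = np.zeros((4, 4))
edge_strength = np.zeros((4, 4))
for i in range(3):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
edge_strength[1, 2] = edge_strength[2, 1] = 1.0

reg_cost = edge_regularized_cost(base_cost, adjacency, edge_strength)

# Stand-in for a transport plan (an assumption; a Sinkhorn solver would go here).
plan = np.exp(-reg_cost / 0.1)
plan /= plan.sum()
gate = hard_gate(plan)

rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 3, 8))
keys = rng.normal(size=(4, 8))
values = rng.normal(size=(4, 8))
out = gated_cross_attention(queries, keys, values, gate)  # shape (2, 3, 8)
```

In this toy case the base cost alone would send patches 1 and 2 to the wrong side of the edge, while the regularized cost groups patches 0 and 1 into one part and patches 2 and 3 into the other. Note that a zeroed key still receives some softmax weight (its logit is 0), but its zeroed value contributes nothing to the output, so gated-out patches cannot leak features into a part.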

3. Key Contributions

Diagnostic Insight: The design of a Debiased Clustering probe based on CCA, which reveals that the primary bottleneck in compositional 3D generation is the lack of explicit structural constraints in the assignment mechanism, not a lack of feature information.
Novel Framework: The introduction of SceneTransporter, which reframes scene generation as an OT-guided correlation assignment problem. It uniquely combines:
- OT Plan–Gated Cross-Attention for exclusive, one-to-one routing.
- Edge-Regularized Assignment Cost for coherent structure grouping and boundary preservation.
State-of-the-Art Performance: The method achieves superior results in open-world 3D scene generation, significantly improving instance-level coherence and geometric fidelity compared to existing baselines.

4. Experimental Results

The authors evaluated SceneTransporter on a dataset of 74 diverse open-world scene images, comparing it against state-of-the-art methods like MIDI, PartCrafter, and PartPacker.

Quantitative Metrics:
- Geometry Fidelity: Achieved the highest scores on the ULIP, ULIP-2, and Uni3D metrics, indicating better alignment with the input image and better 3D consistency.
- Part Disentanglement: Achieved the second-lowest IoU (Intersection over Union) between parts (lower is better), indicating less overlap between generated parts than all but one baseline.
- Efficiency: Inference time is comparable to PartPacker and significantly faster than MIDI and PartCrafter.
Qualitative Results:
- Visual comparisons show that SceneTransporter generates complete, coherent objects (e.g., whole houses, trees, sofas) with sharp boundaries.
- Baselines often exhibit "semantic fragmentation" (e.g., a roof split across multiple parts) or "feature entanglement" (e.g., ground textures leaking into buildings).
User Study:
- In a blind ranking study with 30 participants, SceneTransporter received the highest preference across Geometry Quality, Layout Coherence, and Segmentation Plausibility.
Ablation Studies:
- Removing the OT gating leads to chaotic attention maps and corrupted geometry.
- Removing the edge regularization results in merged objects at contact boundaries.
- The method converges rapidly (within 3–5 Sinkhorn iterations) and remains stable.
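
The part-disentanglement metric reported above can be illustrated with a short sketch. The paper's exact evaluation protocol is not described in this summary, so the voxel-occupancy representation, the averaging over all part pairs, and the function name `mean_pairwise_iou` are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np

def mean_pairwise_iou(occupancies):
    """Mean IoU over all pairs of parts; occupancies is (N, X, Y, Z) boolean."""
    ious = []
    for i, j in combinations(range(len(occupancies)), 2):
        inter = np.logical_and(occupancies[i], occupancies[j]).sum()
        union = np.logical_or(occupancies[i], occupancies[j]).sum()
        ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious)) if ious else 0.0

# Toy grids: parts 0 and 1 overlap in one slab, part 2 is disjoint from both.
parts = np.zeros((3, 4, 4, 4), dtype=bool)
parts[0, :2] = True     # 32 voxels
parts[1, 1:3] = True    # 32 voxels, 16 shared with part 0
parts[2, 3:] = True     # 16 voxels, no overlap

score = mean_pairwise_iou(parts)   # (16/48 + 0 + 0) / 3
```

A lower score means the generated parts occupy more clearly separated regions of space, which is why lower is better for this metric.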

5. Significance

This work represents a paradigm shift in 3D scene generation. Instead of relying on brittle multi-stage pipelines (segmentation $\to$ generation $\to$ assembly) or implicit learning that fails in complex scenes, SceneTransporter introduces mathematically rigorous structural constraints (Optimal Transport) directly into the generative process.

By solving the assignment problem globally and explicitly, it ensures that the generated 3D scene is not just a fused mesh but a structured collection of distinct, editable instances. This capability is crucial for downstream applications in embodied AI, simulation, and asset management, where object-level disentanglement is a prerequisite for functionality. The method sets a new standard for single-image 3D scene generation, particularly in open-world, unstructured environments.
