ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing

Imagine you are an interior designer, but instead of sketching on paper or dragging and dropping furniture in a video game, you simply talk to your room.

This paper introduces ReSpace, a new AI system that lets you design and edit 3D indoor rooms using plain English. It's like having a magical assistant that understands not just what you want (e.g., "a cozy sofa"), but exactly where it should go, how big it should be, and how it fits with everything else in the room.

Here is a breakdown of how it works, using some everyday analogies:

1. The Problem: The "Blind" and the "Rigid"

Before ReSpace, existing AI tools for designing rooms had two main flaws:

The "Blind" Architect: Some AIs could generate rooms, but they were like architects who couldn't see the walls. They would put a sofa floating in mid-air or a table sticking out of the ceiling because they didn't truly understand the room's boundaries.
The "Rigid" Builder: Other tools were like builders who could only work with perfect square boxes. If your room had a weird shape (like an L-shape or a bay window), they couldn't handle it. Also, if you wanted to swap a chair for a table, they often couldn't do it; they usually had to start the whole room from scratch.

2. The Solution: ReSpace (The "Smart Translator")

ReSpace acts as a translator between your natural language and the 3D world. It breaks the process down into three clever steps:

Step A: The "Blueprint" (Structured Scene Representation)

Instead of trying to guess the room from scratch, ReSpace uses a special digital "blueprint" (called SSR). Think of this like a JSON recipe card.

It lists the room's exact shape (the walls and floor).
It lists every object currently in the room with a description (e.g., "a modern gray sofa"), its size, and its exact coordinates.
Why this matters: Because the room is written as a clear list of facts, the AI can easily read it, understand the boundaries, and know exactly where there is empty space.

Step B: The "Next Word" Game (Autoregressive Generation)

ReSpace treats designing a room like playing a game of fill-in-the-blanks.

You give it a prompt: "Add a dark gray tufted sofa."
The AI looks at the current "recipe" (the room), reads the prompt, and predicts the next few words needed to describe the new sofa's position and size.
It's like a text adventure game where the computer writes the next sentence of the story for you. It does this one object at a time, building the scene piece by piece.

Step C: The "Furniture Catalog" (Asset Sampling)

Once the AI decides where the sofa goes and how big it is, it needs a real 3D model.

ReSpace doesn't just pick the first sofa it finds. It acts like a personal shopper. It looks through a massive catalog of 3D furniture and picks the one that best matches your description ("dark gray," "tufted") and fits the size constraints it just calculated.
If you say "remove the plant," the AI simply deletes that item from the recipe card.

3. The Secret Sauce: "Voxel" Checking

One of the biggest innovations in this paper is how ReSpace checks for mistakes.

Old Way: Imagine checking if a chair fits under a table by looking at a rough cardboard box around the chair. If the box fits, the AI thinks it's good. But in reality, the chair might still be hitting the table leg.
ReSpace's Way (Voxelization): ReSpace turns the room into a giant 3D grid of tiny cubes (like a Minecraft world). It checks every single tiny cube to see whether the chair actually overlaps the table or pokes through a wall or the floor.
This allows it to catch "fine-grained" errors, like a lamp sticking out of the wall or a chair slightly overlapping a rug, ensuring the final scene looks physically realistic.

4. Learning from Feedback (The "Taste Test")

The team didn't just teach the AI to follow rules; they taught it to have good taste.

They used a technique called Preference Alignment. Imagine showing the AI two different ways to place a lamp. One looks weird (too close to the wall), and one looks great.
They told the AI, "The second one is better."
Over time, the AI learned to prioritize arrangements that humans find pleasing, not just arrangements that technically fit in the box.

Summary: What Can You Do With It?

With ReSpace, you can:

Add: "Put a modern spherical lamp in the corner."
Remove: "Take away the plant with the black pot."
Swap: "Replace the bookcase with a wooden wardrobe."
Handle Complex Shapes: It handles non-rectangular rooms (such as L-shapes), not just perfect squares.

In a nutshell: ReSpace is like having a super-smart, patient interior designer who reads your text instructions, understands the physical limits of your room, checks for collisions at the level of tiny voxels rather than rough boxes, and arranges your furniture to look good, all without you ever having to drag a mouse or move a 3D model yourself.

1. Problem Statement

The paper addresses the challenges in 3D indoor scene synthesis and editing. Existing methods suffer from several critical limitations:

Semantic Oversimplification: Many approaches rely on one-hot class encodings (e.g., just "chair") rather than rich natural language descriptions.
Lack of Editing Capabilities: Most models focus on one-shot generation from scratch, lacking the ability to iteratively add, remove, or swap objects based on user instructions.
Geometric Constraints: Methods often ignore complex room boundaries (assuming rectangular layouts) or rely on fixed-resolution floor plans that fail to capture non-convex geometries.
Weak Spatial Reasoning: LLM-based agents often rely on implicit world models that struggle with precise 3D placement, leading to layout violations (e.g., objects floating or intersecting).
Asset Rigidity: Many systems are tightly coupled to specific 3D asset catalogs, making them difficult to deploy across different libraries.

The goal is to create a framework that supports fine-grained, text-driven editing (addition, removal, swapping) and full scene synthesis while maintaining explicit room boundaries and asset-agnostic flexibility.

2. Methodology: ReSpace Framework

ReSpace frames 3D scene manipulation as a next-token prediction task using a specialized Large Language Model (LLM). The pipeline consists of five core components:

A. Structured Scene Representation (SSR)

Instead of using raw meshes or voxel grids, ReSpace uses a lightweight, editable JSON-based SSR. This representation explicitly encodes:

Room Boundaries: Defined as ordered 3D point sets forming rectilinear polygons for the floor (bounds_bottom) and ceiling (bounds_top), supporting non-rectangular layouts.
Object Semantics: Each object is described by a natural language string (desc) capturing style, material, and color, alongside precise numerical values for size, position, and rotation (quaternions).
Asset Agnosticism: The SSR separates the description and geometry of an object from the specific 3D mesh asset. This allows the system to swap assets from different catalogs without altering the underlying scene logic.
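
To make the representation above concrete, here is a minimal sketch of what an SSR for an L-shaped room might look like. Only bounds_bottom, bounds_top, and desc are field names taken from the text; the keys "objects", "size", "pos", and "rot", the y-up axis convention, and the quaternion order are assumptions made for illustration. The helper floor_area is hypothetical and only shows that a non-rectangular floor is easy to reason about once it is an explicit polygon.

```python
import json

# Hypothetical SSR for an L-shaped room. Only bounds_bottom, bounds_top and
# desc are names from the text; "objects", "size", "pos", "rot" are assumed.
scene = {
    "bounds_bottom": [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 2.0],
                      [2.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 0.0, 4.0]],
    "bounds_top": [[0.0, 2.6, 0.0], [4.0, 2.6, 0.0], [4.0, 2.6, 2.0],
                   [2.0, 2.6, 2.0], [2.0, 2.6, 4.0], [0.0, 2.6, 4.0]],
    "objects": [
        {
            "desc": "a modern gray sofa",
            "size": [2.0, 0.8, 0.9],      # assumed order: width, height, depth
            "pos": [1.0, 0.0, 0.5],       # assumed convention: y is up
            "rot": [0.0, 0.0, 0.0, 1.0],  # quaternion, assumed (x, y, z, w)
        }
    ],
}


def floor_area(bounds_bottom):
    """Shoelace formula over the x/z coordinates of the floor polygon."""
    pts = [(p[0], p[2]) for p in bounds_bottom]
    total = 0.0
    for (x1, z1), (x2, z2) in zip(pts, pts[1:] + pts[:1]):
        total += x1 * z2 - x2 * z1
    return abs(total) / 2.0


# The serialized text is what a language model would read and extend.
ssr_text = json.dumps(scene)
```

Because the whole scene is plain JSON, the same text can be parsed, edited, and re-serialized without touching any mesh data.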

B. Autoregressive Scene Synthesis (SG-LLM)

The core engine is SG-LLM (Scene Graph LLM), a model trained to predict the next object in a scene sequence.

Input: An existing SSR (partial scene) + a natural language prompt for the next object (e.g., "add a modern spherical lamp").
Output: A token sequence representing the new object's attributes (description, size, position, rotation) in JSON format.
Training: The model undergoes Supervised Fine-Tuning (SFT) on a dataset derived from 3D-FRONT, where scenes are converted into instruction-response pairs.
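
The input/output contract above can be sketched as a small loop. This is not the authors' code: stub_generate is a placeholder standing in for SG-LLM, the instruction format is invented, and the field names other than desc are assumptions. The point is only that each step reads the partial scene plus one prompt and appends exactly one parsed object.

```python
import json

REQUIRED_FIELDS = {"desc", "size", "pos", "rot"}  # assumed field names


def build_instruction(scene, prompt):
    """Serialize the partial scene and the prompt into one text input."""
    return json.dumps({"scene": scene, "prompt": prompt})


def add_object(scene, prompt, generate):
    """Ask the model for one object and return a new scene that includes it.

    `generate` is any callable mapping instruction text to a JSON string.
    """
    obj = json.loads(generate(build_instruction(scene, prompt)))
    missing = REQUIRED_FIELDS - set(obj)
    if missing:
        raise ValueError(f"model output lacks fields: {sorted(missing)}")
    return {**scene, "objects": scene["objects"] + [obj]}


def stub_generate(instruction):
    """Placeholder for SG-LLM: echoes the prompt as desc, fixed placement."""
    prompt = json.loads(instruction)["prompt"]
    return json.dumps({"desc": prompt, "size": [0.4, 0.4, 0.4],
                       "pos": [1.0, 0.0, 1.0], "rot": [0.0, 0.0, 0.0, 1.0]})


# Build a scene one object at a time, feeding each result back in.
scene = {"bounds_bottom": [], "bounds_top": [], "objects": []}
for p in ["a modern spherical lamp", "a dark gray tufted sofa"]:
    scene = add_object(scene, p, stub_generate)
```

Validating the parsed output before appending it keeps a malformed generation from corrupting the scene that later steps condition on.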

C. Zero-Shot Decomposition & Removal

Object Addition: A zero-shot LLM (e.g., GPT-4o) acts as an interface, decomposing complex user instructions into a sequence of atomic "add" commands.
Object Removal/Swapping: The zero-shot LLM directly edits the SSR JSON to remove objects or swap them, leveraging its text-editing capabilities without needing a specialized training step for removal.
Full Scene Generation: For synthesis from scratch, the zero-shot LLM generates a list of object prompts, which SG-LLM processes autoregressively to build the scene step-by-step.
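
In the system described above, a zero-shot LLM performs removal and swapping by editing the JSON text directly. The sketch below only illustrates what those edits amount to as data operations. The "objects" key, the choice to address objects by index, and the treatment of a swap as a removal followed by a pending add command are all assumptions, not details confirmed by the text.

```python
def remove_object(scene, index):
    """Return a copy of the scene without the object at `index`."""
    objects = scene["objects"]
    if not 0 <= index < len(objects):
        raise IndexError(f"no object at index {index}")
    return {**scene, "objects": objects[:index] + objects[index + 1:]}


def swap_object(scene, index, new_prompt):
    """Treat a swap as a removal plus a pending 'add' command (an assumption)."""
    return remove_object(scene, index), new_prompt


scene = {"objects": [{"desc": "a plant with a black pot"},
                     {"desc": "a tall bookcase"}]}

# "Take away the plant with the black pot."
without_plant = remove_object(scene, 0)

# "Replace the bookcase with a wooden wardrobe."
after_swap, pending_add = swap_object(scene, 1, "a wooden wardrobe")
```

Both helpers return new dictionaries instead of mutating the input, so an earlier scene state stays available if an edit needs to be undone.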

D. Stochastic Asset Sampling

Once SG-LLM generates an object description and size, a Stochastic Asset Sampler retrieves a concrete 3D mesh from a catalog.

It uses a probabilistic distribution $g_\phi$ combining semantic similarity (SigLIP embeddings for text-to-asset matching) and geometric similarity (Gaussian matching of 3D bounding box sizes).
This allows for diverse, realistic asset selection rather than deterministic greedy matching.
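
A rough sketch of such a sampler follows. The text states only that semantic similarity and Gaussian size matching are combined into a distribution; the specific combination used here (a softmax over cosine similarity multiplied by a Gaussian kernel on size), the temperature and sigma values, and the three-dimensional toy embeddings are assumptions. Real SigLIP embeddings would come from an embedding model that is not called here.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def size_score(target, size, sigma=0.1):
    """Gaussian kernel on the squared distance between bounding-box sizes."""
    d2 = sum((t - s) ** 2 for t, s in zip(target, size))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def asset_probs(query_emb, target_size, catalog, temperature=0.1, sigma=0.1):
    """Normalized weights: exp(similarity / temperature) times size score."""
    weights = [
        math.exp(cosine(query_emb, a["emb"]) / temperature)
        * size_score(target_size, a["size"], sigma)
        for a in catalog
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no asset is close enough in size")
    return [w / total for w in weights]


def sample_asset(query_emb, target_size, catalog, rng=random):
    """Draw one asset at random according to the combined distribution."""
    probs = asset_probs(query_emb, target_size, catalog)
    return rng.choices(catalog, weights=probs, k=1)[0]


# Toy catalog: embeddings are made-up stand-ins, sizes are in metres.
catalog = [
    {"id": "sofa_a", "emb": [0.9, 0.1, 0.0], "size": [2.0, 0.8, 0.9]},
    {"id": "sofa_b", "emb": [0.8, 0.2, 0.1], "size": [2.1, 0.8, 0.9]},
    {"id": "lamp_a", "emb": [0.0, 0.1, 0.9], "size": [0.3, 1.5, 0.3]},
]
probs = asset_probs([1.0, 0.0, 0.0], [2.0, 0.8, 0.9], catalog)
choice = sample_asset([1.0, 0.0, 0.0], [2.0, 0.8, 0.9], catalog,
                      rng=random.Random(0))
```

Sampling instead of taking the top match means repeated runs with the same prompt can return different but still plausible assets, which is where the diversity comes from.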

E. Preference Alignment & Evaluation (RLVR & VBL)

Voxelization-Based Loss (VBL): The authors introduce a novel evaluation metric that goes beyond 3D bounding boxes. It voxelizes the scene boundaries and object meshes to detect fine-grained geometric violations, such as:
- Out-of-Bounds (OOB): Objects intersecting walls or floors.
- Mesh Boundary Loss (MBL): Objects intersecting each other (e.g., a chair leg passing through a table).
Reinforcement Learning with Verifiable Rewards (RLVR): SG-LLM is further refined using Rejection Sampling Fine-Tuning (RFT). The VBL and a Prompt Matching Score (PMS) serve as verifiable rewards to filter high-quality generations, aligning the model with human preferences for realistic layouts.
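
The voxel-based check described above can be sketched with sets of integer voxel indices. This is a simplified illustration: axis-aligned boxes stand in for real object meshes, coordinates are already given in voxel units, and taking the total as the plain sum of the two counts is an assumption. The voxel resolution and any normalization used in the actual metric are not given in the text.

```python
from itertools import combinations, product


def voxelize_box(lo, hi):
    """Set of integer voxel indices covered by the half-open box [lo, hi)."""
    return set(product(*(range(a, b) for a, b in zip(lo, hi))))


def oob(room_voxels, object_voxels):
    """Out-of-bounds count: object voxels that lie outside the room volume."""
    return sum(len(v - room_voxels) for v in object_voxels)


def mbl(object_voxels):
    """Object-object count: voxels shared by each pair of objects."""
    return sum(len(a & b) for a, b in combinations(object_voxels, 2))


def vbl(room_voxels, object_voxels):
    """Total violation count (assumed here to be OOB plus MBL)."""
    return oob(room_voxels, object_voxels) + mbl(object_voxels)


# Toy scene in voxel units; boxes are stand-ins for voxelized meshes.
room = voxelize_box((0, 0, 0), (20, 10, 20))
table = voxelize_box((2, 0, 2), (8, 4, 6))
chair = voxelize_box((7, 0, 3), (10, 4, 5))   # overlaps the table edge
lamp = voxelize_box((19, 0, 0), (22, 5, 2))   # pokes through a wall
objects = [table, chair, lamp]

oob_count = oob(room, objects)
mbl_count = mbl(objects)
vbl_count = vbl(room, objects)
```

With real meshes the voxel sets would follow the object's actual shape, so a chair seat sliding under a tabletop would register no overlap even though the two bounding boxes intersect.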

3. Key Contributions

ReSpace Framework: A novel, text-driven autoregressive system for 3D indoor scene synthesis and editing that supports addition, removal, and swapping via natural language.
Structured Scene Representation (SSR): A lightweight, JSON-based abstraction with explicit room boundaries and natural language object descriptions, enabling asset-agnostic deployment and direct editing.
Specialized SG-LLM: A model trained via SFT and preference alignment (RFT) that outperforms state-of-the-art methods in object placement accuracy and scene coherence.
Voxelization-Based Loss (VBL): A new evaluation metric that captures fine-grained geometric interactions (mesh-level collisions) rather than just bounding box overlaps, providing a more realistic measure of layout quality.
Superior Performance: Demonstrates state-of-the-art results on object addition and achieves higher human-perceived quality in full scene synthesis compared to end-to-end trained baselines, despite not being trained on full scene generation directly.

4. Experimental Results

The method was evaluated on the 3D-FRONT dataset across three splits: bedrooms ('bed'), living/dining rooms ('liv'), and a combined set ('all').

Object Addition: ReSpace significantly outperformed baselines (ATISS, Mi-Diff, LayoutGPT) in layout violation metrics.
- VBL Reduction: Achieved a VBL of 14.66 (bed) and 10.62 (liv) compared to ATISS (111.24 and 75.30) and Mi-Diff (78.31 and 56.75).
- Prompt Matching: Achieved high Prompt Matching Scores (PMS > 0.87), indicating strong adherence to user instructions.
Full Scene Synthesis:
- While baselines had slightly better FID/KID scores (indicating closer distribution to training data), ReSpace achieved substantially lower layout violations (e.g., VBL of 134.8 vs. 514.1 for ATISS on 'bed').
- Human Evaluation: In a study with 334 participants comparing 10,307 scene pairs, ReSpace achieved a 75.3% win rate, ahead of the second-best method (Mi-Diff, 56.7%), suggesting that lower geometric violations translate to better perceived quality.
Ablation Studies:
- Test-Time Compute: Using Best-of-N (BoN=8) and rotation/shuffling augmentation further reduced VBL and improved human win rates.
- Preference Alignment: RFT (Rejection Sampling Fine-Tuning) was found to be the most effective alignment strategy for improving human-perceived quality, outperforming DPO and GRPO.
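
Best-of-N selection, as mentioned in the ablations above, can be sketched in a few lines. The generator and scoring function here are hypothetical toys: propose stands in for sampling one placement from the model, and overhang stands in for a violation score such as VBL. Only the idea of drawing N candidates and keeping the one with the lowest score comes from the text.

```python
import random

ROOM_WIDTH = 4.0    # toy one-dimensional room, in metres (assumed)
OBJECT_WIDTH = 1.0  # toy object width, in metres (assumed)


def propose(rng):
    """Stand-in for one model sample: a left-edge position for the object."""
    return rng.uniform(-1.0, ROOM_WIDTH)


def overhang(x):
    """Stand-in violation score: how far the object sticks out of the room."""
    left = max(0.0, -x)
    right = max(0.0, x + OBJECT_WIDTH - ROOM_WIDTH)
    return left + right


def best_of_n(generate, score, n=8, rng=None):
    """Draw n candidates and keep the one with the lowest violation score."""
    rng = rng or random.Random()
    candidates = [generate(rng) for _ in range(n)]
    return min(candidates, key=score)


single = propose(random.Random(0))
best = best_of_n(propose, overhang, n=8, rng=random.Random(0))
```

Rejection sampling for fine-tuning follows a related pattern: candidates are scored with the same kind of verifiable reward, and only the ones that pass are kept as training examples.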

5. Significance and Impact

Bridging LLMs and 3D Graphics: ReSpace demonstrates that LLMs can be effectively specialized for complex 3D spatial reasoning tasks when paired with structured representations and verifiable geometric rewards.
Iterative Design Workflow: By enabling fine-grained editing (add/remove/swap), the framework aligns with real-world interior design workflows, moving beyond "one-shot" generation to interactive refinement.
Asset Agnosticism: The decoupling of scene logic from specific 3D assets makes the system highly adaptable to different catalogs, a crucial feature for real-world applications like virtual staging or robotics where inventory varies.
New Evaluation Standard: The introduction of VBL provides a more rigorous standard for evaluating 3D scene generation, highlighting the limitations of bounding-box-based metrics in capturing realistic object interactions.

In conclusion, ReSpace establishes a new state-of-the-art for text-driven 3D scene synthesis, proving that combining structured representations, autoregressive language modeling, and geometric verification can produce high-quality, editable, and semantically rich indoor environments.