RelaxFlow: Text-Driven Amodal 3D Generation

Imagine you are looking at a mysterious object in a dark room. You can only see the top of it. It looks like a wooden backboard.

Your brain immediately tries to guess what the rest of the object is. Is it a bed? A sofa? Or maybe a dressing table? Because you can't see the rest, your brain is stuck guessing.

In the world of AI, existing 3D generators are like a person with very rigid memory. If they see that wooden backboard, they might say, "I've seen this before! It's definitely a bed," and they will build a bed, even if you wanted a sofa. They get "stuck" on what they can see and ignore what you want.

RelaxFlow is a new AI method that solves this problem. It lets you tell the AI exactly what you want the hidden parts to be, while making sure the parts you can see stay exactly the same.

Here is how it works, using some simple analogies:

1. The Problem: The "Over-Fitted" Artist

Imagine an artist so fixated on copying the few brushstrokes you gave them that they fill in the rest of the painting from habit rather than from your instructions. Show them a tiny bit of a cat's ear and they will draw a whole animal, but whether it turns out to be a tiger, a lion, or a house cat depends on what they "usually" see, not on what you asked for. They can't handle the ambiguity.

Current AI models do this. They are "over-fitted" to the visible pixels. If you want them to draw a sofa behind that wooden board, they just can't do it; they are too busy copying the board.

2. The Solution: The "Dual-Track" System

RelaxFlow acts like a construction crew with two specialized teams working on the same house, but with different rules:

Team A (The Strict Inspector): Their only job is to look at the visible parts (the wooden board) and say, "Do not touch this! Keep these pixels exactly as they are." They are rigid and strict.
Team B (The Dreamer): Their job is to imagine the rest of the house based on your text prompt (e.g., "Build a sofa"). But here's the catch: The Dreamer is usually too specific. They might try to draw a specific red sofa with a specific scratch on the armrest, which might clash with the wooden board.

RelaxFlow's Secret Sauce: It tells Team B (The Dreamer) to relax.

3. The "Low-Pass Filter": Blurring the Details

This is the most clever part. The paper uses a concept called a "Low-Pass Filter."

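Imagine you are listening to a song on the radio, but there is a lot of static noise (high-pitched hissing). A low-pass filter removes the hiss and keeps the melody. The same split applies to the picture of an object: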

The High Frequencies are the specific details: the exact color of the sofa, the specific pattern on the fabric, the tiny scratches.
The Low Frequencies are the big picture: "It's a sofa," "It has a back," "It has arms."

RelaxFlow takes Team B's "Dream" and puts a blur over the high-frequency details. It says to the AI: "Forget the specific red color or the scratch. Just focus on the general shape of a sofa."

By blurring out the specific details, the AI stops fighting with Team A (the Strict Inspector). The Dreamer now provides a "soft guide" that says, "The shape should be a sofa," without trying to force a specific texture that might ruin the wooden board.

4. The "Consensus" Trick

To make sure the "Dreamer" gets the right idea, RelaxFlow doesn't rely on a single reference image. It asks for multiple examples and looks for a consensus among them.

If you say "Sofa," the AI looks at 3 or 4 different pictures of sofas.

One is red, one is blue, one is leather, one is cloth.
The AI looks at all of them and realizes: "Okay, they all have a back and arms, but the colors and textures are different."
It keeps the common shape (the sofa structure) and ignores the conflicting details (the colors).

This creates a "safe zone" for the AI to build the hidden parts without messing up the visible parts.

5. The Result: A Perfect Blend

Finally, RelaxFlow mixes the two teams' work:

Where you can see the object, it uses the Strict Inspector's work (keeping the original pixels perfect).
Where the object is hidden, it uses the Dreamer's "blurred" guide to build the rest of the shape.

In summary:
RelaxFlow is like a smart editor who knows how to listen to your instructions ("Make it a sofa!") without erasing the original photo you gave them. It does this by telling the AI to stop worrying about tiny details and just focus on the big shape, ensuring the final 3D object looks real, matches your text, and respects the original image.

Why is this a big deal?

Before this, if you wanted to change an object in a photo (e.g., turn a hidden bed into a hidden sofa), you had to choose between:

Keeping the photo perfect but getting the wrong object.
Getting the right object but ruining the photo.

RelaxFlow lets you have both. It's a major step forward for Virtual Reality (VR) and Robotics, where machines need to understand that a hidden object could be many different things, and they need to be able to guess the right one based on what you tell them.

1. Problem Definition

The paper addresses a critical limitation in current Image-to-3D generation: semantic ambiguity under occlusion.

The Challenge: When an object is partially occluded, standard feedforward models (e.g., SAM3D, TRELLIS) rely solely on visible pixels. Without semantic guidance, they often "collapse" into a single, overfitted shape based on the most likely dataset prior (e.g., interpreting a visible wooden backboard as a bed, even if the user intended a sofa).
The Gap: Existing methods struggle to balance two conflicting constraints:
1. Observation Fidelity: Strictly preserving the pixel-level details of the visible, unoccluded regions.
2. Prompt Following: Completing the occluded regions according to a user's text prompt.
Current Limitations: Optimization-based methods enforce prompt adherence but often distort the visible evidence. Feedforward models preserve evidence but lack controllability for the unseen parts. Both approaches fail to decouple the control granularity required for these distinct objectives.

2. Methodology: RelaxFlow

The authors propose RelaxFlow, a training-free, dual-branch inference framework designed to decouple control granularities. It treats the generation as a dual-objective Ordinary Differential Equation (ODE) flow.

Core Concept: Decoupling Control Granularity

The framework recognizes that the input observation requires rigid control (hard constraints), while the text prompt serves as relaxed structural control (soft constraints).

Key Components:

Dual-Branch Architecture:
- Observation Branch: Driven by the input image ( $c_{obs}$ ). It preserves high-frequency details and strictly adheres to visible pixels.
- Semantic-Prior Branch: Driven by the text prompt ( $c_{prior}$ ). It guides the completion of occluded regions but is intentionally "relaxed" to avoid conflicting with the observation.
Multi-Prior Consensus Module:
- Since modern 3D generators are typically visual-token conditioned, the text prompt is converted into a set of $N$ reference images (priors) via retrieval or text-to-image generation.
- These priors share the semantic category but vary in instance-specific details (texture, style).
- By feeding these into a cross-attention mechanism, the model extracts a consensus of the global structure at inference time (no training is involved) while suppressing conflicting, high-frequency instance details.
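
The consensus idea can be illustrated with a toy example. This is a minimal sketch, not the paper's implementation: each of three hypothetical prior images is reduced to a single three-number token (two "shared structure" values and one "instance-specific" value), and a single query attends over all of them. The token layout, the numbers, and the function names are all assumptions made for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """Single-query scaled dot-product attention over key/value vectors."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical prior tokens: [structure, structure, instance-specific detail].
# All three agree on structure and disagree on the detail (e.g. color).
priors = [
    [1.0, 1.0, 0.9],
    [1.0, 1.0, -0.8],
    [1.0, 1.0, -0.1],
]

# A query that only matches on structure attends to all priors equally.
query = [1.0, 1.0, 0.0]
consensus = cross_attention(query, priors, priors)
```

In this toy setup the shared structure values come through unchanged, while the disagreeing detail values average out to roughly zero. Real prior tokens would be high-dimensional features from an image encoder, so the cancellation would be partial rather than exact.
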
Low-Pass Relaxation Mechanism (Theoretical Core):
- Implementation: The authors apply a Gaussian blur to the cross-attention logits within the Semantic-Prior Branch before the softmax operation.
- Theoretical Justification: The paper proves that this smoothing is mathematically equivalent to applying a low-pass filter to the generative vector field.
  - Effect: It suppresses high-frequency noise (instance-specific hallucinations and texture conflicts) while preserving low-frequency signals (global geometric structure).
  - Result: The semantic branch provides a "coarse corridor" for the shape (e.g., "sofa") without dictating fine details, allowing the Observation Branch to anchor the visible pixels without interference.
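
The relaxation step described above (blur the logits, then apply softmax) can be sketched on a single row of attention logits. This is an illustrative sketch under stated assumptions: the blur runs along a 1D key axis with clamped edges, and the kernel radius is set to about three standard deviations. The summary does not specify the axis, padding, or kernel size the authors use, and the function names here are hypothetical.

```python
import math

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel with radius of about 3 * sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_logits(logits, sigma):
    """Blur one row of attention logits along the key axis."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(logits)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp at the edges
            acc += w * logits[j]
        smoothed.append(acc)
    return smoothed

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# One key dominates the raw logits (a very "specific" match).
logits = [0.1, 0.2, 6.0, 0.3, 0.1, 0.2, 0.1]
sharp = softmax(logits)                               # nearly one-hot
relaxed = softmax(smooth_logits(logits, sigma=1.0))   # spread over neighbors
```

After smoothing, the same key still receives the most attention, but the distribution is much flatter, so the branch passes on a coarse preference rather than a sharp, instance-specific one.
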
Visibility-Aware Fusion:
- The two branches are fused via a time-dependent and spatially-aware interpolation.
- Temporal: Early generation steps rely more on the Semantic-Prior to establish the global mode; later steps rely on the Observation Branch to refine details.
- Spatial: A visibility mask estimates which voxels are occluded. The Semantic-Prior only influences occluded regions, while the Observation Branch strictly governs visible surfaces.
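
The fusion rule can be sketched as a per-voxel blend of two velocity fields followed by one Euler step of the ODE. This is a minimal sketch under assumptions not taken from the paper: time runs from 0 (start of generation) to 1 (end), the prior weight decays linearly as `1 - t`, and the visibility mask is binary. `prior_weight`, `fuse_velocity`, and `euler_step` are hypothetical names.

```python
def prior_weight(t):
    """Assumed schedule: the prior dominates early and fades out by t = 1."""
    return max(0.0, 1.0 - t)

def fuse_velocity(v_obs, v_prior, visible, t):
    """Blend per-voxel velocities; visible voxels follow the observation only."""
    w = prior_weight(t)
    fused = []
    for vo, vp, is_visible in zip(v_obs, v_prior, visible):
        if is_visible:
            fused.append(vo)                       # rigid control
        else:
            fused.append((1.0 - w) * vo + w * vp)  # relaxed control
    return fused

def euler_step(x, velocity, dt):
    """One explicit Euler step of dx/dt = velocity."""
    return [xi + dt * vi for xi, vi in zip(x, velocity)]

# Toy 4-voxel example: the two branches disagree everywhere.
v_obs = [1.0, 1.0, 1.0, 1.0]
v_prior = [-2.0, -2.0, -2.0, -2.0]
visible = [True, True, False, False]

early = fuse_velocity(v_obs, v_prior, visible, t=0.1)  # occluded: about -1.7
late = fuse_velocity(v_obs, v_prior, visible, t=0.9)   # occluded: about 0.7
x_next = euler_step([0.0, 0.0, 0.0, 0.0], early, dt=0.1)
```

Visible voxels always follow the observation branch. Occluded voxels follow the prior early on and drift back toward the observation branch as `t` approaches 1, which matches the temporal and spatial behavior described above.
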

3. Key Contributions

Formalization of Text-Driven Amodal 3D Generation: A new task setting where text prompts explicitly resolve occlusion-induced ambiguity while strictly preserving input observation.
RelaxFlow Framework: A training-free, dual-branch inference strategy that decouples rigid observation control from relaxed semantic control.
Theoretical Proof: A rigorous proof demonstrating that the proposed relaxation mechanism (logit smoothing) acts as a low-pass filter on the generative vector field, reducing semantic estimation error and tightening the Wasserstein distance bound to the ground truth.
New Benchmarks: Introduction of two diagnostic datasets:
- ExtremeOcc-3D: Focuses on extreme occlusion where visible evidence is insufficient to determine object category.
- AmbiSem-3D: Focuses on semantic branching, where a single image supports multiple valid semantic interpretations (e.g., a shape that could be a chair or a lamp) resolved by text.

4. Experimental Results

The method was evaluated on SAM3D and TRELLIS backbones.

Quantitative Performance:
- On ExtremeOcc-3D, RelaxFlow significantly outperformed baselines. For SAM3D, it improved CLIP-text score (24.08 $\to$ 27.26) and reduced Point-FID (100.38 $\to$ 81.11), indicating better semantic alignment and 3D quality.
- Crucially, its CLIP-image and LPIPS scores remained competitive, indicating that it did not degrade the fidelity of the visible input.
Qualitative Performance:
- In AmbiSem-3D, RelaxFlow successfully generated distinct 3D shapes (e.g., a sofa vs. a bed) from the same occluded image based on text prompts, whereas baselines collapsed to a single overfitted shape.
Ablation Studies:
- Removing the Low-Pass Relaxation degraded performance (Point-FID increased), confirming its role in stabilizing semantic guidance.
- Removing the Visibility Mask caused significant drops, highlighting the necessity of spatially isolating occluded regions.
- The method is robust to the number of priors ( $N$ ) and the smoothing strength ( $\sigma$ ).
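
To put the SAM3D numbers on ExtremeOcc-3D in relative terms, the short calculation below converts the reported before and after values into percentage changes. Only the four scores come from the summary above; the helper name is an assumption.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0

# Scores reported for SAM3D on ExtremeOcc-3D.
clip_text_change = relative_change(24.08, 27.26)    # higher is better
point_fid_change = relative_change(100.38, 81.11)   # lower is better
```

This works out to roughly a 13.2% gain in CLIP-text score and a 19.2% reduction in Point-FID.
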

5. Significance

Bridging the Gap: RelaxFlow solves the tension between "observation overfitting" and "semantic hallucination" without requiring model retraining or fine-tuning.
Theoretical Insight: The work provides a novel theoretical link between attention smoothing and low-pass filtering in generative flows, offering a principled way to extract structural guidance while suppressing noise.
Practical Application: It enables robust 3D asset creation for AR/VR and robotics in scenarios where objects are partially hidden, allowing users to explicitly define the hidden structure via natural language.
Efficiency: As a plug-and-play module, it adds negligible computational overhead, making it applicable to existing state-of-the-art feedforward 3D generators.
