CRAG: Can 3D Generative Models Help 3D Assembly?

Imagine you are a detective trying to solve a mystery, but instead of finding clues on the ground, you are handed a pile of shattered glass from a broken vase. Your job is twofold:

The Puzzle: Figure out exactly how to glue the pieces you have back together.
The Imagination: Figure out what the entire vase looked like before it broke, including the pieces that are completely missing.

For a long time, computer scientists tried to solve this by just doing the first part. They built algorithms that were great at sliding the pieces around until they fit, but if a piece was missing, the algorithm just gave up or left a gaping hole. It was like trying to finish a jigsaw puzzle while staring at a blank wall where half the picture should be.

Enter CRAG (Coupled ReAssembly and Generation). This new paper proposes a smarter way to think about the problem. Instead of treating the "puzzle solving" and the "imagining" as two separate tasks, CRAG does them at the same time, letting them help each other.

Here is how it works, using some everyday analogies:

1. The "Two-Way Street" of Thinking

Think of CRAG as a conversation between two experts in a room:

Expert A (The Assembler): Looks at the physical shards on the table. "This piece has a curve that matches this other piece. They must go here."
Expert B (The Generator): Has a mental image of the whole vase. "I know this vase is round and has a handle. If I see a flat edge here, that means the handle must be on the other side, even if I can't see it."

The Magic: In older methods, these experts didn't talk to each other. Expert A would try to fit pieces, get confused, and fail. Expert B would try to draw the vase, but without the pieces, the drawing might be wrong.
In CRAG: They talk constantly.

Expert A says, "Hey, these two pieces fit perfectly!" -> Expert B updates their mental image: "Ah, so the vase is wider than I thought."
Expert B says, "I know this vase has a handle here." -> Expert A says, "Oh! That explains why this piece is floating in mid-air; it's actually part of the handle!"

This back-and-forth conversation allows the computer to hallucinate (or generate) the missing parts of the object while simultaneously locking the existing pieces into the correct position.

2. The "Shared Language" (The VAE)

For these two experts to talk, they need to speak the same language. The paper uses the VAE (variational autoencoder) of a pre-trained 3D generative model called TripoSG as a shared dictionary.

Imagine you have a library of millions of 3D objects (chairs, bones, vases). TripoSG has "read" all of them and understands the general "shape" of the world.
CRAG uses this library as a foundation. When it sees a broken bone fragment, it doesn't just see "bone"; it sees "a piece that belongs to a bone that usually looks this way." This gives it a huge head start.

3. Why This Matters (The Real-World Impact)

The paper tests this on things like broken pottery, shattered glass, and even ancient fossilized bones.

The Old Way: If you lost a piece of a dinosaur bone, the computer would leave a gap. You'd have a skeleton with a missing leg.
The CRAG Way: The computer looks at the remaining leg bones, realizes "This is a T-Rex," and uses its knowledge of T-Rex anatomy to grow back the missing leg, then snaps the existing pieces into place.

The "Secret Sauce": Joint Flow

The technical term they use is "Joint Flow Matching." Think of it like a river flowing toward a destination.

In the past, the river tried to flow to the "Assembly" destination and the "Generation" destination separately, often crashing into rocks (errors).
CRAG creates a single riverbed where the water flows toward both goals simultaneously. As the water (the data) moves, it smooths out the path for both tasks. If the assembly gets stuck, the generation pulls it forward. If the generation gets lost, the assembly anchors it.

In Summary

CRAG is like giving a robot a brain that doesn't just look at the pieces in front of it, but also holds a strong, flexible memory of what the whole object should look like. By letting the "pieces" and the "whole picture" argue and agree with each other, the robot can fix broken things even when parts are missing, producing a complete, plausible 3D object from an incomplete set of fragments.

It's the difference between trying to fix a broken clock by only looking at the gears you have, versus having a master clockmaker who knows exactly how the clock should tick, allowing them to rebuild the missing gears and fix the ones you found.

1. Problem Definition

3D Assembly involves reconstructing a complete 3D object from a set of observed parts or fractured fragments.

Current Limitations: Most existing methods treat assembly as a pure pose estimation problem. They predict rigid transformations (SE(3)) for observed parts to align them. While effective for complete sets, these methods fail when parts are missing, eroded, or partially scanned. They cannot synthesize new geometry to fill gaps, leading to incomplete or structurally incoherent reconstructions.
Human Intuition: Human experts do not just align local fragments; they iteratively hypothesize the unseen whole to resolve ambiguities in fragment placement. They use a global shape hypothesis to guide local alignment and fill missing regions.
Core Challenge: How to unify assembly (local pose estimation) and generation (holistic shape synthesis) so they mutually reinforce each other, particularly in the presence of missing data.

2. Methodology: CRAG Framework

The authors propose CRAG (Coupled ReAssembly and Generation), a unified framework based on joint flow-matching.

A. Core Architecture

CRAG employs a Mixture-of-Transformers architecture with two parallel branches that share a latent space:

Assembly Branch: Predicts the SE(3) pose (rotation and translation) for each input fragment. It models the assembly process as a continuous flow on the manifold $SO(3) \times \mathbb{R}^3$.
Generation Branch: Synthesizes the complete 3D shape in a latent space. It uses a flow-matching approach to denoise from Gaussian noise to a clean shape latent.
Shared VAE: Both branches use the pre-trained TripoSG VAE. This provides a shared "language" (latent space) where variable-size fragment sets and whole-shape generations can interact. Because fragments and whole shapes are encoded into the same space, the two tasks can exchange information, and gradients from one task can shape the other during joint training.
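The two flows described above can be made concrete with a minimal sketch of how a flow-matching training sample might be built: one for a fragment pose on $SO(3) \times \mathbb{R}^3$ and one for a shape latent. It assumes the common straight-line path in Euclidean space and the geodesic path on SO(3); this summary does not state which interpolant, noise distribution, or time schedule CRAG actually uses. All function names here (`so3_exp`, `so3_log`, `pose_flow_sample`, `latent_flow_sample`) are hypothetical and exist only for illustration.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its 3x3 skew-symmetric matrix."""
    x, y, z = w
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector -> rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3) + hat(w)  # first-order approximation near identity
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def so3_log(R):
    """Rotation matrix -> axis-angle vector (numerically unstable near angle pi)."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return theta / (2.0 * np.sin(theta)) * w

def pose_flow_sample(R0, p0, R1, p1, t):
    """Pose on the path from a noisy pose (R0, p0) to the target (R1, p1) at time t.

    Returns the interpolated pose and the velocity targets a network
    would be trained to regress (assumed parameterisation).
    """
    w = so3_log(R0.T @ R1)        # constant body-frame angular velocity
    R_t = R0 @ so3_exp(t * w)     # geodesic interpolation on SO(3)
    p_t = (1.0 - t) * p0 + t * p1 # straight line in R^3
    return R_t, p_t, w, p1 - p0

def latent_flow_sample(z0, z1, t):
    """Straight-line path from Gaussian noise z0 to a clean shape latent z1."""
    return (1.0 - t) * z0 + t * z1, z1 - z0

# Usage: one training sample at a random time t in [0, 1].
rng = np.random.default_rng(0)
R_noise = so3_exp(rng.normal(scale=0.5, size=3))
R_target = so3_exp(np.array([0.1, 0.2, -0.3]))
R_t, p_t, ang_vel, lin_vel = pose_flow_sample(
    R_noise, rng.normal(size=3), R_target, np.zeros(3), t=rng.uniform()
)
```

At `t = 0` the sample equals the noisy pose and at `t = 1` it equals the target, so a network that regresses `ang_vel` and `lin_vel` learns a velocity field that transports noise to the assembled configuration.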

B. Key Components

Joint Adapter: A critical module inserted at each transformer layer to enable bidirectional information exchange:
- Assembly $\to$ Generation: Fragment features inform the generation branch about structural constraints, helping to disambiguate the global shape.
- Generation $\to$ Assembly: The imagined whole provides holistic shape priors to guide the alignment of fragments, resolving ambiguities where local cues are insufficient.
- Mechanism: Uses bidirectional cross-attention. For training stability, the adapter's output projection layers are initialized with zero weights, so each adapter initially leaves its inputs unchanged (an identity mapping) and the coupling is learned gradually.
Two-Stage Training Strategy:
1. Warm-up: Train only the Assembly Branch (100k steps) to learn pose estimation.
2. Joint Fine-tuning: Activate the Generation Branch and Joint Adapters, training the entire model jointly (150k steps). This allows the model to learn the coupling between assembly and generation.
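The Joint Adapter and the two-stage schedule described above can be sketched as follows. This is an illustrative NumPy toy, not the authors' implementation: it assumes single-head attention, a residual connection around each cross-attention (which is what makes a zero-initialized output projection act as an identity), and no normalization layers. The class names `CrossAttention` and `JointAdapter`, the helper `trainable_modules`, and the module labels it returns are hypothetical; only the 100k-step warm-up length comes from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class CrossAttention:
    """Single-head cross-attention with a zero-initialized output projection."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.Wq = rng.normal(0.0, scale, (dim, dim))
        self.Wk = rng.normal(0.0, scale, (dim, dim))
        self.Wv = rng.normal(0.0, scale, (dim, dim))
        self.Wo = np.zeros((dim, dim))  # zero init: the module outputs 0 at first

    def __call__(self, queries, context):
        q, k, v = queries @ self.Wq, context @ self.Wk, context @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        return attn @ v @ self.Wo

class JointAdapter:
    """Bidirectional exchange between assembly tokens and generation tokens."""

    def __init__(self, dim, rng):
        self.gen_to_asm = CrossAttention(dim, rng)  # whole-shape prior -> fragments
        self.asm_to_gen = CrossAttention(dim, rng)  # fragment evidence -> whole shape

    def __call__(self, asm_tokens, gen_tokens):
        # Residual connections (assumed): with Wo == 0 both outputs equal the inputs.
        new_asm = asm_tokens + self.gen_to_asm(asm_tokens, gen_tokens)
        new_gen = gen_tokens + self.asm_to_gen(gen_tokens, asm_tokens)
        return new_asm, new_gen

WARMUP_STEPS = 100_000  # stage 1 length stated in the text

def trainable_modules(step):
    """Which parts are trained at a given step under the two-stage schedule."""
    if step < WARMUP_STEPS:
        return {"assembly"}
    return {"assembly", "generation", "joint_adapter"}

# Usage: 5 fragment tokens and 16 shape-latent tokens of width 8.
rng = np.random.default_rng(0)
adapter = JointAdapter(dim=8, rng=rng)
asm = rng.normal(size=(5, 8))
gen = rng.normal(size=(16, 8))
asm_out, gen_out = adapter(asm, gen)  # identical to the inputs at initialization
```

Starting from an identity mapping means the warmed-up assembly branch is not disturbed when the adapters are switched on; the coupling only takes effect as the output projections move away from zero during joint fine-tuning.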

C. Input Handling

Fragments: Variable numbers of fragments with varying sizes are sampled and encoded via the shared VAE.
Optional Reference Image: The framework can condition on a reference image (via DINOv2 features) to further guide generation, though it remains robust without one.

3. Key Contributions

New Capability: The authors present CRAG as the first framework to simultaneously assemble input fragments and synthesize plausible complete shapes, making it robust to missing parts.
New Formulation: It reformulates 3D assembly as a coupled reassembly-and-generation objective, moving beyond pure pose estimation to a joint flow-matching framework.
State-of-the-Art (SOTA) Performance: Achieves SOTA results on standard benchmarks (PartNeXt, Breaking Bad) and introduces a new bone fragment dataset from MorphoSource for future research.
Bidirectional Refinement: Demonstrates that part-level evidence can disambiguate image-conditioned generation, while global shape priors improve assembly accuracy.

4. Experimental Results

The authors evaluated CRAG on PartNeXt (semantic parts) and Breaking Bad (fractured objects), including a challenging "Missing Part" setting.

Quantitative Performance:
- Complete Parts: CRAG outperforms baselines (GARF, RPF, Assembler) in Rotation Error (RE), Translation Error (TE), Part Accuracy (PA), and Chamfer Distance (CD).
- Missing Parts: CRAG shows significant robustness. For example, on the Breaking Bad dataset with missing parts, CRAG achieves 92.03% Part Accuracy and 0.52 CD, vastly outperforming GARF (85.55% PA, 3.43 CD) and RPF.
- Comparison: CRAG reduces Chamfer Distance by 91.4% compared to the image-conditioned Assembler on PartNeXt.
Qualitative Results:
- CRAG produces coherent assemblies where baselines often result in floating or tilted parts.
- In missing-part scenarios, CRAG successfully "hallucinates" missing geometry to create a complete, structurally sound object.
- Real-World Validation: Tested on the FRACTURA dataset (real scanned bone fragments), demonstrating robustness in real-world noise and erosion.
Ablation Studies:
- Using the shared TripoSG VAE (vs. task-specific encoders) significantly improves performance.
- The coupled Assembly+Generation model outperforms Assembly-only models even when images are provided, which supports the value of the holistic prior.
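For readers unfamiliar with the metrics named above, the following sketch shows one common way to compute them. The exact definitions behind the reported numbers may differ (for example, benchmarks often report RMSE variants of rotation and translation error, and Chamfer Distance may be scaled or unsquared), so treat the formulas, the function names, and in particular the `threshold=0.01` default for Part Accuracy as assumptions rather than the paper's evaluation protocol.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point sets a (N, 3) and b (M, 3).

    Uses mean squared nearest-neighbour distance in both directions.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (N, M) pairwise
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle in degrees between two rotation matrices."""
    cos_theta = np.clip((np.trace(R_pred.T @ R_gt) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_theta)))

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return float(np.linalg.norm(t_pred - t_gt))

def part_accuracy(pred_parts, gt_parts, threshold=0.01):
    """Fraction of parts whose per-part Chamfer Distance is below a threshold.

    pred_parts / gt_parts: lists of (N_i, 3) arrays, each part already
    transformed by its predicted / ground-truth pose.
    """
    hits = [chamfer_distance(p, g) < threshold for p, g in zip(pred_parts, gt_parts)]
    return float(np.mean(hits))

# Usage: one correctly placed part and one part shifted by a unit offset.
part = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pa = part_accuracy([part, part + 1.0], [part, part])  # 0.5
```

Lower is better for RE, TE, and CD; higher is better for PA. Because PA thresholds a per-part distance, a single badly placed fragment lowers PA without necessarily changing the whole-object CD by much, which is one reason the results report both.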

5. Significance and Impact

Scientific & Archaeological: Enables the reconstruction of fragmented artifacts, fossils, and bones where pieces are missing or eroded, facilitating morphometric analysis.
Medical: Supports preoperative planning and surgical guidance by reconstructing multi-fragment fractures from CT scans, even when data is incomplete.
Robotics: Enhances robot manipulation in unstructured environments by reasoning about spatial relations under occlusion and ambiguity.
Theoretical Shift: Moves the field from "rearranging observed points" to "reasoning about the whole to guide the parts," bridging the gap between discriminative assembly and generative modeling.

6. Limitations

Data Bias: Performance relies on the distribution of training data (e.g., PartNeXt over-represents canonical structures), limiting generalization to long-tail categories.
Metric Limitations: Current metrics (CD, PA) measure geometric accuracy but may miss semantic correctness (e.g., swapping identical parts).
Failure Modes: Struggles with extremely thin shell fragments (weak constraints) or slender components where TSDF encoding merges surfaces.
Control: Currently relies on image conditioning; future work aims to support sketches and language for more hypothesis-driven reconstruction.