MessyKitchens: Contact-rich object-level 3D scene reconstruction

This paper introduces the MessyKitchens dataset, featuring high-fidelity ground truth for cluttered real-world scenes with accurate object contacts, and proposes a Multi-Object Decoder (MOD) that extends single-object reconstruction methods to achieve physically plausible, contact-rich 3D scene reconstruction with state-of-the-art performance.

Junaid Ahmed Ansari, Ran Ding, Fabio Pizzati, Ivan Laptev

Published 2026-03-18

Imagine you are trying to build a 3D movie scene or program a robot to clean a kitchen. You take a single photo of a messy counter full of cups, bowls, and spoons piled up. The challenge? Getting a computer not just to see the photo, but to understand exactly where every single object is in 3D space, how the objects are touching, and to make sure they don't magically pass through each other like ghosts.

This paper, "MessyKitchens," tackles that exact problem with two main moves: creating a perfect "training ground" (a dataset) and inventing a smarter "brain" (an algorithm) to solve the puzzle.

Here is the breakdown in simple terms:

1. The Problem: The "Ghostly" Kitchen

Current AI is great at guessing the shape and depth of a single object in a photo. But when you have a whole room full of stuff, things get messy.

  • The Ghost Problem: Old AI models often make objects float or, worse, phase right through each other (like a cup sinking halfway into a table). This is bad for robots (they can't grab a cup that's half-inside a table) and bad for animation (it looks fake).
  • The Missing Map: To teach AI to fix this, researchers needed a "Gold Standard" map of a messy kitchen where they knew exactly where every object was and how they touched. Previous maps were either too clean (like a museum) or had too many errors (like a sketch).

2. The Solution Part A: The "MessyKitchens" Dataset

The authors built a new, super-accurate dataset called MessyKitchens. Think of this as the "Olympic Training Camp" for 3D vision.

  • How they made it: They didn't just use a computer to fake it. They went into real kitchens, scanned 130 different kitchen items (cups, bowls, etc.) with a high-tech laser scanner, and then physically arranged them into 100 different messy piles.
  • The "Magic" Trick: To get perfect 3D models of the objects, they scanned them from the top and the bottom while they were sitting on a clear piece of glass. This let them see the whole object without moving it, creating a perfect digital twin.
  • The Result: They have 100 scenes ranging from "Easy" (a few items spaced out) to "Hard" (items stacked, nested inside bowls, and touching everywhere). Crucially, they measured the "contact" between objects so precisely that the digital models don't have any "ghostly" overlaps. It's the most physically realistic messy kitchen dataset ever made.
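What does "no ghostly overlaps" mean concretely? Reconstruction papers typically score this with a penetration (interpenetration) measure: two objects in contact should have surfaces that touch with roughly zero overlap, while a "ghost" reconstruction has one object sunk inside another. The paper works with full 3D meshes; the sketch below shrinks the idea down to spheres purely for intuition (the function name and the sphere proxy are my illustration, not the authors' actual metric):

```python
import math

def sphere_penetration(c1, r1, c2, r2):
    """Penetration depth between two spheres.
    ~0  -> surfaces just touch (physically plausible contact)
    > 0 -> one object has sunk into the other (a 'ghost' overlap)
    < 0 -> objects are floating apart (no contact)"""
    dist = math.dist(c1, c2)  # distance between sphere centers
    return (r1 + r2) - dist

# A cup resting against a bowl: surfaces touch, depth is zero.
touching = sphere_penetration((0.0, 0.0, 0.0), 1.0, (2.0, 0.0, 0.0), 1.0)
print(touching)  # 0.0

# A bad reconstruction: the cup is sunk halfway into the bowl.
overlap = sphere_penetration((0.0, 0.0, 0.0), 1.0, (1.5, 0.0, 0.0), 1.0)
print(overlap)   # 0.5
```

Real pipelines compute the same signal with signed-distance queries between meshes, but the logic is identical: a good dataset (and a good reconstruction) keeps this value at or below zero for every pair of objects.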

3. The Solution Part B: The "Multi-Object Decoder" (MOD)

Having a perfect map is great, but you need a smart driver to read it. The authors took an existing AI model (called SAM 3D) that is good at guessing the shape of one object and gave it a new brain upgrade called MOD.

  • The Old Way (SAM 3D): Imagine a student looking at a pile of Legos. They look at the red brick, guess its shape, then look at the blue brick, guess its shape, and so on. They do this one by one, ignoring how the bricks are actually stacked. Sometimes, they guess the blue brick is floating or inside the red one.
  • The New Way (MOD): This new brain looks at the whole pile at once. It asks: "If the red brick is here, where must the blue brick be to balance on top of it?"
  • How it works: It uses a "Multi-Object Decoder." Think of it as a group of detectives talking to each other. Instead of solving the crime alone, they share clues. If one detective sees a cup, they tell the others, "Hey, there's a bowl right under it, so the cup can't be floating!" This forces the AI to fix the positions so everything sits naturally and touches correctly.
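The "detectives sharing clues" idea corresponds to cross-object attention: each object's representation is updated using information from every other object in the scene, so no object is decoded in isolation. The toy sketch below shows one such message-passing round in plain Python; the function names, the two-dimensional features, and the single-round setup are simplifications of mine, not the actual MOD architecture:

```python
import math

def softmax(xs):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def joint_refine(object_feats):
    """One round of cross-object attention: each object's feature
    vector becomes an attention-weighted blend of ALL objects'
    features, so every prediction can depend on the whole pile."""
    refined = []
    for query in object_feats:
        # How similar is this object to each object in the scene?
        scores = [sum(q * k for q, k in zip(query, key))
                  for key in object_feats]
        weights = softmax(scores)
        # Mix in clues from every object, weighted by similarity.
        mixed = [sum(w * key[i] for w, key in zip(weights, object_feats))
                 for i in range(len(query))]
        refined.append(mixed)
    return refined

# Toy per-object features for three items on the counter.
cup, bowl, spoon = [1.0, 0.2], [0.9, 0.1], [0.0, 1.0]
print(joint_refine([cup, bowl, spoon]))
```

A single-object decoder would process `cup`, `bowl`, and `spoon` independently; here each refined vector is a convex combination of all three, which is the mechanism that lets the model reason "the cup can't float, there's a bowl under it."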

4. Why This Matters

The authors tested their new "brain" (MOD) on their new "training camp" (MessyKitchens) and other existing datasets.

  • The Results: The new method was significantly better. It reduced the "ghostly" overlaps (penetration) and made the 3D scenes look much more realistic.
  • The Analogy: If previous methods were like a child trying to build a tower by guessing where blocks go, this new method is like a master architect who understands gravity and physics.

Summary

In short, this paper says: "To teach computers to see messy 3D worlds, we need better training data (MessyKitchens) and a smarter way to think about how objects relate to each other (MOD)."

This is a huge step forward for:

  • Robots: So they can actually pick up a cup from a cluttered table without knocking everything over.
  • Animation & VR: So virtual worlds look and feel physically real, with objects resting naturally on top of each other.
  • Digital Twins: So we can create accurate 3D copies of real-world environments for inspection or design.

The authors have made their data and code public, so other researchers can now use this "Olympic Training Camp" to build even smarter robots and virtual worlds.
