Point-based Instance Completion with Scene Constraints

Imagine you are a robot entering a messy room. Your eyes (sensors) can only see the front of a chair, the side of a table, and maybe a bit of a lamp. The rest of each object is hidden behind other things or out of your view. To do your job, whether that is picking up a cup or navigating around furniture, you need to know what the entire object looks like, not just the part you can see.

This paper presents a new "brain" for robots that fills in the missing 3D parts of objects in a room, while being smart enough to keep the filled-in shapes from passing through anything else in the scene.

Here is the breakdown using simple analogies:

1. The Problem: The "Magic Box" vs. The Real Room

Previous AI models were like magic box fillers. If you showed them a broken toy car, they would magically know how to fix it, but only if you held the car in a perfect, standardized pose (like a toy on a shelf).

The Flaw: In a real room, objects are tilted, rotated, and sitting on floors. If you give these old models a real-world chair that's leaning sideways, they get confused. They also don't know about the rest of the room. They might try to "grow" the missing part of the chair through a wall or inside another table because they don't understand the rules of the room.

2. The Solution: The "Architect with a Blueprint"

The authors (Wesley Khademi and Li Fuxin) built a new system that acts like a skilled architect who can look at a half-built house and finish it, even if the house is tilted or surrounded by other buildings.

They did this with three main tricks:

A. The "Center-First" Strategy (No More Magic Boxes)

Instead of trying to guess the whole shape at once, the AI first asks: "Where is the center of this object?"

Analogy: Imagine trying to draw a perfect circle. It's hard if you don't know where the center is. But once you fix a pin at the center and swing a pencil around it on a string, drawing the circle becomes easy.
How it works: The AI predicts the center of the object first, then builds the rest of the shape as "offsets" (displacements) from that center. This allows it to handle objects at any position or size without getting confused.
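A minimal NumPy sketch of the center-plus-offsets idea follows. The function name `seeds_from_center`, the example centers, and the offsets are illustrative assumptions; in the actual model both quantities would be predicted by learned layers.

```python
import numpy as np

def seeds_from_center(center, offsets):
    """Place seed points at center + offset (both stand in for network outputs)."""
    center = np.asarray(center, dtype=float)      # shape (3,)
    offsets = np.asarray(offsets, dtype=float)    # shape (K, 3)
    return center[None, :] + offsets

# Hypothetical predictions for one object: four seeds around its center.
offsets = np.array([[0.2, 0.0, 0.0],
                    [-0.2, 0.0, 0.0],
                    [0.0, 0.3, 0.0],
                    [0.0, -0.3, 0.0]])
seeds_a = seeds_from_center([1.0, 2.0, 0.5], offsets)

# Moving the object elsewhere in the room changes only the center;
# the offsets, which describe the shape, stay the same.
seeds_b = seeds_from_center([4.0, -1.0, 0.5], offsets)
shift = seeds_b - seeds_a   # every row equals the difference of the two centers
```

The point of the split is that the shape description (the offsets) never has to encode where in the room the object sits.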

B. The "Ghost Walls" (Scene Constraints)

This is the paper's biggest innovation. The AI doesn't just look at the object; it looks at the whole room.

Analogy: Imagine you are sculpting clay in a room full of other sculptures. You wouldn't push your clay into someone else's sculpture, right? You need to know where the "free space" is and where the "occupied space" is.
How it works: The AI creates a sparse map of "Ghost Walls." It marks areas it knows are empty air (free space) and areas hidden behind the surfaces it can see (occluded space). When it fills in the missing parts of the chair, it checks these ghost walls so that new geometry stays out of known empty air and the new chair legs don't grow through the table or into the wall.
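One way such constraint points could be sampled is sketched below, assuming a known sensor position and stepping a small distance along each viewing ray. The function name `sample_constraint_points`, the step size `eps`, and the ray-based construction are assumptions made for illustration, not the paper's exact procedure.

```python
import numpy as np

def sample_constraint_points(surface_pts, sensor_pos, eps=0.05):
    """Return points just in front of (free) and just behind (occluded) each surface point."""
    surface_pts = np.asarray(surface_pts, dtype=float)   # (N, 3) observed surface
    sensor_pos = np.asarray(sensor_pos, dtype=float)     # (3,) assumed sensor location
    rays = surface_pts - sensor_pos
    dirs = rays / np.linalg.norm(rays, axis=1, keepdims=True)  # unit viewing directions
    free_pts = surface_pts - eps * dirs       # slightly toward the sensor: seen to be empty
    occluded_pts = surface_pts + eps * dirs   # slightly past the surface: hidden from view
    return free_pts, occluded_pts

# Three observed points on a surface two metres in front of a sensor at the origin.
surface = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0], [0.0, 0.5, 2.0]])
free_pts, occluded_pts = sample_constraint_points(surface, sensor_pos=[0.0, 0.0, 0.0])
```

A completed shape should stay out of the free-space points, while the occluded points mark where unseen geometry is allowed to exist.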

C. The "Watertight" Dataset (The Training Ground)

To teach their AI, they needed a perfect practice ground. Existing datasets were like badly edited photos where the background didn't match the foreground, or where objects were already crashing into each other.

The Fix: They built a new dataset called ScanWCF (Scan Watertight and Collision-Free).
Analogy: Think of it like a video game level where the physics engine is perfect. Every wall is sealed (watertight), and no two objects are overlapping. This allows the AI to learn the true rules of 3D space without learning from bad examples.

3. The Results: A Better Robot

When they tested this new system against the old ones:

Old AI: Often made "ghost" objects that floated in the air or grew legs that went straight through the floor.
New AI: Created realistic objects that fit perfectly into the scene. If a chair leg was missing, it grew it in the right spot, stopped exactly where the floor was, and didn't crash into the table next to it.

Summary

Think of this paper as teaching a robot to be a good neighbor.

Old robot: "I see a chair! I will make a chair!" (And accidentally put the chair inside the wall).
New robot: "I see a chair. I know where the center is. I know the wall is over there. I know the table is next to it. I will finish the chair so it fits perfectly without bumping into anything."

This makes robots much safer and more useful in our actual, messy, real-world homes and offices.

Here is a detailed technical summary of the paper "Point-Based Instance Completion with Scene Constraints" (ICLR 2025).

1. Problem Statement

The paper addresses the challenge of Instance Scene Completion in indoor environments. While existing methods excel at completing isolated 3D objects (Object-Level Completion) or completing entire scenes at a voxel level (Semantic Scene Completion), few methods can complete specific object instances within a complex scene while respecting scene constraints.

Key limitations of current state-of-the-art (SOTA) approaches include:

Canonical Coordinate Assumption: Most point-based object completion methods assume inputs are normalized to a canonical pose (centered at origin, aligned axes, unit scale). This fails in real scenes where objects have arbitrary poses and scales.
Lack of Scene Context: Existing methods often ignore known scene constraints (e.g., other observed surfaces, free space, or occluded regions), leading to completions that collide with other objects or violate visibility constraints (e.g., generating geometry behind a visible wall).
Data Quality Issues: Existing datasets (Scan2CAD, ScanARCW) suffer from misalignment between partial scans and ground truth meshes or contain collisions in the ground truth, making reliable evaluation of plausibility difficult.
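To make the canonical coordinate assumption concrete, here is a small sketch of the usual centering and unit-scale normalization, and of why it breaks down on a partial scan. The helper name `normalize_to_canonical` and the toy square are illustrative assumptions; axis alignment is left out because it would require a known pose.

```python
import numpy as np

def normalize_to_canonical(points):
    """Center a point cloud at the origin and scale it to fit the unit sphere."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    centered = points - centroid
    scale = np.linalg.norm(centered, axis=1).max()
    return centered / scale, centroid, scale

# Toy "object": the four corners of a square somewhere in a room.
full = np.array([[5.0, 5.0, 0.0], [7.0, 5.0, 0.0], [7.0, 7.0, 0.0], [5.0, 7.0, 0.0]])
# Partial scan: only one edge of the square is visible.
partial = full[:2]

full_norm, full_centroid, full_scale = normalize_to_canonical(full)
part_norm, part_centroid, part_scale = normalize_to_canonical(partial)
# The centroid and scale estimated from the partial scan differ from those of the
# full object, so the partial input does not land in the same canonical frame.
```

In a real scene only the partial scan is available, which is why a method that depends on the full object's frame cannot be applied directly.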

2. Methodology

The authors propose a novel Point-Based Instance Completion Framework that operates directly on point clouds without requiring canonical alignment. The architecture consists of four main stages:

A. Instance Segmentation

The pipeline begins by using a state-of-the-art 3D instance segmentation model (Mask3D) to decompose the partial scene scan into individual object instances. Each instance is processed independently but with awareness of the global scene.

B. Partial Encoder

Input: Partial object point cloud ($P$) and estimated surface normals ($N$).
Architecture: Built from downsampling blocks, with the standard PointConv layers replaced by VI-PointConv (Viewpoint-Invariant PointConv).
Innovation: VI-PointConv uses a mix of non-invariant, scale-invariant, and rotation-invariant position embeddings. This allows the network to learn filter weights that are robust to arbitrary object poses and scales, eliminating the need for canonical normalization.
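The three kinds of position embedding can be illustrated with a small sketch. The specific features chosen here (raw relative offsets, offsets divided by the largest neighbor distance, and distance plus cosine to the surface normal) are assumptions for illustration; the exact VI-PointConv feature set is not given in this summary.

```python
import numpy as np

def position_embeddings(center, normal, neighbors):
    """Return (non-invariant, scale-invariant, rotation-invariant) neighbor features.

    Assumes every neighbor is distinct from the center point.
    """
    center = np.asarray(center, dtype=float)
    normal = np.asarray(normal, dtype=float)
    rel = np.asarray(neighbors, dtype=float) - center    # (K, 3) relative offsets
    dist = np.linalg.norm(rel, axis=1)                    # (K,)
    non_invariant = rel                                   # changes with scale and rotation
    scale_invariant = rel / dist.max()                    # unchanged by uniform scaling
    cos_to_normal = (rel @ normal) / (dist * np.linalg.norm(normal))
    rotation_invariant = np.stack([dist, cos_to_normal], axis=1)  # unchanged by rotation
    return non_invariant, scale_invariant, rotation_invariant

center = np.array([1.0, 2.0, 0.5])
normal = np.array([0.0, 0.0, 1.0])
neighbors = center + np.array([[0.1, 0.0, 0.0], [0.0, 0.2, 0.1], [-0.1, 0.1, 0.0]])
raw, scale_inv, rot_inv = position_embeddings(center, normal, neighbors)

# Enlarging the object 3x leaves the scale-invariant embedding unchanged.
_, scale_inv_big, _ = position_embeddings(3 * center, normal, 3 * neighbors)

# Rotating points and normal by 90 degrees about z leaves the rotation-invariant
# embedding unchanged, while the raw offsets change.
R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
raw_turned, _, rot_inv_turned = position_embeddings(center @ R.T, normal @ R.T, neighbors @ R.T)
```

Mixing the three lets a network keep pose-dependent cues where they help while still having features that do not change when the object is resized or turned.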

C. Seed Generator (Core Innovation)

Instead of directly regressing seed coordinates, the model predicts the object center and seed offsets.

Object Center Prediction: A learnable token is processed via transformer blocks to predict the 3D center of the object ($O$).
Seed Offset Prediction: The model generates "Patch Seeds" as offsets from the predicted center.
Scene-Aware Cross-Attention: To incorporate scene context, the model introduces a sparse set of scene constraints represented as point clouds:
- Free Space Points: Points just outside observed surfaces.
- Occluded Space Points: Points just inside observed surfaces (behind walls/objects).
- These constraints are integrated into the seed generator via Cross-Attention, allowing the object completion to "reason" about where it cannot place geometry (avoiding collisions) and where it must respect visibility.
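The cross-attention step can be sketched as single-head scaled dot-product attention in which seed features act as queries and constraint-point features act as keys and values. Everything concrete below (the function name `cross_attention`, the feature sizes, the random stand-in weights, and the residual connection) is an assumption for illustration; the actual model would use learned, likely multi-head layers.

```python
import numpy as np

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends over all context rows."""
    q = queries @ w_q                        # (K, d)
    k = context @ w_k                        # (M, d)
    v = context @ w_v                        # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (K, M) scaled dot products
    scores -= scores.max(axis=1, keepdims=True)   # subtract row max for a stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return queries + weights @ v, weights    # residual update of the seed features

rng = np.random.default_rng(0)
seed_feats = rng.normal(size=(8, 16))          # 8 hypothetical seed tokens
constraint_feats = rng.normal(size=(32, 16))   # 32 hypothetical free/occluded point features
w_q, w_k, w_v = (0.1 * rng.normal(size=(16, 16)) for _ in range(3))

updated_seeds, attn = cross_attention(seed_feats, constraint_feats, w_q, w_k, w_v)
```

Each row of `attn` shows how strongly one seed draws on each constraint point, which is the mechanism by which scene context can influence where geometry is placed.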

D. Coarse-to-Fine Decoder

Upsampling: Uses a hierarchical upsampling strategy (similar to SeedFormer) but augments it with Global Attention layers before local attention. This ensures global coherence and allows the model to infer missing geometry based on symmetries or distant parts of the object.
Mesh Reconstruction: A dedicated module predicts surface normals for the dense point cloud completion. These normals are used with NKSR (Neural Kernel Surface Reconstruction) to generate watertight meshes.

3. Key Contributions

Robust Point-Based Completion Model: A novel architecture that handles arbitrary object poses and scales without canonical alignment, utilizing VI-PointConv and a center-offset prediction strategy.
Scene Constraint Integration: The first point-based completion method to explicitly integrate sparse scene constraints (free/occluded space) via cross-attention, significantly reducing geometric collisions and improving plausibility.
ScanWCF Dataset: A new dataset for indoor instance scene completion containing:
- 1,202 scenes with aligned partial scans and ground truth meshes.
- Watertight and Collision-Free (WCF) ground truth, addressing the flaws in previous datasets (Scan2CAD, ScanARCW).
- Labeled instance information for training and evaluation.

4. Experimental Results

The method was evaluated on the new ScanWCF dataset against SOTA methods RfD-Net and DIMR.

Instance Scene Completion Quality:
- The proposed method achieved significantly higher mAP (Mean Average Precision) across all metrics (IoU, Chamfer Distance, Light Field Distance, Point Coverage Ratio).
- It showed much smaller performance drops when moving from easy to difficult thresholds, indicating superior ability to capture fine-grained geometric details.
Scene Plausibility (Collision Metrics):
- The method drastically reduced collisions. Compared to RfD-Net and DIMR, it reduced the percentage of points in collision (%COL) by roughly 2-3% and reduced the average collision distance (COL) by 3-4x.
- It successfully avoided generating geometry that penetrated other objects or violated free space constraints.
Partial Reconstruction Fidelity:
- The method maintained high fidelity to the observed partial scans (low One-Sided Chamfer Distance and Unidirectional Hausdorff Distance), ensuring it did not hallucinate geometry that contradicted the input.
Ablation Studies:
- Removing scene constraints led to a 29% relative increase in collision depth.
- Pre-training the object completion model on ShapeNet improved the ability to hallucinate missing structures (e.g., chair legs) when the partial input was highly occluded.
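The two fidelity metrics named above can be sketched as follows, assuming unsquared Euclidean distances and brute-force nearest-neighbor search (conventions vary between papers, and the helper names are illustrative). Both measure distance only from the observed partial scan to the completion, so extra completed geometry is not penalized.

```python
import numpy as np

def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbor in dst."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    diff = src[:, None, :] - dst[None, :, :]       # (N, M, 3) pairwise differences
    return np.linalg.norm(diff, axis=2).min(axis=1)

def one_sided_chamfer(partial, completion):
    """Mean nearest-neighbor distance from the partial scan to the completion."""
    return nearest_distances(partial, completion).mean()

def unidirectional_hausdorff(partial, completion):
    """Worst-case nearest-neighbor distance from the partial scan to the completion."""
    return nearest_distances(partial, completion).max()

partial = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
completion = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.1],
                       [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])

cd = one_sided_chamfer(partial, completion)           # approximately 0.05
uhd = unidirectional_hausdorff(partial, completion)   # approximately 0.1
```

Low values mean the completion stays close to every observed point, which is the sense in which the method preserves the input.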

5. Significance

This work represents a significant step forward in 3D scene understanding for robotics and AR/VR applications. By moving away from canonical coordinate assumptions and explicitly modeling scene constraints, the method enables robots to interact with objects in real-world environments more safely and accurately. The introduction of the ScanWCF dataset provides a reliable benchmark for future research, solving the evaluation reliability issues caused by misaligned or colliding ground truths in previous datasets. The ability to generate collision-free, watertight meshes directly from partial scans is crucial for downstream tasks like navigation, grasp planning, and physical simulation.