ScanDP: Generalizable 3D Scanning with Diffusion Policy

Imagine you have a precious, intricate statue in your living room, and you want to create a perfect digital 3D copy of it.

The Old Way (The Clumsy Tourist):
Usually, if you tried to do this yourself, you'd walk around the statue, taking photos from every angle. But humans make mistakes. We might get tired, miss a spot behind the statue's ear, or accidentally bump into it. If we use a robot to do this, older robots are like tourists with a strict, rigid checklist. They might walk in a perfect zigzag pattern, but if the statue has a weird shape, the robot gets confused, bumps into things, or keeps staring at the same spot over and over again, wasting time.

The New Way (The Intuitive Artist):
This paper introduces ScanDP, a new robot brain that learns to scan objects like a skilled human artist, but with the precision of a machine. Here is how it works, broken down into simple concepts:

1. The "Mental Map" vs. The "Pixel Photo"

Most robots try to scan by looking at a raw 3D cloud of dots (a point cloud). It's like trying to understand a forest by looking at a million individual leaves scattered on the ground. If a leaf is missing or blurry, the robot gets confused.

ScanDP is different. Instead of looking at individual dots, it builds an Occupancy Grid Map (OGM).

The Analogy: Imagine a giant 3D checkerboard surrounding the object. Instead of seeing "dots," the robot sees the probability of something being in each square. Is a square definitely empty? Is it definitely full? Or is it a "maybe"?
Why it helps: This is like the robot having a "mental map" of the room. Even if the camera gets a little blurry (noise) or the lighting changes, the robot remembers, "I know there's a wall there from three seconds ago." Because each new reading only nudges the map rather than overwriting it, a single bad reading does little damage.

2. The "Diffusion" Brain (Learning by Un-Blurring)

The robot learns using something called a Diffusion Policy.

The Analogy: Think of a photo that has been covered in static noise (like an old TV). A diffusion model is like a smart artist who knows how to take that noisy, messy picture and slowly "denoise" it until a clear image appears.
How it applies here: The robot starts with a random, messy idea of where to move next. Then, it uses its training to "clean up" that idea, step-by-step, until it finds the perfect, smooth path to the next best angle. It learns this by watching a human scan a simple object (like a bunny) just five times. That's it! It doesn't need thousands of hours of data.

3. The "Bubble" Safety Check

One of the biggest risks in 3D scanning is the robot crashing into the object it's trying to scan.

The Analogy: Imagine the robot is holding a fragile vase. As it moves, it surrounds itself with an invisible, inflatable bubble.
How it works: Before the robot takes a step, it checks its "Mental Map" (the OGM). If the bubble touches a square marked as "Occupied" (an obstacle), the robot knows, "Whoa, too close!" It instantly adjusts its path to stay safe. This ensures the robot never bumps into the object, even if the object has weird, hidden shapes.

4. The "Smoothie" Path (Optimization)

Sometimes, the robot's "intuition" might suggest a path that is safe but a bit jerky or redundant (going back and forth).

The Analogy: Imagine a hiker who takes a safe route but keeps doubling back. A path optimizer is like a GPS that says, "Hey, you can cut through this field to get there faster and smoother."
The Result: ScanDP takes the robot's suggested path and smooths it out, removing unnecessary wiggles. This means the robot scans the object faster and with less movement, saving battery and time.

The Big Results

The researchers tested this new robot brain on objects it had never seen before (like an armadillo, a dragon, or the "Spot" model) and even objects that were much bigger or smaller than what it was trained on.

Coverage: It saw almost 100% of the object, whereas other robots missed hidden spots.
Efficiency: It took a much shorter path to get the job done.
Robustness: Even when they added "noise" (simulating a dirty camera lens or bad lighting), ScanDP still covered most of the object (about 89% in the reported test), while the point-cloud-based robot dropped to about 74%.

In a nutshell:
ScanDP is like giving a robot a human-like intuition for scanning, combined with a safety bubble and a smart GPS. It learns quickly, steers clear of collisions, and gets the job done efficiently, even on objects it has never met before. It turns the chaotic task of 3D scanning into a smooth, reliable dance.

Here is a detailed technical summary of the paper "ScanDP: Generalizable 3D Scanning with Diffusion Policy."

1. Problem Statement

3D scanning is critical for robotics, autonomous driving, and digital archiving, but manual scanning is time-consuming and error-prone. While automated approaches exist, they face significant limitations:

Rule-based methods (e.g., frontier-based exploration) rely on heuristics and struggle in complex environments or with unseen object geometries.
Reinforcement Learning (RL) methods require massive training datasets and extensive reward engineering, often failing to generalize to new object categories.
Existing Imitation Learning (IL) methods (including standard Diffusion Policies) often rely on point clouds or RGB images. These inputs can be noisy, lack explicit spatial uncertainty modeling, and lead to suboptimal, redundant, or collision-prone paths when applied to unseen objects.

The core challenge is to develop a 3D scanning policy that is data-efficient, generalizable to unseen objects (varying in shape and scale), robust to sensor noise, and capable of generating safe, efficient paths.

2. Methodology: ScanDP

The authors propose ScanDP, a framework combining Diffusion Policy with Occupancy Grid Mapping (OGM) and Path Optimization. The system has three main components:

A. Observation Representation: Occupancy Grid Map (OGM)

Instead of using raw point clouds or images (common in prior works like 3D Diffusion Policy), ScanDP uses an OGM to represent the environment.

Integration: The OGM ($O_t$) is updated incrementally using a Bayesian update rule based on depth maps ($D_t$) and camera poses ($x_t$).
Probabilistic Encoding: Unlike traditional OGMs that classify cells as binary (Free/Occupied), ScanDP retains raw occupancy probabilities ($p$). This allows the policy to learn the uncertainty of the environment, not just the geometry.
Feature Extraction: A Sparse Convolution encoder processes the OGM to extract features ($e_{ogm}$). Sparse convolutions handle mostly empty grids more efficiently than dense 3D convolutions.
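
The incremental Bayesian update described above is commonly implemented in log-odds form, where each observation adds or subtracts a fixed amount of evidence per cell. The following is a minimal sketch of that idea, not the authors' implementation: the class name OccupancyGrid, the sensor-model values p_hit and p_miss, and the clamp are illustrative assumptions, and the ray casting that would turn a depth map and camera pose into hit and free cells is left out.

```python
import numpy as np


def logit(p):
    """Convert a probability to log-odds."""
    return np.log(p / (1.0 - p))


class OccupancyGrid:
    """Log-odds occupancy grid (hypothetical helper, values are assumptions)."""

    def __init__(self, shape, p_hit=0.7, p_miss=0.4, clamp=10.0):
        # Log-odds 0 corresponds to probability 0.5, i.e. "unknown".
        self.log_odds = np.zeros(shape, dtype=np.float64)
        self.l_hit = logit(p_hit)    # positive evidence: cell observed occupied
        self.l_miss = logit(p_miss)  # negative evidence: cell observed free
        self.clamp = clamp           # keeps cells from saturating permanently

    def update(self, hit_cells, free_cells):
        """Bayesian update: independent observations add in log-odds space."""
        for idx in hit_cells:
            self.log_odds[idx] += self.l_hit
        for idx in free_cells:
            self.log_odds[idx] += self.l_miss
        np.clip(self.log_odds, -self.clamp, self.clamp, out=self.log_odds)

    def probabilities(self):
        """Raw occupancy probabilities, kept continuous instead of thresholded."""
        return 1.0 / (1.0 + np.exp(-self.log_odds))
```

In this sketch a cell that is seen occupied three times and free once still ends up with a high occupancy probability, which illustrates how repeated updates can average out an occasional noisy reading.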

B. Path Generation: Diffusion Policy

The core policy is a Conditional Diffusion Model (Denoising Diffusion Probabilistic Model - DDPM).

Input: The model conditions on the concatenated features of the OGM ($e_{ogm}$) and the history of camera poses ($e_{cam}$).
Output: It generates a sequence of future camera poses (actions) $a_{t:t+N-1}$.
Training: The model is trained via imitation learning on a small dataset of human scanning demonstrations (only 5 trajectories of a Stanford Bunny). It learns to denoise random actions into expert-like scanning trajectories.
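
The denoising step can be illustrated with a standard DDPM reverse-sampling loop over an action sequence. This is a sketch under stated assumptions rather than the paper's code: the linear noise schedule, the step count, and all function names are illustrative, and the trained noise-prediction network is replaced by a hypothetical "oracle" that knows a single expert trajectory, so the example runs without any training. In a real system the cond argument would carry the concatenated features $e_{ogm}$ and $e_{cam}$; the oracle ignores it.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed, commonly used choice)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def sample_actions(eps_model, cond, horizon, action_dim, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise it into an action sequence."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    a = rng.standard_normal((horizon, action_dim))  # random, "messy" actions
    for k in reversed(range(num_steps)):
        eps = eps_model(a, k, cond)  # predicted noise at step k
        # Posterior mean of the DDPM reverse step.
        mean = (a - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:
            a = mean + np.sqrt(betas[k]) * rng.standard_normal(a.shape)
        else:
            a = mean  # no noise is added at the final step
    return a


def make_oracle_eps_model(expert_actions, num_steps=50):
    """Stand-in for a trained network: exact noise when the data is one trajectory."""
    _, _, alpha_bars = make_schedule(num_steps)

    def eps_model(a, k, cond):
        # cond is unused here; a learned model would condition on it.
        return (a - np.sqrt(alpha_bars[k]) * expert_actions) / np.sqrt(1.0 - alpha_bars[k])

    return eps_model
```

With the oracle in place of a learned model, sampling recovers the expert trajectory, which makes the loop easy to sanity-check before swapping in a real network.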

C. Path Optimization

To ensure safety and efficiency, the raw output from the Diffusion Policy undergoes two refinement steps:

Bubble-based Collision Filter: The system verifies that the camera path is collision-free by checking whether a "bubble" (sphere) centered on the camera fits without intersecting occupied cells in the OGM. Only viewpoints whose collision-free bubble radius satisfies $r \ge r_{min}$ are kept.
Viewpoint Extraction (Trajectory Optimization): Using dynamic programming, the system minimizes the number of waypoints in the trajectory while ensuring the reconstructed path stays within a distance threshold ($\eta$) of the original safe path. This removes redundant movements and smooths the trajectory.
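
Both refinement steps can be sketched in a few lines. The code below is one plausible reading of the description, not the authors' implementation, and every name in it is hypothetical. It assumes that a cell counts as occupied when its probability exceeds 0.5, that the bubble radius is the distance from the viewpoint to the nearest occupied cell center, and that path deviation is measured from each skipped waypoint to the straight segment that replaces it. The shortcut segments are not re-checked for collisions here.

```python
import numpy as np


def bubble_radius(point, occ_prob, voxel_size, origin, occ_thresh=0.5):
    """Radius of the largest sphere at `point` that touches no occupied cell center."""
    occupied = np.argwhere(occ_prob > occ_thresh)
    if len(occupied) == 0:
        return np.inf
    centers = np.asarray(origin, dtype=float) + (occupied + 0.5) * voxel_size
    return float(np.min(np.linalg.norm(centers - np.asarray(point, dtype=float), axis=1)))


def bubble_filter(path, occ_prob, voxel_size, origin, r_min):
    """Keep only viewpoints whose bubble radius r satisfies r >= r_min."""
    return [p for p in path if bubble_radius(p, occ_prob, voxel_size, origin) >= r_min]


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    ab = b - a
    denom = float(ab @ ab)
    t = 0.0 if denom == 0.0 else float(np.clip((p - a) @ ab / denom, 0.0, 1.0))
    return float(np.linalg.norm(p - (a + t * ab)))


def min_waypoints(path, eta):
    """Dynamic program: fewest waypoints such that skipped points stay within eta."""
    path = np.asarray(path, dtype=float)
    n = len(path)
    if n <= 2:
        return list(range(n))
    cost = [np.inf] * n  # cost[j] = fewest waypoints on a valid path ending at j
    prev = [-1] * n
    cost[0] = 1
    for j in range(1, n):
        for i in range(j):
            within = all(
                point_segment_distance(path[k], path[i], path[j]) <= eta
                for k in range(i + 1, j)
            )
            if within and cost[i] + 1 < cost[j]:
                cost[j], prev[j] = cost[i] + 1, i
    keep, j = [], n - 1
    while j != -1:  # walk the back-pointers from the last waypoint
        keep.append(j)
        j = prev[j]
    return keep[::-1]
```

min_waypoints returns indices into the filtered path, so the first and last viewpoints are always kept. The triple loop is cubic in the path length, which is acceptable for the short trajectories in this sketch but would need a faster validity check for long ones.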

3. Key Contributions

High Generalizability: The method achieves high scanning coverage on completely unseen object categories (e.g., Armadillo, Dragon, Spot) and scales (1.0x to 1.5x) using training data from only a single object (Stanford Bunny).
Data Efficiency: By leveraging Diffusion Policy and OGM, the system requires minimal expert demonstrations (5 trajectories) to learn effective strategies, overcoming the data hunger of RL.
Robustness to Noise: The use of OGM with Bayesian updates allows the system to average out sensor noise, making it significantly more robust than point-cloud-based methods in noisy conditions.
Safety and Efficiency: The hybrid approach of Diffusion Policy + Bubble filtering + Path optimization ensures collision-free, smooth, and shorter paths compared to baselines.

4. Experimental Results

Simulation Experiments

Coverage: ScanDP consistently achieved the highest coverage (e.g., 97.84% on the Bunny, 99% on the Dragon) compared to baselines like Random Hemisphere, standard Diffusion Policy (DP), and 3D Diffusion Policy (DP3).
Generalization: While DP3 performance dropped significantly on unseen objects (e.g., dropping to ~73% coverage on scaled objects), ScanDP maintained high coverage (87–97%).
Path Efficiency: ScanDP achieved high coverage with significantly shorter path lengths. Path optimization reduced total travel distance by an average of 32% compared to unoptimized versions.
Noise Robustness: Under Gaussian noise ($\sigma=0.1$) in depth inputs, ScanDP maintained 88.91% coverage, whereas DP3 dropped to 74.20%.
Field of View (FoV) Generalization: ScanDP performed stably across different camera FoVs (L515, D435, D415), while DP3 showed lower and more variable performance.

Real-World Experiments

Setup: A 6-DoF manipulator with a turntable and an Intel RealSense L515 sensor.
Results: ScanDP achieved 95% coverage with low variance, significantly outperforming DP3 (33% coverage). The system successfully handled real-world sensor noise and partial occlusions, demonstrating stable operation.

5. Significance and Conclusion

ScanDP represents a significant advancement in autonomous 3D scanning by shifting the observation space from raw point clouds to probabilistic Occupancy Grid Maps. This change enables the model to:

Reason about uncertainty, leading to better exploration of occluded areas.
Generalize effectively to new shapes and scales without retraining.
Operate safely in real-world environments through explicit collision filtering.

The work demonstrates that combining generative models (Diffusion Policy) with classical probabilistic mapping (OGM) and optimization techniques creates a robust, data-efficient solution for complex robotic perception tasks. Future work aims to extend this to large-scale environments and multi-object scanning.