Imagine you are trying to draw a detailed 3D map of a room, but you only have a few scattered dots of light on the wall telling you where the furniture is. This is what computers face when they try to understand depth from a standard camera: they get a flat picture with just a few "sparse" depth clues, and they need to fill in the rest of the map.
This paper introduces a new tool called Marigold-SSD to solve this problem. Here is the breakdown of how it works, using some everyday analogies.
The Problem: The "Slow but Smart" vs. "Fast but Dumb" Dilemma
In the world of computer vision, there are two main types of tools for this job:
- The Discriminative Models (The Fast Workers): These are like a seasoned construction crew that can guess the shape of a room very quickly. They are fast, but if they encounter a room they've never seen before (like a weirdly shaped cave or a futuristic house), they often get confused and make mistakes.
- The Diffusion Models (The Slow Perfectionists): These are like a master artist who has seen millions of rooms in their life. They have an incredible "intuition" about how rooms should look. However, to draw the picture, they have to start with a blank canvas full of static noise and slowly, step-by-step, erase the noise to reveal the image. This takes a long time (like 50 to 100 steps). If you ask them to do this in real-time (like for a self-driving car), they are too slow.
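The noise-to-picture process the artist analogy describes can be sketched as a simple loop. This is a toy illustration of many-step refinement, not the actual Marigold network: the `target` value stands in for what the real model would predict from the image.

```python
import random

def toy_denoise_step(estimate, step, total_steps):
    # One toy "denoising" step: move the current estimate a
    # fraction of the way toward a fixed target, which stands in
    # for the network's learned prediction. Purely illustrative.
    target = 1.0
    return estimate + (target - estimate) / (total_steps - step)

def iterative_estimate(total_steps=50):
    # "Blank canvas full of static noise": start from random noise,
    # then refine over many small steps, mirroring the 50-100 step
    # sampling loop described above.
    estimate = random.gauss(0.0, 1.0)
    for step in range(total_steps):
        estimate = toy_denoise_step(estimate, step, total_steps)
    return estimate
```

The point of the sketch is structural: the answer only emerges after the full loop, which is why naive diffusion sampling is too slow for real-time use.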
The Solution: Marigold-SSD (The "One-Shot" Genius)
The authors wanted to keep the intuition of the master artist but get the speed of the construction crew.
The Old Way (Marigold-DC):
Previously, to use the "Master Artist" (a diffusion model) for this task, the computer had to run a "test-time optimization." Imagine asking the artist to sketch the room, then stop, check the few dots you gave them, erase the sketch, and redraw it. They have to do this 50 times for every single image to get it right. It's accurate, but it takes forever.
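That sketch-check-erase-redraw loop is a form of test-time guidance: at every step, the estimate is corrected toward the sparse measurements. A toy version, with the correction `strength` as a made-up stand-in for the real guidance rule:

```python
def guided_refinement(measurement, steps=50, strength=0.2):
    # Toy test-time guidance: at each of the 50 "denoising" steps,
    # pull the running estimate toward the sparse measurement.
    # Crucially, this entire loop reruns from scratch for every new
    # image, which is what makes the old approach slow.
    estimate = 0.0  # stand-in for the initial noisy guess
    for _ in range(steps):
        estimate += strength * (measurement - estimate)
    return estimate
```

Note that cutting the loop short leaves the estimate far from the measurement; the accuracy comes precisely from the many repetitions that cost so much time.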
The New Way (Marigold-SSD):
The authors realized they didn't need the artist to redraw the picture 50 times every time they saw a new room. Instead, they decided to train the artist once to be able to do it in one single step.
Think of it like this:
- The Old Way: You give the artist a puzzle, and they have to try 50 different solutions before finding the right one.
- The New Way: You spend a few days (4.5 GPU days) teaching the artist a special trick. Now, when you give them the puzzle, they look at the few dots and the picture, and instantly (in one step) produce the perfect solution.
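The old-versus-new contrast above comes down to how many network forward passes each image costs. A minimal sketch with a hypothetical `CountingNet` standing in for the depth model:

```python
class CountingNet:
    # Hypothetical stand-in for the depth network: it just halves
    # its input, but it counts forward passes, which is the
    # quantity that matters for speed.
    def __init__(self):
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return 0.5 * x

def multi_step_predict(net, x, steps=50):
    # The old way: one network pass per denoising step, per image.
    for _ in range(steps):
        x = net(x)
    return x

def one_step_predict(net, x):
    # The Marigold-SSD way: a single pass at inference time,
    # because the expensive work moved into training.
    return net(x)
```

Running both on the same input shows the tradeoff directly: the multi-step route makes 50 network calls per image, the one-step route exactly one.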
How It Works: The "Late Fusion" Trick
To make this "one-step" magic work, they had to change how the artist receives the instructions.
- Early Fusion (The Bad Idea): Imagine whispering your instructions to the artist before they even pick up their pencil. In model terms, the sparse depth is mixed into the input at the very start, where it interferes with the knowledge the pretrained network already has, and the artist gets confused.
- Late Fusion (The Marigold-SSD Way): The artist first uses their "intuition" to sketch the whole room from the photo alone. Then, at the very last moment, they look at your few dots of light and gently nudge the drawing to match reality. In model terms, the sparse measurements enter late in the pipeline, after the image has already been interpreted.
This "Late Fusion" is like a chef tasting a soup at the very end of cooking and adding a pinch of salt. It's much more effective than trying to guess the salt amount before you've even started cooking.
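One simple way to realize that final "pinch of salt" is to fit a scale and shift so the dense guess agrees with the sparse measurements. This is an illustrative stand-in for late fusion, not necessarily the paper's exact mechanism:

```python
def late_fuse(dense_guess, sparse_points):
    # Align the model's dense (relative) depth guess to the few
    # metric measurements via a least-squares scale and shift.
    # Illustrative stand-in for "nudging at the end"; the real
    # fusion happens inside the network.
    # sparse_points: list of (pixel_index, measured_depth).
    xs = [dense_guess[i] for i, _ in sparse_points]
    ys = [d for _, d in sparse_points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    scale = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(xs, ys)) / var_x
    shift = mean_y - scale * mean_x
    return [scale * d + shift for d in dense_guess]
```

The key property: the model's "intuition" (the shape of the depth map) is preserved, while the sparse dots pin it to the correct absolute values.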
Why This Matters
- Speed: The new method is 66 times faster than the previous "Master Artist" method. It's now fast enough to be used in real-time applications like self-driving cars or robots.
- Smarts: Even though it's fast, it still keeps the "super-intuition" of the diffusion model. It works great on rooms it has never seen before (Zero-Shot), whereas the fast construction crews usually fail in new environments.
- Efficiency: They did all the hard work during the training phase (the "4.5 days" of teaching), so the actual usage is instant.
A Reality Check: The "Interpolation" Surprise
The authors also did a fun experiment. They asked: "What if we just connect the dots with a straight line (interpolation)?"
They found that if you have lots of dots (high density), a simple line-drawing trick works almost as well as the super-smart AI. But when you only have a few dots (low density), the simple trick fails miserably, and the AI (Marigold-SSD) shines. This proves that the AI is most valuable when the data is sparse and messy, which is exactly what happens in the real world.
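The density effect is easy to reproduce in one dimension: connect known samples with straight lines, and see what happens at a depth edge. A minimal sketch (not the paper's exact baseline, which works in 2D):

```python
def interpolate_scanline(samples, length):
    # Baseline: fill a 1-D depth scanline by connecting the known
    # sparse samples with straight lines. samples: {index: depth}.
    known = sorted(samples.items())
    out = [0.0] * length
    for (i0, d0), (i1, d1) in zip(known, known[1:]):
        for i in range(i0, i1 + 1):
            t = (i - i0) / (i1 - i0)
            out[i] = d0 + t * (d1 - d0)
    return out
```

With samples on both sides of a depth discontinuity (high density), the straight lines recover it exactly; with only the two endpoints (low density), the edge gets smeared across the whole span, which is where a learned prior earns its keep.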
Summary
Marigold-SSD is a breakthrough that teaches a super-smart, slow AI to think fast. It moves the heavy lifting to the training phase so that, in the real world, it can instantly turn a flat photo with a few depth clues into a perfect 3D map, making it ready for robots and cars to use right now.