LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving

DriveMVS is a novel multi-view stereo framework for autonomous driving that leverages sparse LiDAR observations as geometric prompts and employs a spatio-temporal decoder to achieve state-of-the-art metric accuracy, temporal consistency, and cross-domain generalization.

Qihao Sun, Jiarun Liu, Ziqian Ni, Jianyun Xu, Tao Xie, Lijun Zhao, Ruifeng Li, Sheng Yang

Published 2026-03-05

Imagine you are driving a self-driving car. To navigate safely, the car needs to know exactly how far away everything is—the stop sign, the pedestrian, the car in front. This is called depth estimation.

For a long time, computers have struggled to get this right. They either guess the distance (and get the scale wrong, thinking a toy car is a real one) or they get confused when the car stops moving or the road is featureless (like a long, empty highway).

The paper introduces a new system called DriveMVS. Think of it as giving the self-driving car a "superpower" to see the world in reliable 3D, even in tricky situations. Here is how it works, explained with simple analogies:

1. The Problem: The "Guessing Game"

Current methods are like a person trying to guess the distance to a mountain while wearing foggy glasses.

  • Monocular AI (Single Camera): It's like looking at a painting. It can tell you what things look like, but it's bad at knowing exactly how far they are. It might think a small car is a giant truck because it lacks a reference point.
  • Standard 3D Vision (Multi-View Stereo): It compares views from different positions to judge depth, like our two eyes do. But when you drive straight down a highway, the views of the road ahead barely differ between frames (low "parallax"), so there is almost nothing to triangulate. The estimate becomes unstable and the depth map starts to flicker.
  • LiDAR (The Laser Scanner): This is the car's "laser eyes." It gives perfect distance measurements, but it's sparse. It's like a net with huge holes in it; it catches the big objects but misses the details in between. Also, sometimes the net gets blocked by rain or dirt.
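The parallax problem above can be made concrete with the standard triangulation formula, depth = focal length × baseline / disparity (the camera numbers below are hypothetical, chosen only for illustration):

```python
# Illustration (not from the paper): why low parallax makes
# triangulated depth unstable. With depth = f * baseline / disparity,
# the same small matching error in pixels causes a much larger depth
# error when the true disparity is small (distant points, or views
# that barely differ).

f = 1000.0      # focal length in pixels (hypothetical camera)
baseline = 0.5  # effective baseline in meters between the two views

def depth(disparity_px):
    return f * baseline / disparity_px

# A point 50 m away corresponds to a disparity of f*b/50 = 10 px.
true_disp = f * baseline / 50.0
print(depth(true_disp), depth(true_disp + 1.0))  # 50.0 vs ~45.5 m

# The same 1 px error at 5 px disparity (a 100 m point) is far worse:
print(depth(5.0), depth(5.0 + 1.0))              # 100.0 vs ~83.3 m
```

The error grows roughly with the square of the distance, which is why featureless highways and distant road surfaces are exactly where pure multi-view stereo breaks down.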

2. The Solution: DriveMVS (The "Smart Detective")

DriveMVS is a new framework that combines the best of all worlds. It uses three main tricks to solve the problem:

A. The "Anchor" (LiDAR Prompts)

Imagine you are trying to draw a map of a city, but you only have a few GPS coordinates from a friend.

  • Old way: You try to guess the rest of the map based on the drawing style. You might get the shape right, but the scale is wrong (your city is too big or too small).
  • DriveMVS way: It takes those few GPS coordinates (the LiDAR data) and uses them as anchors. It says, "Okay, this specific point is definitely 50 meters away." It locks the entire map to that real-world scale. Even if the rest of the map is fuzzy, the scale is now anchored to real measurements.
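A minimal sketch of the anchoring idea (this is a classic least-squares scale fit, not DriveMVS's actual prompting module, and all the numbers are made up): a handful of sparse LiDAR returns is enough to recover the one scale factor that turns a relative depth map into metric depth.

```python
# Hypothetical sketch: lock a relative depth map to metric scale
# using a few sparse LiDAR "anchor" points. The least-squares fit
# below finds the single scale s minimizing sum((s*pred - lidar)^2)
# over pixels where both a prediction and a LiDAR return exist.

def metric_scale(pred_depths, lidar_depths):
    num = sum(p * l for p, l in zip(pred_depths, lidar_depths))
    den = sum(p * p for p in pred_depths)
    return num / den

# Suppose the network's relative depths are roughly 2x too small:
pred = [5.0, 10.0, 20.0]        # unitless relative depths
lidar = [10.1, 19.8, 40.2]      # sparse but metrically correct (m)
s = metric_scale(pred, lidar)
scaled = [s * p for p in pred]  # the whole map is now in real meters
print(s, scaled)
```

The key point: the LiDAR does not need to cover every pixel; a few trustworthy anchors propagate metric scale to the entire dense map.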

B. The "Triple-Threat Team" (Triple-Cues Combiner)

To fill in the gaps between the GPS points, DriveMVS doesn't just rely on one source of information. It hires a team of three experts who talk to each other:

  1. The Geometer: Looks at the geometry from multiple camera angles (Multi-View Stereo).
  2. The Artist: Looks at the picture and understands the scene's structure and context (Monocular AI).
  3. The Measurer: Looks at the sparse LiDAR data for hard, factual distance numbers.

Instead of letting them argue, DriveMVS uses a special "translator" (a Transformer) to blend their opinions. If the Geometer is confused because the road is empty, the Measurer steps in with a hard fact. If the Measurer has a blind spot, the Artist fills in the gap based on what a road usually looks like.
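The actual combiner is a learned Transformer; a hand-set confidence-weighted average is a much simpler stand-in, but it shows the same behavior (all depths and confidences below are invented for illustration):

```python
# Hypothetical stand-in for the Triple-Cues Combiner. Each "expert"
# reports a depth and a confidence for one pixel; when one expert is
# unreliable its weight drops and the others take over. DriveMVS
# learns this blending with a Transformer instead of fixed weights.

def fuse(cues):
    """cues: list of (depth_m, confidence) pairs for one pixel."""
    total = sum(c for _, c in cues)
    return sum(d * c for d, c in cues) / total

# Textured pixel: multi-view geometry is confident and dominates.
#        (geometer, artist, measurer)
print(fuse([(30.2, 0.9), (28.0, 0.3), (30.0, 0.8)]))

# Textureless road pixel: geometry is lost (tiny confidence), so the
# monocular prior and the LiDAR measurement carry the decision.
print(fuse([(55.0, 0.05), (31.0, 0.5), (30.5, 0.9)]))
```

Even with the geometer wildly wrong in the second case (55 m vs. ~30 m), the fused estimate stays close to the reliable cues, which is exactly the failure mode the combiner is designed to survive.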

C. The "Time-Traveler" (Spatio-Temporal Decoder)

Self-driving cars move, so the view changes every second.

  • Old way: The car looks at the road, calculates the distance, then looks again a split-second later and calculates it again. Sometimes the numbers jump around, making the car's "vision" flicker like a bad video.
  • DriveMVS way: It remembers the past. It looks at the current frame and the previous frames together. It understands that the car is moving, so it uses that motion to smooth out the depth map. It's like watching a movie instead of a slideshow; the depth feels continuous and stable, not jittery.
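As a simplified stand-in for the spatio-temporal decoder, an exponential moving average over per-frame estimates shows the effect being described (the real decoder is learned and motion-aware; this sketch and its numbers are purely illustrative):

```python
# Simplified stand-in for temporal fusion: blend each new frame's
# depth with the running estimate. Independent per-frame jitter is
# damped into a stable sequence, which is the "movie instead of a
# slideshow" effect described above.

def smooth(depth_sequence, alpha=0.3):
    """alpha: weight on the newest frame (hypothetical value)."""
    out = [depth_sequence[0]]
    for d in depth_sequence[1:]:
        out.append(alpha * d + (1 - alpha) * out[-1])
    return out

# Per-frame estimates of the same static point jitter around 40 m:
raw = [40.0, 43.0, 38.0, 41.5, 39.0]
print(smooth(raw))  # values stay within ~1 m of 40 instead of ~3 m
```

A fixed blend like this would lag behind genuinely moving objects, which is why the paper's decoder conditions on the car's motion rather than averaging blindly.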

3. Why This Matters

The authors tested DriveMVS on real-world driving datasets (like KITTI and Waymo) and found that it outperforms previous methods, setting a new state of the art.

  • It's accurate: It knows the exact distance in meters, not just "close" or "far."
  • It's stable: The 3D view doesn't flicker when the car stops or drives straight.
  • It's tough: It works even when the LiDAR is blocked, when it's raining, or when the road has no texture.

The Bottom Line

DriveMVS is like giving a self-driving car a 3D vision system that never gets dizzy. It combines the "hard facts" from laser scanners with the "intuition" of AI, and it remembers what it saw a second ago to keep the picture smooth. This makes self-driving cars safer, more reliable, and ready for the real world, where conditions are rarely perfect.