LiDAR Prompted Spatio-Temporal Multi-View Stereo for Autonomous Driving

DriveMVS is a novel multi-view stereo framework for autonomous driving that leverages sparse LiDAR observations as geometric prompts and employs a spatio-temporal decoder to achieve state-of-the-art metric accuracy, temporal consistency, and cross-domain generalization.

Qihao Sun, Jiarun Liu, Ziqian Ni, Jianyun Xu, Tao Xie, Lijun Zhao, Ruifeng Li, Sheng Yang

Published 2026-03-05

Imagine you are driving a self-driving car. To navigate safely, the car needs to know exactly how far away everything is—the stop sign, the pedestrian, the car in front. This is called depth estimation.

For a long time, computers have struggled to get this right. They either guess the distance (and get the scale wrong, thinking a toy car is a real one) or they get confused when the car stops moving or the road is featureless (like a long, empty highway).

The paper introduces a new system called DriveMVS. Think of it as giving the self-driving car a "superpower" to see the world in reliable 3D, even in tricky situations. Here is how it works, explained with simple analogies:

1. The Problem: The "Guessing Game"

Current methods are like a person trying to guess the distance to a mountain while wearing foggy glasses.

  • Monocular AI (Single Camera): It's like looking at a painting. It can tell you what things look like, but it's bad at knowing exactly how far they are. It might think a small car is a giant truck because it lacks a reference point.
  • Standard 3D Vision (Multi-View Stereo): It compares views from different positions to judge depth, like our two eyes do. But when you drive straight down a highway, the views of the road ahead barely differ between frames (low "parallax"), so there is almost nothing to triangulate. The estimate becomes unstable and the depth map starts to flicker.
  • LiDAR (The Laser Scanner): This is the car's "laser eyes." It gives perfect distance measurements, but it's sparse. It's like a net with huge holes in it; it catches the big objects but misses the details in between. Also, sometimes the net gets blocked by rain or dirt.
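The parallax problem above can be made concrete with the standard triangulation formula, depth = focal length × baseline / disparity (the camera numbers below are hypothetical, chosen only for illustration):

```python
# Illustration (not from the paper): why low parallax makes
# triangulated depth unstable. With depth = f * baseline / disparity,
# the same small matching error in pixels causes a much larger depth
# error when the true disparity is small (distant points, or views
# that barely differ).

f = 1000.0      # focal length in pixels (hypothetical camera)
baseline = 0.5  # effective baseline in meters between the two views

def depth(disparity_px):
    return f * baseline / disparity_px

# A point 50 m away corresponds to a disparity of f*b/50 = 10 px.
true_disp = f * baseline / 50.0
print(depth(true_disp), depth(true_disp + 1.0))  # 50.0 vs ~45.5 m

# The same 1 px error at 5 px disparity (a 100 m point) is far worse:
print(depth(5.0), depth(5.0 + 1.0))              # 100.0 vs ~83.3 m
```

The error grows roughly with the square of the distance, which is why featureless highways and distant road surfaces are exactly where pure multi-view stereo breaks down.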

2. The Solution: DriveMVS (The "Smart Detective")

DriveMVS is a new framework that combines the best of all worlds. It uses three main tricks to solve the problem:

A. The "Anchor" (LiDAR Prompts)

Imagine you are trying to draw a map of a city, but you only have a few GPS coordinates from a friend.

  • Old way: You try to guess the rest of the map based on the drawing style. You might get the shape right, but the scale is wrong (your city is too big or too small).
  • DriveMVS way: It takes those few GPS coordinates (the LiDAR data) and uses them as anchors. It says, "Okay, this specific point is definitely 50 meters away." It locks the entire map to that real-world scale. Even if the rest of the map is fuzzy, the scale is now anchored to real measurements.
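A minimal sketch of the anchoring idea (this is a classic least-squares scale fit, not DriveMVS's actual prompting module, and all the numbers are made up): a handful of sparse LiDAR returns is enough to recover the one scale factor that turns a relative depth map into metric depth.

```python
# Hypothetical sketch: lock a relative depth map to metric scale
# using a few sparse LiDAR "anchor" points. The least-squares fit
# below finds the single scale s minimizing sum((s*pred - lidar)^2)
# over pixels where both a prediction and a LiDAR return exist.

def metric_scale(pred_depths, lidar_depths):
    num = sum(p * l for p, l in zip(pred_depths, lidar_depths))
    den = sum(p * p for p in pred_depths)
    return num / den

# Suppose the network's relative depths are roughly 2x too small:
pred = [5.0, 10.0, 20.0]        # unitless relative depths
lidar = [10.1, 19.8, 40.2]      # sparse but metrically correct (m)
s = metric_scale(pred, lidar)
scaled = [s * p for p in pred]  # the whole map is now in real meters
print(s, scaled)
```

The key point: the LiDAR does not need to cover every pixel; a few trustworthy anchors propagate metric scale to the entire dense map.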

B. The "Triple-Threat Team" (Triple-Cues Combiner)

To fill in the gaps between the GPS points, DriveMVS doesn't just rely on one source of information. It hires a team of three experts who talk to each other:

  1. The Geometer: Looks at the geometry from multiple camera angles (Multi-View Stereo).
  2. The Artist: Looks at the picture and understands the scene's structure and context (Monocular AI).
  3. The Measurer: Looks at the sparse LiDAR data for hard, factual distance numbers.

Instead of letting them argue, DriveMVS uses a special "translator" (a Transformer) to blend their opinions. If the Geometer is confused because the road is empty, the Measurer steps in with a hard fact. If the Measurer has a blind spot, the Artist fills in the gap based on what a road usually looks like.
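The actual combiner is a learned Transformer; a hand-set confidence-weighted average is a much simpler stand-in, but it shows the same behavior (all depths and confidences below are invented for illustration):

```python
# Hypothetical stand-in for the Triple-Cues Combiner. Each "expert"
# reports a depth and a confidence for one pixel; when one expert is
# unreliable its weight drops and the others take over. DriveMVS
# learns this blending with a Transformer instead of fixed weights.

def fuse(cues):
    """cues: list of (depth_m, confidence) pairs for one pixel."""
    total = sum(c for _, c in cues)
    return sum(d * c for d, c in cues) / total

# Textured pixel: multi-view geometry is confident and dominates.
#        (geometer, artist, measurer)
print(fuse([(30.2, 0.9), (28.0, 0.3), (30.0, 0.8)]))

# Textureless road pixel: geometry is lost (tiny confidence), so the
# monocular prior and the LiDAR measurement carry the decision.
print(fuse([(55.0, 0.05), (31.0, 0.5), (30.5, 0.9)]))
```

Even with the geometer wildly wrong in the second case (55 m vs. ~30 m), the fused estimate stays close to the reliable cues, which is exactly the failure mode the combiner is designed to survive.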

C. The "Time-Traveler" (Spatio-Temporal Decoder)

Self-driving cars move, so the view changes every second.

  • Old way: The car looks at the road, calculates the distance, then looks again a split-second later and calculates it again. Sometimes the numbers jump around, making the car's "vision" flicker like a bad video.
  • DriveMVS way: It remembers the past. It looks at the current frame and the previous frames together. It understands that the car is moving, so it uses that motion to smooth out the depth map. It's like watching a movie instead of a slideshow; the depth feels continuous and stable, not jittery.
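As a simplified stand-in for the spatio-temporal decoder, an exponential moving average over per-frame estimates shows the effect being described (the real decoder is learned and motion-aware; this sketch and its numbers are purely illustrative):

```python
# Simplified stand-in for temporal fusion: blend each new frame's
# depth with the running estimate. Independent per-frame jitter is
# damped into a stable sequence, which is the "movie instead of a
# slideshow" effect described above.

def smooth(depth_sequence, alpha=0.3):
    """alpha: weight on the newest frame (hypothetical value)."""
    out = [depth_sequence[0]]
    for d in depth_sequence[1:]:
        out.append(alpha * d + (1 - alpha) * out[-1])
    return out

# Per-frame estimates of the same static point jitter around 40 m:
raw = [40.0, 43.0, 38.0, 41.5, 39.0]
print(smooth(raw))  # values stay within ~1 m of 40 instead of ~3 m
```

A fixed blend like this would lag behind genuinely moving objects, which is why the paper's decoder conditions on the car's motion rather than averaging blindly.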

3. Why This Matters

The authors tested DriveMVS on real-world driving datasets (like KITTI and Waymo) and found that it outperforms previous methods, setting a new state of the art.

  • It's accurate: It knows the exact distance in meters, not just "close" or "far."
  • It's stable: The 3D view doesn't flicker when the car stops or drives straight.
  • It's tough: It works even when the LiDAR is blocked, when it's raining, or when the road has no texture.

The Bottom Line

DriveMVS is like giving a self-driving car a 3D vision system that never gets dizzy. It combines the "hard facts" from laser scanners with the "intuition" of AI, and it remembers what it saw a second ago to keep the picture smooth. This makes self-driving cars safer, more reliable, and ready for the real world, where conditions are rarely perfect.