PromptStereo: Zero-Shot Stereo Matching via Structure and Motion Prompts

Imagine you are trying to figure out how far away objects are in a room just by looking at two photos of it (one from your left eye, one from your right). This is called stereo matching. It's one of the technologies that lets self-driving cars "see" depth and avoid crashing.

For a long time, computers were terrible at this unless they were trained specifically on the exact type of room or street they were looking at. If you trained a car on city streets, it would get confused in a forest. This is the "Zero-Shot" problem: making a model work on new things it has never seen before.

Recently, scientists discovered a "magic eye" (called a Monocular Depth Foundation Model) that is incredibly good at guessing depth from just one photo. It has seen millions of images and learned the general rules of how the world looks.

The Problem with the Old Way
Current methods try to use this "magic eye" to help with stereo matching. They take the "magic eye's" guess and feed it into a standard update engine (a gated recurrent unit, or GRU) that refines the answer step by step.

Think of the "magic eye" as a wise, experienced architect who knows how buildings should look. The GRU is like a construction foreman who is very rigid.

The Issue: The foreman (GRU) is too small and stubborn. When the architect tries to whisper a complex idea to him, the foreman can't hold the whole thought in his head. He gets confused, distorts the architect's advice, and ends up making a mess. He also can't handle extreme changes in the building's shape.

The Solution: PromptStereo
The authors' method, PromptStereo, fires the rigid foreman and replaces him with a much larger, more flexible assistant (called the Prompt Recurrent Unit, or PRU).

Here is how they did it, using simple analogies:

1. The New Assistant (PRU)

Instead of using a small, rigid foreman, they built their update engine directly out of the "magic eye's" own brain (the decoder).

Analogy: Imagine the "magic eye" is a master chef. Instead of asking a sous-chef to guess the recipe, you let the master chef refine the dish themselves. Because the assistant is part of the master chef, it already knows all the secret recipes (priors) and doesn't need to be taught from scratch. It's huge, flexible, and can handle any ingredient.

2. The "Prompts" (Structure & Motion)

Since the assistant is now part of the chef, how do we tell it what to do with the two photos? We use "Prompts."

Structure Prompt (SP): This is like handing the chef a blueprint of the room's shape. It says, "Hey, look at the walls and corners; make sure the depth matches the structure."
Motion Prompt (MP): This is like showing the chef how far each object appears to shift between the left photo and the right photo. It says, "Look, this car sits slightly further to the left in the second photo; use that shift to calculate the distance."
Why it's better: In the old days, the foreman tried to force these clues into his tiny head, which distorted the information. With the new assistant, these clues are gently "prompted" into the system, guiding it without breaking its existing knowledge.

3. The "Affine-Invariant Fusion" (The Translator)

The "magic eye" gives a depth guess that is correct in shape but might be the wrong size (like a model car that looks like a real car but is tiny). The stereo camera gives a guess that is the right size but might be shaky.

Analogy: Imagine you have a map drawn on a rubber sheet (the magic eye) and a ruler (the stereo camera). They don't match perfectly. The paper uses a special translator (Affine-Invariant Fusion) to stretch and shrink the rubber sheet so it fits the ruler perfectly before they start working together. This ensures they start on the same page.

The Result

When they tested this new system:

It's a genius at guessing: It works amazingly well on things it has never seen before (like driving in the rain or looking at transparent glass), which usually breaks other computers.
It's fast: Even though it's smarter, it's not slower. Because it inherits what the "magic eye" already knows instead of "re-learning" everything, it needs fewer rounds of refinement and often finishes the job faster.
It's flexible: You can swap this new assistant into almost any existing stereo matching system, and it instantly makes that system smarter.

In a Nutshell:
The paper says, "Stop trying to force a giant, smart brain into a tiny, rigid box. Instead, build the refinement process out of the smart brain itself, and just give it gentle hints (prompts) about what to look for." The result is a stereo system that keeps working well even in situations it has never encountered before.

1. Problem Statement

Stereo matching aims to estimate dense pixel-wise disparities from a pair of rectified images, a critical task for 3D scene understanding in applications like autonomous driving. While modern deep learning methods have improved, zero-shot generalization (performing well on unseen domains without retraining) remains a challenge.
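As a reminder of why disparity is the quantity of interest: for a rectified pair, depth follows from disparity through the standard relation Z = f * B / d (focal length in pixels times baseline, divided by disparity). The snippet below is only an illustrative sketch; the function name and the camera values are invented for the example and do not come from the paper.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, eps=1e-6):
    """Convert disparity (pixels) to metric depth via Z = f * B / d."""
    d = np.asarray(disparity_px, dtype=np.float64)
    # Clamp tiny or zero disparities so the division stays finite.
    return focal_px * baseline_m / np.maximum(d, eps)

# Hypothetical camera: 1000 px focal length, 0.5 m baseline.
depth = disparity_to_depth([50.0, 25.0], focal_px=1000.0, baseline_m=0.5)
print(depth)  # [10. 20.]
```

Halving the disparity doubles the depth, which is why small disparity errors on distant objects translate into large depth errors.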

Recent approaches leverage monocular depth foundation models (e.g., Depth Anything) to provide strong priors for better generalization. However, existing methods face significant limitations in the iterative refinement stage:

Limited Capacity: Most methods use GRU-based recurrent units (popularized by RAFT-Stereo) to refine disparities. These GRUs are independent of vision foundation models, must be trained from scratch, and lack the ability to inherit strong monocular priors.
Restricted Representation: GRUs constrain hidden states within a narrow range, making it difficult to handle extreme disparity variations or complex geometric structures.
Information Distortion: GRUs fuse inputs and hidden states via direct convolutions, which can distort original state information and lead to ambiguous guidance when integrating monocular depth priors.

2. Methodology: PromptStereo

The authors propose PromptStereo, a novel framework that replaces the standard GRU with a Prompt Recurrent Unit (PRU). This approach rethinks iterative refinement from the perspective of vision foundation models.

A. Core Architecture

Baseline: The method builds upon MonSter, utilizing a pre-trained Depth Anything V2 for feature extraction. The monocular branch is frozen to preserve robust depth priors.
Affine-Invariant Fusion (AIF): Before refinement, the initial disparity (from cost volume) and relative depth (from monocular model) are fused. Since monocular depth is scale/shift ambiguous, the authors normalize both inputs using median and mean absolute deviation (affine-invariant normalization). They then project the monocular depth into the disparity space and fuse it with the initial disparity using a confidence map. This ensures global geometric consistency.
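The normalization and fusion steps described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names are hypothetical, the "projection" is assumed to simply re-apply the disparity's own shift and scale, the blend is assumed to be a per-pixel convex combination, and the monocular output is assumed to relate to disparity by a positive scale and a shift.

```python
import numpy as np

def affine_normalize(x, eps=1e-6):
    """Center by the median and scale by the mean absolute deviation."""
    shift = np.median(x)
    scale = np.mean(np.abs(x - shift)) + eps
    return (x - shift) / scale, shift, scale

def affine_invariant_fusion(init_disp, mono_rel, confidence):
    """Align mono_rel to the disparity statistics, then blend per pixel.

    Assumptions (not from the paper): the projection re-applies the
    disparity's shift and scale, and the blend is a convex combination.
    """
    _, disp_shift, disp_scale = affine_normalize(init_disp)
    mono_norm, _, _ = affine_normalize(mono_rel)
    mono_as_disp = mono_norm * disp_scale + disp_shift
    return confidence * init_disp + (1.0 - confidence) * mono_as_disp

rng = np.random.default_rng(0)
mono_rel = rng.random((4, 6))          # relative prediction, unknown scale/shift
init_disp = 2.0 * mono_rel + 3.0       # toy disparity: an affine copy of it
confidence = np.full_like(init_disp, 0.5)
fused = affine_invariant_fusion(init_disp, mono_rel, confidence)
```

In this toy case the two inputs differ only by a scale and a shift, so the aligned monocular map coincides with the disparity and the fused result reproduces it regardless of the confidence values. With real data the confidence map would decide, pixel by pixel, which source to trust.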

B. Prompt Recurrent Unit (PRU)

The PRU is the core innovation, designed to replace the GRU.

Foundation-Based Design: Instead of a custom GRU, PRU utilizes the decoder architecture of monocular depth foundation models (specifically the multi-resolution refinement layers of DPT/Depth Anything). This allows the unit to naturally inherit rich monocular depth priors.
Prompting Mechanism: To integrate stereo-specific information without disrupting the inherited priors, the authors introduce two types of "prompts" added via residual connections:
1. Structure Prompt (SP): Encodes the geometric discrepancy between the current stereo disparity and the normalized monocular relative depth. This guides the model to correct structural misalignments.
2. Motion Prompt (MP): Encodes stereo motion cues (local cost volume and current disparity), similar to traditional flow-based methods but integrated into the foundation model decoder.
Update Strategy: Unlike GRUs which use reset and update gates, PRU employs a simpler, more flexible update strategy:
- It removes the reset gate, using only an update gate ($z_k$).
- It avoids constraining hidden states to a narrow range, allowing for more flexible representation.
- It injects prompts only at the highest resolution, which limits the added computational cost.
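To make the contrast in the update strategy concrete, here is a toy sketch that places a standard GRU step next to an update-gate-only step. The GRU equations are the usual ones; the second function is an assumption made for illustration. The summary only states that the reset gate is removed, the hidden state is not range-limited, and prompts arrive through residual connections, so the dense weights, the single linear "candidate" layer (standing in for the decoder's refinement layers), and all names here are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, W):
    """Standard GRU update in dense form, for reference."""
    hx = np.concatenate([h, x], axis=-1)
    z = sigmoid(hx @ W["z"])                      # update gate
    r = sigmoid(hx @ W["r"])                      # reset gate
    candidate = np.tanh(np.concatenate([r * h, x], axis=-1) @ W["h"])
    return (1.0 - z) * h + z * candidate          # stays in [-1, 1] if h does

def update_gate_only_step(h, prompts, W):
    """Hypothetical PRU-style update: no reset gate, no tanh squashing."""
    h_in = h + prompts                            # prompts enter as a residual add
    z = sigmoid(h_in @ W["z"])                    # update gate z_k
    candidate = h_in @ W["h"]                     # unbounded; stand-in for decoder layers
    return (1.0 - z) * h + z * candidate

C = 4
rng = np.random.default_rng(0)
h0 = np.ones((2, C))
x = rng.normal(size=(2, C))
W_gru = {k: rng.normal(size=(2 * C, C)) for k in ("z", "r", "h")}
W_pru = {"z": np.zeros((C, C)), "h": 5.0 * np.eye(C)}

h_gru = gru_step(h0, x, W_gru)
h_pru = update_gate_only_step(h0, np.zeros((2, C)), W_pru)
```

With these toy weights the GRU state cannot leave the interval [-1, 1] because its candidate passes through tanh, while the gate-only state moves to 3.0 (a gate of 0.5 blending the old state 1.0 with a candidate of 5.0). That is the sense in which dropping the squashing allows a wider representation range.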

3. Key Contributions

Prompt Recurrent Unit (PRU): A novel recurrent unit built upon the decoder of monocular depth foundation models. It directly inherits monocular priors, offering superior representation capacity and scalability compared to traditional GRUs.
Structure and Motion Prompts (SP & MP): A mechanism to inject monocular structure and stereo motion cues into the PRU. This avoids distorting state information and provides clear, independent guidance for iterative refinement.
Affine-Invariant Fusion (AIF): A robust initialization strategy that aligns initial disparity and monocular depth under a normalized scale, improving convergence and geometric consistency.
State-of-the-Art Performance: The proposed PromptStereo achieves SOTA zero-shot generalization across multiple datasets while maintaining comparable or faster inference speeds.

4. Experimental Results

The authors evaluated PromptStereo on zero-shot generalization benchmarks, training either on Scene Flow or on an unrestricted mix of datasets (the "unlimited" setting) and testing on unseen real-world datasets (KITTI, Middlebury, ETH3D, DrivingStereo, Booster).

Basic Benchmarks (Scene Flow Training): PromptStereo achieved SOTA performance on most metrics. Notably, it reduced the error on Middlebury 2021 by nearly 50% compared to the baseline MonSter.
Unlimited Training Setting: When trained on a massive mixed dataset, PromptStereo outperformed even FoundationStereo (which uses large-scale data augmentation) and BridgeDepth.
Advanced Benchmarks (Challenging Scenarios):
- On Booster (containing reflective and transparent surfaces), PromptStereo surpassed the second-best method (MGStereo) by over 50% in the unlimited training setting.
- It demonstrated superior performance in handling specular reflections and transparent objects, areas where traditional methods often fail.
Efficiency: Despite the complex architecture, PromptStereo maintains inference speeds comparable to or faster than GRU-based methods (e.g., 0.36s vs 0.64s for MonSter on KITTI).
Ablation Studies:
- Replacing GRU with PRU alone yielded significant gains.
- Adding SP and AIF further improved accuracy with minimal time cost.
- Removing pre-trained weights caused a performance drop, confirming the importance of inheriting priors.
- The model converges significantly faster than baselines, reaching near-optimal performance in fewer iterations.
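The summary above reports error reductions without naming the metrics. Two that are widely used in stereo benchmarks are the end-point error (mean absolute disparity error) and the bad-pixel rate (share of pixels whose error exceeds a threshold). The sketch below shows how they are typically computed; the function names, the threshold, and the sample values are illustrative assumptions, not figures from the paper.

```python
import numpy as np

def end_point_error(pred, gt, valid):
    """Mean absolute disparity error over valid pixels."""
    return float(np.mean(np.abs(pred - gt)[valid]))

def bad_pixel_rate(pred, gt, valid, tau):
    """Percentage of valid pixels whose error exceeds tau pixels."""
    err = np.abs(pred - gt)[valid]
    return float(np.mean(err > tau) * 100.0)

gt = np.array([10.0, 20.0, 30.0, 40.0])
pred = np.array([10.5, 20.0, 33.0, 0.0])
valid = np.array([True, True, True, False])      # last pixel has no ground truth

epe = end_point_error(pred, gt, valid)           # (0.5 + 0 + 3) / 3
bad2 = bad_pixel_rate(pred, gt, valid, tau=2.0)  # 1 of 3 valid pixels
```

Masking invalid pixels matters on real datasets, where ground-truth disparity is often sparse.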

5. Significance

This paper argues for a shift in how stereo matching handles refinement: away from designing custom recurrent units (GRUs) and toward reusing the iterative refinement capabilities of pre-trained vision foundation models.

Prompt-Guided Refinement: It establishes that "prompts" (structure and motion cues) can effectively guide foundation models for stereo tasks, bridging the gap between monocular depth priors and stereo geometry.
Scalability: The method demonstrates that foundation models can be adapted for dense stereo tasks with better scalability and generalization than previous approaches.
Practical Impact: The ability to handle extreme conditions (transparency, reflections) and achieve strong zero-shot performance makes this approach highly relevant for real-world autonomous systems where retraining on every new environment is impractical.

In conclusion, PromptStereo provides strong evidence that prompt-guided iterative refinement is a promising direction for achieving robust, generalizable, and efficient stereo matching.
