Imagine you are trying to navigate a submarine through a murky, dark ocean. You have two eyes (cameras) to judge how far away things are, but the water is playing tricks on you. The light bends, colors fade, and particles in the water scatter everything, making it look like you're looking through a dirty, foggy window. This is the challenge of underwater stereo depth estimation: teaching a robot to "see" distance accurately when the water is messing with its vision.
The paper introduces a new system called StereoAdapter-2. Here is how it works, explained through simple analogies:
1. The Old Problem: The "Slow, Local" Detective
Previous systems tried to solve this by using a method similar to a detective who only looks at the immediate neighborhood. They would look at a small patch of the image, guess the distance, and then slowly refine that guess over and over again.
- The Flaw: Because they only looked locally, it took them many small steps to connect the dots between two distant points (like a fish on one side of the image and a rock on the other). In the murky underwater world, where textures are often missing (like a blank wall of blue water), this "slow detective" got confused and gave up.
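The "slow detective" flaw can be made concrete with a toy sketch (an illustration of local propagation, not the paper's actual algorithm): if each pixel only ever learns from its immediate neighbors, information from one side of the image needs roughly as many update steps as the image is wide to reach the other side.

```python
import numpy as np

def local_refine_steps(width, src):
    """Count how many 3-pixel-neighborhood update steps it takes for
    information starting at pixel `src` to reach every pixel in a
    1-D row of `width` pixels. Each step, a pixel only sees its
    immediate left and right neighbors (the 'local detective')."""
    reached = np.zeros(width, dtype=bool)
    reached[src] = True
    steps = 0
    while not reached.all():
        # One local update: knowledge spreads by at most 1 pixel per step.
        grown = reached.copy()
        grown[1:] |= reached[:-1]
        grown[:-1] |= reached[1:]
        reached = grown
        steps += 1
    return steps
```

For a row of 10 pixels, information from the left edge needs 9 steps to reach the right edge, which is exactly why texture-less stretches of blank blue water defeat purely local refinement.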
2. The New Solution: The "Super-Scanning" Radar
The authors replaced the old detective with a new tool called ConvSS2D. Think of this as a high-tech radar that doesn't just look at neighbors; it scans the entire room in four directions at once (up, down, left, right).
- The Magic: Instead of taking 10 small steps to understand a long distance, this new radar sees the whole path in a single step. It respects the "rules of the road" for stereo vision (called epipolar geometry), meaning it knows exactly how to scan horizontally to find matching points, while also scanning vertically to make sure the structure makes sense.
- The Result: It's faster, smarter, and can figure out distances in "blank" blue water where the old methods failed.
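To make the "radar" idea concrete, here is a toy four-direction scan (a heavily simplified stand-in for ConvSS2D, not the paper's actual layer): each pixel accumulates exponentially decayed information along its entire row and column, so two pixels on opposite sides of the image are connected in a single pass instead of many local steps.

```python
import numpy as np

def directional_scan(feat, decay=0.9):
    """Toy 4-direction scan over an (H, W) feature map.
    Each pixel accumulates a decayed running sum along its row
    (left->right, right->left) and column (top->bottom, bottom->top),
    so information crosses the whole image in one pass.
    Returns an (H, W, 4) stack, one channel per scan direction."""
    H, W = feat.shape
    out = np.zeros((H, W, 4))
    # Left -> right: the horizontal direction, aligned with where
    # stereo matches live (the epipolar "rules of the road").
    acc = np.zeros(H)
    for x in range(W):
        acc = decay * acc + feat[:, x]
        out[:, x, 0] = acc
    # Right -> left.
    acc = np.zeros(H)
    for x in range(W - 1, -1, -1):
        acc = decay * acc + feat[:, x]
        out[:, x, 1] = acc
    # Top -> bottom: the vertical direction, checking that the
    # overall structure stays consistent.
    acc = np.zeros(W)
    for y in range(H):
        acc = decay * acc + feat[y, :]
        out[y, :, 2] = acc
    # Bottom -> top.
    acc = np.zeros(W)
    for y in range(H - 1, -1, -1):
        acc = decay * acc + feat[y, :]
        out[y, :, 3] = acc
    return out
```

The real ConvSS2D uses learned, input-dependent state-space dynamics rather than a fixed decay, but the key property is the same: a signal planted at one pixel is visible (attenuated) at every other pixel in its row and column after a single scan.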
3. The Data Dilemma: The "Virtual Aquarium"
To teach a robot how to see underwater, you need thousands of examples of underwater images with perfect "answer keys" (knowing exactly how far away every pixel is).
- The Problem: Real underwater data is rare, expensive to collect, and dangerous to get.
- The Fix: The team built a Virtual Aquarium called UW-StereoDepth-80K.
- Step 1: They took normal photos of the world (like a city or a forest).
- Step 2: They used an AI artist to "paint" these photos to look like they were taken underwater (adding fog, color shifts, and bubbles).
- Step 3: They used a view-synthesis AI to generate the second camera's image from the first, ensuring the 3D geometry between the two views stayed perfectly consistent.
- The Outcome: They created 80,000 perfect underwater training pairs without ever leaving the lab.
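Step 3's geometric guarantee rests on the standard rectified-stereo relation: disparity = focal_length x baseline / depth. Here is a toy sketch (a nearest-pixel warp for illustration, not the paper's generative pipeline) of how the second camera's view follows from the first image plus its depth map:

```python
import numpy as np

def disparity_from_depth(depth, focal_px, baseline_m):
    """Standard rectified-stereo relation: d = f * B / Z.
    Closer objects (small Z) shift more between the two cameras."""
    return focal_px * baseline_m / depth

def warp_left_to_right(left, disparity):
    """Toy nearest-pixel forward warp of an (H, W) left image into the
    right camera's view: each pixel moves left by its (rounded)
    disparity. Pixels nothing lands on stay 0 (occlusions)."""
    H, W = left.shape
    right = np.zeros_like(left)
    for y in range(H):
        for x in range(W):
            xr = x - int(round(disparity[y, x]))
            if 0 <= xr < W:
                right[y, xr] = left[y, x]
    return right
```

Because every "answer key" depth value pins down exactly where each pixel must appear in the second view, the synthetic pairs come with pixel-perfect ground truth for free.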
4. The "Smart Adapter": Learning Without Forgetting
The system uses a pre-trained AI brain (a "Foundation Model") that is already very good at seeing the world. Instead of retraining the whole brain from scratch, they used a technique called LoRA (Low-Rank Adaptation).
- The Analogy: Imagine a master chef who knows how to cook any cuisine. Instead of teaching them how to cook again, you just give them a special "underwater spice kit" (the adapter). Now, the chef can instantly cook perfect underwater meals without forgetting how to cook land meals. This makes the system efficient and adaptable.
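The "spice kit" has a precise mathematical form: LoRA freezes the pretrained weight matrix W and learns only a tiny low-rank correction B x A. A minimal numpy sketch of the standard formulation (the paper's exact adapter placement and ranks may differ):

```python
import numpy as np

class LoRALinear:
    """A linear layer with a frozen weight W plus a trainable
    low-rank update (B @ A), rank r << min(d_in, d_out).
    Only A and B are trained: the 'chef' keeps all original
    skills while the tiny adapter adds the underwater ones."""
    def __init__(self, W, r=2, alpha=2, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                               # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, r))            # trainable up-projection
        self.scale = alpha / r

    def forward(self, x):
        # Base output + low-rank correction. B starts at zero, so the
        # adapted model is initially identical to the frozen one.
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)
```

With, say, d_in = d_out = 1024 and r = 8, the adapter trains about 16K parameters per layer instead of the full 1M, which is what makes adapting the foundation model to underwater imagery so cheap.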
5. The Real-World Test: The BlueROV2
The team didn't just test this on a computer; they put it on a real robot submarine (BlueROV2) in a giant indoor water tank.
- The Result: The robot navigated through obstacles with much higher accuracy than previous models. It didn't get confused by the foggy water or the lack of texture. It was like giving the robot "glasses" that could see through the murk.
Summary
StereoAdapter-2 is like giving a robot submarine a new pair of super-eyes.
- It uses a fast, all-seeing radar (ConvSS2D) instead of a slow, local detective.
- It was trained in a massive, AI-generated virtual aquarium because real underwater data is too scarce.
- It uses a smart adapter to learn quickly without forgetting its original intelligence.
The result? A robot that can see depth clearly in the deep, dark, and murky ocean, making underwater exploration safer and more autonomous.