Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery

This paper proposes a confidence-aware monocular depth estimation framework for minimally invasive surgery. It leverages calibrated confidence targets and a specialized loss function to improve depth accuracy and produce reliable per-pixel confidence maps, addressing challenges posed by endoscopic image artifacts such as smoke and blur.

Muhammad Asad, Emanuele Colleoni, Pritesh Mehta, Nicolas Toussaint, Ricardo Sanchez-Matilla, Maria Robu, Faisal Bashir, Rahim Mohammadi, Imanol Luengo, Danail Stoyanov

Published 2026-03-05

Imagine you are trying to navigate a car through a thick fog while driving at night. You have a GPS (the depth estimation model) that tells you how far away objects are. But because of the fog, the GPS sometimes guesses wrong. If the GPS is wrong, you might crash.

Now, imagine if that GPS didn't just give you a distance, but also a "Confidence Meter" that said, "I'm 90% sure this tree is 10 meters away, but I'm only 20% sure that rock is 5 meters away because the fog is too thick."

That is exactly what this paper is about, but instead of a car, it's a robotic surgeon, and instead of fog, it's the messy, smoky, blurry environment inside a human body during surgery.

The Problem: The "Blurry Fog" of Surgery

Minimally Invasive Surgery (MIS) uses tiny cameras (endoscopes) to see inside the body. It's like trying to do delicate surgery through a straw. But the view is often terrible:

  • Smoke: From burning tissue.
  • Reflections: Shiny wet organs that confuse the camera.
  • Blur: When the surgeon moves the camera too fast.
  • Occlusions: Surgical tools blocking the view.

Current AI models try to guess how far away things are (depth estimation), but when the image is messy, they guess blindly. If they guess wrong, the surgeon might cut the wrong thing or the robot might crash into a vital organ. The old models didn't know when they were guessing; they just gave an answer and hoped for the best.

The Solution: A "Trustworthy GPS" for Surgeons

The authors (a team from Medtronic and UCL) built a new system that teaches the AI to know when it doesn't know. They call this a "Confidence-Aware" framework.

Here is how they did it, using three simple steps:

1. The "Panel of Experts" (Creating the Truth)

To teach the AI what "confidence" looks like, the researchers didn't rely on a single model's guess. They used a team of five different AI models (an ensemble) acting like a panel of experts.

  • They showed these experts a clear, high-quality 3D view of the surgery (using stereo cameras, like human eyes).
  • If all five experts agree on the distance, the area is clear and trustworthy.
  • If the experts argue (one says 5 cm, another says 10 cm), the area is foggy and unreliable.
  • They turned this disagreement into a "Confidence Map"—a heat map showing exactly which parts of the image are safe to trust and which are risky.
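The "panel of experts" idea can be sketched in a few lines: run several depth models on the same frame, then turn their per-pixel disagreement into a confidence score. This is a minimal numpy sketch of the general ensemble-disagreement technique, not the paper's exact recipe; the `scale` parameter (how fast confidence decays with disagreement) is a hypothetical tuning constant.

```python
import numpy as np

def confidence_from_ensemble(depth_maps, scale=1.0):
    """Turn disagreement between ensemble depth predictions into a
    per-pixel confidence map in (0, 1].

    depth_maps: array of shape (n_models, H, W), one depth map per expert.
    Returns (consensus_depth, confidence) where confidence is 1.0 where
    all experts agree exactly and decays toward 0 as they diverge.
    """
    depth_maps = np.asarray(depth_maps, dtype=float)
    consensus = depth_maps.mean(axis=0)        # average answer of the panel
    disagreement = depth_maps.std(axis=0)      # per-pixel spread across experts
    confidence = np.exp(-disagreement / scale) # std 0 -> 1.0; large std -> ~0
    return consensus, confidence
```

For example, a pixel where all five experts say "5" gets confidence 1.0, while a pixel where they answer anywhere from 3 to 11 gets a confidence near zero.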

2. The "Smart Teacher" (Training the Model)

Usually, when you train an AI, you show it a picture and say, "This is the answer." If the picture is blurry, the AI still tries to learn from it, which confuses it.

  • The Old Way: The teacher forces the student to memorize the blurry picture.
  • The New Way (Confidence-Aware Loss): The teacher looks at the "Confidence Map" first.
    • "Hey, this part of the image is clear! Focus hard here."
    • "Hey, this part is covered in smoke and the experts disagree! Ignore this part for now."
  • This teaches the AI to prioritize learning from clear, reliable data and to be careful about noisy data.
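The "smart teacher" step amounts to weighting the training loss by the confidence map, so clear pixels dominate learning and smoky pixels are down-weighted. Here is a minimal sketch of one such confidence-weighted loss (a weighted L1 error, chosen for illustration; the paper's actual loss may differ):

```python
import numpy as np

def confidence_weighted_l1(pred, target, confidence, eps=1e-8):
    """Confidence-aware training loss sketch.

    pred, target: (H, W) depth maps; confidence: (H, W) weights in [0, 1].
    Pixels the expert panel agreed on (confidence ~1) contribute fully;
    pixels they argued about (confidence ~0) are effectively ignored.
    """
    per_pixel_error = np.abs(pred - target)
    # Weighted average: eps guards against an all-zero confidence map.
    return (confidence * per_pixel_error).sum() / (confidence.sum() + eps)
```

A wildly wrong prediction on a zero-confidence pixel adds nothing to the loss, which is exactly the "ignore this part for now" behavior described above.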

3. The "Self-Check" (The Confidence Head)

Finally, they added a special little module (a "head") to the AI's brain.

  • When the AI looks at a new surgery video, it doesn't just output a depth map.
  • It also outputs a Confidence Map alongside it.
  • Now, the surgeon (or the robot) can see: "Here is the depth, and here is a red zone where I am not sure. Don't trust the red zone."
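Structurally, a "confidence head" is just a second lightweight output branch sharing the same backbone features as the depth branch. This toy numpy sketch shows the shape of the idea; the layer sizes, names, and single linear layer per head are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoHeadDecoder:
    """Toy two-head output: one head regresses depth, the other predicts
    per-pixel confidence squashed into (0, 1) by a sigmoid."""

    def __init__(self, feat_dim=8):
        # Each "head" is a single linear projection here, purely for illustration.
        self.w_depth = rng.normal(size=(feat_dim,))
        self.w_conf = rng.normal(size=(feat_dim,))

    def forward(self, features):
        # features: (H, W, feat_dim) map from a shared encoder.
        depth = features @ self.w_depth                          # unconstrained depth
        conf = 1.0 / (1.0 + np.exp(-(features @ self.w_conf)))   # sigmoid -> (0, 1)
        return depth, conf
```

Both heads read the same features, so the confidence prediction costs almost nothing extra at inference time; the "red zone" shown to the surgeon is simply the low-confidence region of `conf`.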

The Results: Smarter, Safer Surgery

They tested this on real surgical videos (some from labs, some from real surgeries).

  • Accuracy: The new model was about 8% more accurate than the old models, especially in the messy, smoky parts of the surgery.
  • Reliability: It successfully identified the "foggy" areas and told the system to be careful.
  • Real-world impact: On a specific dataset with surgical tools, the error in measuring distance dropped significantly. This means the robot is less likely to accidentally poke a vital organ because it "knew" the view was too blurry to trust.

The Big Picture

Think of this paper as giving the surgical robot a gut feeling.
Before, the robot was like a confident idiot: it gave an answer even when it was blind.
Now, the robot is like a cautious expert: it gives an answer, but if the view is bad, it raises its hand and says, "I'm not sure about this part, please be careful."

This doesn't just make the robot smarter; it makes surgery safer by ensuring that when the computer says "move forward," the surgeon knows it's a safe bet.