Confidence-aware Monocular Depth Estimation for Minimally Invasive Surgery

This paper proposes a confidence-aware monocular depth estimation framework for minimally invasive surgery. It leverages calibrated confidence targets and a specialized loss function to improve depth accuracy and produce reliable per-pixel confidence maps, addressing challenges posed by endoscopic image artifacts such as smoke and blur.

Muhammad Asad, Emanuele Colleoni, Pritesh Mehta, Nicolas Toussaint, Ricardo Sanchez-Matilla, Maria Robu, Faisal Bashir, Rahim Mohammadi, Imanol Luengo, Danail Stoyanov

Published 2026-03-05

Imagine you are trying to navigate a car through a thick fog while driving at night. You have a GPS (the depth estimation model) that tells you how far away objects are. But because of the fog, the GPS sometimes guesses wrong. If the GPS is wrong, you might crash.

Now, imagine if that GPS didn't just give you a distance, but also a "Confidence Meter" that said, "I'm 90% sure this tree is 10 meters away, but I'm only 20% sure that rock is 5 meters away because the fog is too thick."

That is exactly what this paper is about, but instead of a car, it's a robotic surgeon, and instead of fog, it's the messy, smoky, blurry environment inside a human body during surgery.

The Problem: The "Blurry Fog" of Surgery

Minimally Invasive Surgery (MIS) uses tiny cameras (endoscopes) to see inside the body. It's like trying to do delicate surgery through a straw. But the view is often terrible:

  • Smoke: From burning tissue.
  • Reflections: Shiny wet organs that confuse the camera.
  • Blur: When the surgeon moves the camera too fast.
  • Occlusions: Surgical tools blocking the view.

Current AI models try to guess how far away things are (depth estimation), but when the image is messy, they guess blindly. If they guess wrong, the surgeon might cut the wrong thing or the robot might crash into a vital organ. The old models didn't know when they were guessing; they just gave an answer and hoped for the best.

The Solution: A "Trustworthy GPS" for Surgeons

The authors (a team from Medtronic and UCL) built a new system that teaches the AI to know when it doesn't know. They call this a "Confidence-Aware" framework.

Here is how they did it, using three simple steps:

1. The "Panel of Experts" (Creating the Truth)

To teach the AI what "confidence" looks like, the researchers didn't rely on a single model's guess. They used a team of five different AI models (an ensemble) acting like a panel of experts.

  • They showed these experts a clear, high-quality 3D view of the surgery (using stereo cameras, like human eyes).
  • If all five experts agree on the distance, the area is clear and trustworthy.
  • If the experts argue (one says 5 cm, another says 10 cm), the area is foggy and unreliable.
  • They turned this disagreement into a "Confidence Map"—a heat map showing exactly which parts of the image are safe to trust and which are risky.
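The "panel of experts" idea can be sketched in a few lines: run several depth models on the same frame, then turn their per-pixel disagreement into a confidence score. This is a minimal numpy sketch of the general ensemble-disagreement technique, not the paper's exact recipe; the `scale` parameter (how fast confidence decays with disagreement) is a hypothetical tuning constant.

```python
import numpy as np

def confidence_from_ensemble(depth_maps, scale=1.0):
    """Turn disagreement between ensemble depth predictions into a
    per-pixel confidence map in (0, 1].

    depth_maps: array of shape (n_models, H, W), one depth map per expert.
    Returns (consensus_depth, confidence) where confidence is 1.0 where
    all experts agree exactly and decays toward 0 as they diverge.
    """
    depth_maps = np.asarray(depth_maps, dtype=float)
    consensus = depth_maps.mean(axis=0)        # average answer of the panel
    disagreement = depth_maps.std(axis=0)      # per-pixel spread across experts
    confidence = np.exp(-disagreement / scale) # std 0 -> 1.0; large std -> ~0
    return consensus, confidence
```

For example, a pixel where all five experts say "5" gets confidence 1.0, while a pixel where they answer anywhere from 3 to 11 gets a confidence near zero.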

2. The "Smart Teacher" (Training the Model)

Usually, when you train an AI, you show it a picture and say, "This is the answer." If the picture is blurry, the AI still tries to learn from it, which confuses it.

  • The Old Way: The teacher forces the student to memorize the blurry picture.
  • The New Way (Confidence-Aware Loss): The teacher looks at the "Confidence Map" first.
    • "Hey, this part of the image is clear! Focus hard here."
    • "Hey, this part is covered in smoke and the experts disagree! Ignore this part for now."
  • This teaches the AI to prioritize learning from clear, reliable data and to be careful about noisy data.
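The "smart teacher" step amounts to weighting the training loss by the confidence map, so clear pixels dominate learning and smoky pixels are down-weighted. Here is a minimal sketch of one such confidence-weighted loss (a weighted L1 error, chosen for illustration; the paper's actual loss may differ):

```python
import numpy as np

def confidence_weighted_l1(pred, target, confidence, eps=1e-8):
    """Confidence-aware training loss sketch.

    pred, target: (H, W) depth maps; confidence: (H, W) weights in [0, 1].
    Pixels the expert panel agreed on (confidence ~1) contribute fully;
    pixels they argued about (confidence ~0) are effectively ignored.
    """
    per_pixel_error = np.abs(pred - target)
    # Weighted average: eps guards against an all-zero confidence map.
    return (confidence * per_pixel_error).sum() / (confidence.sum() + eps)
```

A wildly wrong prediction on a zero-confidence pixel adds nothing to the loss, which is exactly the "ignore this part for now" behavior described above.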

3. The "Self-Check" (The Confidence Head)

Finally, they added a special little module (a "head") to the AI's brain.

  • When the AI looks at a new surgery video, it doesn't just output a depth map.
  • It also outputs a Confidence Map alongside it.
  • Now, the surgeon (or the robot) can see: "Here is the depth, and here is a red zone where I am not sure. Don't trust the red zone."
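Structurally, a "confidence head" is just a second lightweight output branch sharing the same backbone features as the depth branch. This toy numpy sketch shows the shape of the idea; the layer sizes, names, and single linear layer per head are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoHeadDecoder:
    """Toy two-head output: one head regresses depth, the other predicts
    per-pixel confidence squashed into (0, 1) by a sigmoid."""

    def __init__(self, feat_dim=8):
        # Each "head" is a single linear projection here, purely for illustration.
        self.w_depth = rng.normal(size=(feat_dim,))
        self.w_conf = rng.normal(size=(feat_dim,))

    def forward(self, features):
        # features: (H, W, feat_dim) map from a shared encoder.
        depth = features @ self.w_depth                          # unconstrained depth
        conf = 1.0 / (1.0 + np.exp(-(features @ self.w_conf)))   # sigmoid -> (0, 1)
        return depth, conf
```

Both heads read the same features, so the confidence prediction costs almost nothing extra at inference time; the "red zone" shown to the surgeon is simply the low-confidence region of `conf`.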

The Results: Smarter, Safer Surgery

They tested this on real surgical videos (some from labs, some from real surgeries).

  • Accuracy: The new model was about 8% more accurate than the old models, especially in the messy, smoky parts of the surgery.
  • Reliability: It successfully identified the "foggy" areas and told the system to be careful.
  • Real-world impact: On a specific dataset with surgical tools, the error in measuring distance dropped significantly. This means the robot is less likely to accidentally poke a vital organ because it "knew" the view was too blurry to trust.

The Big Picture

Think of this paper as giving the surgical robot a gut feeling.
Before, the robot was like a confident idiot: it gave an answer even when it was blind.
Now, the robot is like a cautious expert: it gives an answer, but if the view is bad, it raises its hand and says, "I'm not sure about this part, please be careful."

This doesn't just make the robot smarter; it makes surgery safer by ensuring that when the computer says "move forward," the surgeon knows it's a safe bet.