Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision

This paper introduces PRISM, a self-supervised framework for monocular endoscopic depth and pose estimation. It leverages edge detection and intrinsic luminance decomposition to overcome challenges such as texture-less surfaces and illumination variations, and demonstrates that self-supervised training on real-world data with optimized frame sampling outperforms supervised training on synthetic phantoms.

Xinwei Ju, Rema Daher, Danail Stoyanov, Sophia Bano, Francisco Vasconcelos

Published 2026-02-23

Imagine you are a doctor performing a colonoscopy. You are guiding a tiny camera through a long, winding, pink tunnel (the colon) to look for polyps or cancer. The problem? The inside of the colon is often slippery, shiny, and lacks distinct patterns (like a smooth, wet cave). Sometimes the light reflects off the wet walls, creating blinding glare.

Because of this, it's very hard for a computer to figure out how far away the tissue is (depth) or how the camera is moving (pose). If the computer gets lost, it might miss a dangerous spot or tell the doctor the wrong location.

This paper introduces a new AI system called PRISM (Pose-Refinement with Intrinsic Shading and edge Maps) designed to help the camera "see" its way through this confusing tunnel better than ever before.

Here is how it works, explained with simple analogies:

1. The Problem: The "Blind" Camera

Standard AI tries to guess depth and movement just by looking at the video picture (RGB). But in a colon, this is like trying to navigate a dark, foggy room where the walls are all the same color and shiny.

  • The Glare: The light reflects off the wet tissue, confusing the AI.
  • The Smoothness: Without texture (like a brick wall or a tree), the AI doesn't know how far away things are.
  • The Data Gap: We don't have a "map" (ground truth) for real human guts to teach the AI, so it has to learn on its own.

2. The Solution: PRISM's "Super-Senses"

Instead of just looking at the raw video, PRISM gives the AI two extra "super-senses" to help it understand the scene:

A. The "Shadow Detective" (Luminance)

  • The Analogy: Imagine walking into a dark cave with a flashlight. Even if the walls are smooth, you can tell how far away a wall is because the light gets dimmer the further it travels.
  • How PRISM uses it: The AI separates the "shiny reflection" from the "actual brightness" of the tissue. It learns that if an area is naturally darker (shading), it's likely further away, and if it's bright, it's closer. This helps the AI ignore the confusing glare and focus on the shape of the tunnel.
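To make the "shadow detective" idea concrete, here is a toy sketch of splitting an image into smooth shading and a glare-like residual. This is not the paper's learned decomposition; it is a classical Retinex-style approximation (luminance minus a blurred copy of itself), written with plain numpy for illustration only.

```python
import numpy as np

def decompose_luminance(rgb, blur_radius=3):
    """Toy split of an RGB image into smooth shading and a specular residual.

    NOT the paper's learned decomposition -- a hand-rolled approximation:
    shading = locally averaged luminance, specular = whatever is much
    brighter than that smooth estimate (i.e. likely glare).
    """
    # Luminance as a weighted sum of RGB channels (Rec. 601 weights)
    lum = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

    # Approximate the shading with a box blur (stand-in for a Gaussian)
    k = 2 * blur_radius + 1
    h, w = lum.shape
    padded = np.pad(lum, blur_radius, mode="edge")
    shading = np.zeros_like(lum)
    for dy in range(k):
        for dx in range(k):
            shading += padded[dy:dy + h, dx:dx + w]
    shading /= k * k

    # Anything much brighter than the smooth shading is likely glare
    specular = np.clip(lum - shading, 0.0, None)
    return shading, specular
```

The intuition carries over directly: the smooth `shading` component encodes how light falls off with distance (useful for depth), while the `specular` residual flags the glare the network should learn to ignore.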

B. The "Outline Artist" (Edge Maps)

  • The Analogy: If you look at a white wall in a white room, you can't tell where the corner is. But if someone draws a black line around the corner, you instantly know where the edge is.
  • How PRISM uses it: The AI uses a special tool to draw invisible "sketches" of the folds and ridges in the colon. These sketches act as a roadmap, telling the AI exactly where the boundaries are, even if the colors are confusing.
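The "invisible sketches" are edge maps. As a minimal stand-in for the paper's edge detector, a classic 3x3 Sobel gradient-magnitude filter already captures the idea of drawing black lines around folds and ridges:

```python
import numpy as np

def sobel_edge_map(gray):
    """Gradient-magnitude edge map from a grayscale image.

    A classical Sobel filter used here as a hypothetical stand-in for the
    paper's edge detector: strong gradients (fold boundaries) light up,
    flat regions stay dark.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    gx = np.zeros((h, w), dtype=float)
    gy = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            patch = padded[dy:dy + h, dx:dx + w]
            gx += kx[dy, dx] * patch
            gy += ky[dy, dx] * patch
    mag = np.hypot(gx, gy)
    return mag / mag.max() if mag.max() > 0 else mag
```

Even when two neighbouring regions have nearly the same colour, a sharp boundary between them produces a strong gradient, so the edge map gives the network geometry cues that raw RGB lacks.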

3. The Training Strategy: "Learn, Then Polish"

The authors didn't just throw all this data at the AI at once. They used a clever three-step training process, like teaching a student:

  1. Step 1: The Basics (Pre-training): First, they teach the AI how to draw those "outlines" and how to separate "shadows" from "light" on its own.
  2. Step 2: The Guessing Game (Joint Training): The AI tries to guess the depth and movement using the video, the outlines, and the shadows. It gets feedback by checking if the picture looks consistent when it moves the camera virtually.
  3. Step 3: The Fine-Tuning (Refinement): This is the secret sauce. The authors noticed that while the AI got really good at guessing depth, it got a bit sloppy at guessing movement. So, they froze the depth part and gave the movement part a specific "homework assignment": Make sure your movement guess matches the outlines perfectly. This "polished" the AI's ability to navigate without messing up its depth perception.
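The "guessing game" in Step 2 is a photometric reprojection loss: if the depth and pose guesses are right, warping one frame into the other should reproduce it almost exactly. Here is a deliberately simplified 1-DoF sketch (pinhole camera, pure sideways translation, nearest-neighbour sampling), with the pose parameter `pose_tx` as an illustrative assumption rather than the paper's full 6-DoF formulation:

```python
import numpy as np

def photometric_loss(img_ref, img_src, depth, pose_tx, fx=1.0):
    """Toy 1-DoF photometric reprojection loss (illustrative only).

    Each reference pixel is warped into the source frame by the disparity
    fx * pose_tx / depth, then compared photometrically. The correct pose
    (and depth) makes the warped images agree, so the loss is minimal.
    """
    h, w = img_ref.shape
    loss, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            disparity = fx * pose_tx / depth[y, x]
            xs = int(round(x + disparity))
            if 0 <= xs < w:  # skip pixels that warp outside the image
                loss += abs(img_ref[y, x] - img_src[y, xs])
                count += 1
    return loss / max(count, 1)
```

This is the self-supervision signal: no ground-truth map is needed, because a wrong pose or depth simply fails to line the two frames up. Step 3 then corresponds to holding `depth` fixed and adjusting only the pose to minimize such a loss (in the paper, computed on edge maps rather than raw pixels).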

4. The Big Discovery: Real vs. Fake Data

The paper made a very surprising discovery that changes how we should train these robots:

  • The Old Way: Scientists used to train AI on "phantom" data (fake, plastic models of colons) because they had perfect maps to teach the AI.
  • The New Finding: The authors found that training on real, messy human video data is actually better, even if that real data doesn't have perfect maps!
  • The Analogy: It's like learning to drive. You could learn on a perfect, empty driving simulator (Phantom), but you'll actually become a better driver if you practice on a real, bumpy, unpredictable city street (Real Data), even if you don't have a perfect GPS map. The AI learns to handle the "real world" chaos better.

5. The Result

The paper reports that PRISM achieves state-of-the-art results in:

  • Depth: It creates a 3D map of the colon that is sharper and has fewer "ghost" errors (hallucinations).
  • Navigation: It tracks the camera's path more accurately, especially around tricky folds.
  • Robustness: It doesn't get confused by bright reflections or smooth, shiny surfaces.

Summary

Think of PRISM as giving a colonoscopy camera a pair of glasses that highlight the edges of the tunnel and a flashlight that understands shadows. By teaching this AI on real, messy human videos rather than perfect plastic models, the researchers created a system that can navigate the human body more safely and accurately, helping doctors find problems they might have otherwise missed.
