Multi-Modal Monocular Endoscopic Depth and Pose Estimation with Edge-Guided Self-Supervision

This paper introduces PRISM, a self-supervised framework for monocular endoscopic depth and pose estimation. It leverages edge detection and intrinsic luminance decomposition to overcome challenges such as texture-less surfaces and illumination variations, and demonstrates that self-supervised training on real-world data with optimized frame sampling outperforms supervised training on synthetic phantoms.

Xinwei Ju, Rema Daher, Danail Stoyanov, Sophia Bano, Francisco Vasconcelos

Published 2026-02-23

Imagine you are a doctor performing a colonoscopy. You are guiding a tiny camera through a long, winding, pink tunnel (the colon) to look for polyps or cancer. The problem? The inside of the colon is often slippery, shiny, and lacks distinct patterns (like a smooth, wet cave). Sometimes the light reflects off the wet walls, creating blinding glare.

Because of this, it's very hard for a computer to figure out how far away the tissue is (depth) or how the camera is moving (pose). If the computer gets lost, it might miss a dangerous spot or tell the doctor the wrong location.

This paper introduces a new AI system called PRISM (Pose-Refinement with Intrinsic Shading and edge Maps) designed to help the camera "see" its way through this confusing tunnel better than ever before.

Here is how it works, explained with simple analogies:

1. The Problem: The "Blind" Camera

Standard AI tries to guess depth and movement just by looking at the video picture (RGB). But in a colon, this is like trying to navigate a dark, foggy room where the walls are all the same color and shiny.

  • The Glare: The light reflects off the wet tissue, confusing the AI.
  • The Smoothness: Without texture (like a brick wall or a tree), the AI doesn't know how far away things are.
  • The Data Gap: We don't have a "map" (ground truth) for real human guts to teach the AI, so it has to learn on its own.

2. The Solution: PRISM's "Super-Senses"

Instead of just looking at the raw video, PRISM gives the AI two extra "super-senses" to help it understand the scene:

A. The "Shadow Detective" (Luminance)

  • The Analogy: Imagine walking into a dark cave with a flashlight. Even if the walls are smooth, you can tell how far away a wall is because the light gets dimmer the further it travels.
  • How PRISM uses it: The AI separates the "shiny reflection" from the "actual brightness" of the tissue. It learns that if an area is naturally darker (shading), it's likely further away, and if it's bright, it's closer. This helps the AI ignore the confusing glare and focus on the shape of the tunnel.
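To make the "shadow detective" idea concrete, here is a toy sketch of splitting an image into smooth shading and a glare-like residual. This is not the paper's learned decomposition; it is a classical Retinex-style approximation (luminance minus a blurred copy of itself), written with plain numpy for illustration only.

```python
import numpy as np

def decompose_luminance(rgb, blur_radius=3):
    """Toy split of an RGB image into smooth shading and a specular residual.

    NOT the paper's learned decomposition -- a hand-rolled approximation:
    shading = locally averaged luminance, specular = whatever is much
    brighter than that smooth estimate (i.e. likely glare).
    """
    # Luminance as a weighted sum of RGB channels (Rec. 601 weights)
    lum = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

    # Approximate the shading with a box blur (stand-in for a Gaussian)
    k = 2 * blur_radius + 1
    h, w = lum.shape
    padded = np.pad(lum, blur_radius, mode="edge")
    shading = np.zeros_like(lum)
    for dy in range(k):
        for dx in range(k):
            shading += padded[dy:dy + h, dx:dx + w]
    shading /= k * k

    # Anything much brighter than the smooth shading is likely glare
    specular = np.clip(lum - shading, 0.0, None)
    return shading, specular
```

The intuition carries over directly: the smooth `shading` component encodes how light falls off with distance (useful for depth), while the `specular` residual flags the glare the network should learn to ignore.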

B. The "Outline Artist" (Edge Maps)

  • The Analogy: If you look at a white wall in a white room, you can't tell where the corner is. But if someone draws a black line around the corner, you instantly know where the edge is.
  • How PRISM uses it: The AI uses a special tool to draw invisible "sketches" of the folds and ridges in the colon. These sketches act as a roadmap, telling the AI exactly where the boundaries are, even if the colors are confusing.
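The "invisible sketches" are edge maps. As a minimal stand-in for the paper's edge detector, a classic 3x3 Sobel gradient-magnitude filter already captures the idea of drawing black lines around folds and ridges:

```python
import numpy as np

def sobel_edge_map(gray):
    """Gradient-magnitude edge map from a grayscale image.

    A classical Sobel filter used here as a hypothetical stand-in for the
    paper's edge detector: strong gradients (fold boundaries) light up,
    flat regions stay dark.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    gx = np.zeros((h, w), dtype=float)
    gy = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            patch = padded[dy:dy + h, dx:dx + w]
            gx += kx[dy, dx] * patch
            gy += ky[dy, dx] * patch
    mag = np.hypot(gx, gy)
    return mag / mag.max() if mag.max() > 0 else mag
```

Even when two neighbouring regions have nearly the same colour, a sharp boundary between them produces a strong gradient, so the edge map gives the network geometry cues that raw RGB lacks.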

3. The Training Strategy: "Learn, Then Polish"

The authors didn't just throw all this data at the AI at once. They used a clever three-step training process, like teaching a student:

  1. Step 1: The Basics (Pre-training): First, they teach the AI how to draw those "outlines" and how to separate "shadows" from "light" on its own.
  2. Step 2: The Guessing Game (Joint Training): The AI tries to guess the depth and movement using the video, the outlines, and the shadows. It gets feedback by checking if the picture looks consistent when it moves the camera virtually.
  3. Step 3: The Fine-Tuning (Refinement): This is the secret sauce. The authors noticed that while the AI got really good at guessing depth, it got a bit sloppy at guessing movement. So, they froze the depth part and gave the movement part a specific "homework assignment": Make sure your movement guess matches the outlines perfectly. This "polished" the AI's ability to navigate without messing up its depth perception.
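The "guessing game" in Step 2 is a photometric reprojection loss: if the depth and pose guesses are right, warping one frame into the other should reproduce it almost exactly. Here is a deliberately simplified 1-DoF sketch (pinhole camera, pure sideways translation, nearest-neighbour sampling), with the pose parameter `pose_tx` as an illustrative assumption rather than the paper's full 6-DoF formulation:

```python
import numpy as np

def photometric_loss(img_ref, img_src, depth, pose_tx, fx=1.0):
    """Toy 1-DoF photometric reprojection loss (illustrative only).

    Each reference pixel is warped into the source frame by the disparity
    fx * pose_tx / depth, then compared photometrically. The correct pose
    (and depth) makes the warped images agree, so the loss is minimal.
    """
    h, w = img_ref.shape
    loss, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            disparity = fx * pose_tx / depth[y, x]
            xs = int(round(x + disparity))
            if 0 <= xs < w:  # skip pixels that warp outside the image
                loss += abs(img_ref[y, x] - img_src[y, xs])
                count += 1
    return loss / max(count, 1)
```

This is the self-supervision signal: no ground-truth map is needed, because a wrong pose or depth simply fails to line the two frames up. Step 3 then corresponds to holding `depth` fixed and adjusting only the pose to minimize such a loss (in the paper, computed on edge maps rather than raw pixels).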

4. The Big Discovery: Real vs. Fake Data

The paper made a very surprising discovery that changes how we should train these robots:

  • The Old Way: Scientists used to train AI on "phantom" data (fake, plastic models of colons) because they had perfect maps to teach the AI.
  • The New Finding: The authors found that training on real, messy human video data is actually better, even if that real data doesn't have perfect maps!
  • The Analogy: It's like learning to drive. You could learn on a perfect, empty driving simulator (Phantom), but you'll actually become a better driver if you practice on a real, bumpy, unpredictable city street (Real Data), even if you don't have a perfect GPS map. The AI learns to handle the "real world" chaos better.

5. The Result

The paper reports that PRISM achieves state-of-the-art results in:

  • Depth: It creates a 3D map of the colon that is sharper and has fewer "ghost" errors (hallucinations).
  • Navigation: It tracks the camera's path more accurately, especially around tricky folds.
  • Robustness: It doesn't get confused by bright reflections or smooth, shiny surfaces.

Summary

Think of PRISM as giving a colonoscopy camera a pair of glasses that highlight the edges of the tunnel and a flashlight that understands shadows. By teaching this AI on real, messy human videos rather than perfect plastic models, the researchers created a system that can navigate the human body more safely and accurately, helping doctors find problems they might have otherwise missed.
