Imagine you are trying to navigate a dark, foggy cave using only a flashlight. The flashlight (your camera) shows you the walls, but because the walls are smooth and shiny (like wet tissue in the human body), the light just bounces off, leaving you blind to the true shape of the cave. You can see some dots of light hitting the wall, but you can't tell how far away the wall really is or what the bumps look like.
This is the exact problem surgeons face with endoscopic robots. These tiny robots travel inside the human body to perform surgery, but the "inside" is often a smooth, wet, and poorly lit environment. Standard cameras struggle to guess the 3D shape of the organs, which is dangerous if the robot needs to move precisely.
Here is how the paper "EndoDDC" solves this problem, explained simply:
1. The Problem: The "Blurry Map"
Usually, robots try to guess depth (how far away things are) by looking at a 2D picture.
- The Old Way: They try to learn from thousands of pictures, but they need a "perfect map" (a ground-truth 3D depth map) to learn from. Getting these perfect maps inside a living human body is nearly impossible.
- The Result: Without a perfect guide, the robot's guess is often wrong. It might think a smooth wall is a deep hole, or vice versa. This is like trying to draw a detailed map of a cave while wearing foggy glasses.
2. The Solution: "Filling in the Dots"
The researchers realized that while we can't get a perfect map, we can get a few accurate dots. Special sensors can tell the robot, "Hey, this specific pixel is exactly 5cm away." But these dots are sparse (scattered like stars in the sky), leaving huge gaps in between.
EndoDDC is a new system that takes these scattered dots and fills in the gaps to create a perfect, smooth 3D map.
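To make "a few accurate dots with huge gaps" concrete, here is a minimal sketch in plain NumPy. It is illustrative only, not the paper's actual data format or method: a depth image where roughly 1% of pixels hold real measurements, plus a crude nearest-dot fill as a baseline for what a completion system must beat. The depth values, coverage rate, and variable names are all invented for the example.

```python
import numpy as np

# Illustrative only: a sparse depth map is an image where just a handful
# of pixels carry real distance readings. Values here are made up.
rng = np.random.default_rng(0)

H, W = 64, 64
# A gently sloping "organ wall" standing in for the true scene depth.
true_depth = np.fromfunction(lambda y, x: 5.0 + 0.02 * x + 0.01 * y, (H, W))

# The sensor returns accurate depth at ~1% of pixels; the rest is unknown.
mask = rng.random((H, W)) < 0.01
sparse_depth = np.where(mask, true_depth, 0.0)  # 0.0 marks "no measurement"

print(f"known pixels: {mask.sum()} ({mask.mean():.1%} of the image)")

# A naive baseline: copy each unknown pixel from its nearest measured dot.
# A learned, image-guided completion model replaces this crude fill.
ys, xs = np.nonzero(mask)
yy, xx = np.mgrid[0:H, 0:W]
dist = (yy[..., None] - ys) ** 2 + (xx[..., None] - xs) ** 2
nearest = dist.argmin(axis=-1)
dense_depth = true_depth[ys[nearest], xs[nearest]]

print("mean abs error of naive fill:", np.abs(dense_depth - true_depth).mean())
```

The nearest-dot fill produces blocky, stair-stepped surfaces; the gap between that and a smooth, accurate map is exactly what EndoDDC's learned completion aims to close.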
3. How It Works: The "Smart Painter" Analogy
Think of the EndoDDC system as a master painter who is trying to restore a damaged, old painting.
- The Input (The Clues): The painter is given two things:
- The original photo (the RGB image).
- A few scattered, accurate dots of paint (the sparse depth data) telling them exactly where the edges are.
- The Secret Sauce (The Gradient): The painter doesn't just look at the dots; they look at how the depth changes between neighboring dots. If nearby dots jump sharply in depth, the surface is steep; if their depths barely differ, it is flat. The system uses this "slope information" (the depth gradient) to understand the shape better.
- The Magic Tool (Diffusion Model): This is the coolest part. Imagine the painter starts with a blank canvas covered in static noise (like TV snow).
- They use the scattered dots and the slope clues as a guide.
- Step-by-step, they "denoise" the image, slowly turning the static snow into a clear, sharp picture of the organ.
- Because they are guided by the accurate dots and the slope clues, they don't just guess; they reconstruct the shape with high precision, even in the dark, shiny parts of the cave.
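The denoising loop above can be caricatured in a few lines of NumPy. To be clear about what this is: EndoDDC uses a learned diffusion model conditioned on the RGB image, which this toy does not attempt. The sketch below only mimics the intuition of "start from static, refine step by step, trust the anchor dots": each step pulls every pixel toward its neighbors (shrinking the depth gradient) and then clamps the measured dots back to their exact values. All numbers and names are invented for illustration.

```python
import numpy as np

# Toy caricature of anchor-guided denoising; NOT the paper's model.
rng = np.random.default_rng(1)
H, W = 32, 32

# Ground-truth "wall" sloping away from the camera (made-up values).
true_depth = np.fromfunction(lambda y, x: 4.0 + 0.05 * x, (H, W))
mask = rng.random((H, W)) < 0.05          # sparse but accurate dots
anchors = np.where(mask, true_depth, 0.0)

depth = rng.normal(size=(H, W))           # the "TV snow" starting canvas

for step in range(400):
    # Smoothness pull: move each pixel toward the mean of its 4 neighbors,
    # which shrinks the local depth gradient (the "slope clue").
    p = np.pad(depth, 1, mode="edge")
    neighbors = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
    depth = 0.5 * depth + 0.5 * neighbors
    # Anchor pull: the measured dots are trusted, so clamp them back exactly.
    depth[mask] = anchors[mask]

print(f"mean abs error after refinement: {np.abs(depth - true_depth).mean():.3f}")
```

Even this crude loop recovers the slope from pure noise, because the anchors keep every refinement step tethered to reality; the learned diffusion model plays the same role far more powerfully, using the RGB image to decide where edges and bumps belong.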
4. Why It's Better Than Before
- Old Robots: Tried to guess the whole shape from scratch. They often got lost in the "fog" (weak textures) or got confused by the "glare" (shiny reflections).
- EndoDDC: Uses the few accurate dots it has as anchors. It then uses its "smart painter" brain to fill in the rest, ensuring the final map is smooth, accurate, and safe.
The Real-World Impact
Think of this as giving a surgical robot super-vision.
- Before: The robot might accidentally bump into a delicate organ because it thought a smooth wall was far away.
- Now: The robot sees a clear, 3D "hologram" of the inside of the body. It knows exactly where the bumps, curves, and edges are, allowing it to navigate safely and perform surgery with the precision of a human master surgeon.
In short: EndoDDC takes a few scattered, accurate measurements and uses a smart, step-by-step "denoising" process to turn them into a perfect, high-definition 3D map of the human body, making robotic surgery safer and more precise.