Diff2DGS: Reliable Reconstruction of Occluded Surgical Scenes via 2D Gaussian Splatting

Diff2DGS is a novel two-stage framework that combines diffusion-based video inpainting with a learnable deformation model adapted for 2D Gaussian Splatting. Together, the two stages achieve high-fidelity, real-time 3D reconstruction of occluded, deformable surgical scenes with improved depth accuracy.

Tianyi Song, Danail Stoyanov, Evangelos Mazomenos, Francisco Vasconcelos

Published 2026-02-23

Imagine you are trying to build a perfect, 3D holographic map of a delicate surgery happening inside a patient's body. You have a video camera (the robot's eye) recording the scene. But there's a huge problem: the surgeon's tools (scissors, clamps, etc.) keep blocking the view, hiding the soft, squishy tissues underneath.

If you try to build a 3D map while those tools are in the way, your map will have giant holes or "glitches" where the tools are. It's like trying to draw a picture of a landscape, but someone keeps holding a giant black sign in front of your face. You can't see the mountains behind the sign, so you just leave a black hole in your drawing.

Diff2DGS is a new, clever system designed to fix this problem. Think of it as a two-step "magic repair kit" for surgical videos.

Step 1: The "Time-Traveling Art Restorer" (The Inpainting Stage)

First, the system looks at the video and finds all the parts covered by surgical tools. It doesn't just guess what's underneath; it uses a special kind of AI called a Diffusion Model.

Imagine you are looking at an old, damaged painting where a piece is missing. A normal AI might just guess a random color to fill the hole. But this Diffusion Model is like a master art restorer who has seen thousands of similar paintings. It looks at the frames before and after the tool moved. It understands how the tissue moves and flows over time.

It essentially says, "Okay, I know what this tissue looked like a second ago, and I know how it's stretching right now. I can 'paint' over the tool with a perfect, realistic version of the tissue that should be there." This creates a clean video where the tools have vanished, and the hidden tissue is revealed with high consistency.
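
The paper uses a diffusion model for this stage; as a rough intuition for the temporal idea only, here is a toy (non-diffusion) sketch that fills tool-masked pixels by borrowing from the nearest frame in time where that pixel is visible. The function name and setup are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def temporal_inpaint(frames, masks):
    """Toy temporal fill: replace each occluded pixel with its value from
    the nearest frame (in time) where that pixel is NOT occluded.
    frames: (T, H, W) float array; masks: (T, H, W) bool, True = tool pixel.
    A real diffusion inpainter would synthesize plausible, moving tissue
    instead of copying static values."""
    T = len(frames)
    out = frames.copy()
    for t in range(T):
        occluded = masks[t].copy()
        if not occluded.any():
            continue
        # Search outward in time for an unoccluded view of each pixel.
        for dt in range(1, T):
            for s in (t - dt, t + dt):
                if 0 <= s < T:
                    fill = occluded & ~masks[s]
                    out[t][fill] = frames[s][fill]
                    occluded &= ~fill
            if not occluded.any():
                break
    return out
```

The real system is far more sophisticated (it models how tissue deforms between frames rather than copying pixels), but the core input/output is the same: a masked video goes in, a tool-free video comes out.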

Step 2: The "Stretchy Clay Sculptor" (The 2D Gaussian Splatting Stage)

Now that we have a clean video, we need to turn it into a 3D model. Traditional methods are like trying to build a statue out of rigid, hard clay. If the tissue moves or stretches (which it does a lot in surgery), the hard clay cracks or looks fake.

Diff2DGS uses a technique called 2D Gaussian Splatting. Imagine instead of hard clay, you are using thousands of tiny, flat, stretchy stickers (or "splats") that can float in 3D space.

  • The Magic Trick: The system adds a special "Learnable Deformation Model." Think of this as giving the stickers a memory of how they stretch and twist. When the tissue moves, these stickers don't just break; they stretch and slide smoothly, just like real skin and muscle.
  • The Result: You get a 3D model that looks incredibly real and moves naturally, even when the tissue is being pulled or pushed.
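
To make the "memory of how they stretch" idea concrete, here is a minimal sketch of a learnable deformation field: a tiny network that maps a splat's canonical center plus a timestamp to per-splat offsets. All layer sizes, names, and output dimensions here are illustrative assumptions; the paper's actual deformation model will differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative deformation network: input (x, y, z, t), output 7 deltas
# (3 position + 3 rotation + 1 scale). In training these weights would be
# optimized so rendered frames match the inpainted video.
W1 = rng.normal(0, 0.1, (4, 32))
W2 = rng.normal(0, 0.1, (32, 7))

def deform(centers, t):
    """Move canonical splat centers to their positions at time t."""
    x = np.concatenate([centers, np.full((len(centers), 1), t)], axis=1)
    h = np.tanh(x @ W1)          # one hidden layer for the sketch
    delta = h @ W2
    return centers + delta[:, :3]  # apply only the positional offset here

canonical = rng.normal(0, 1.0, (100, 3))  # canonical splat centers
moved = deform(canonical, t=0.5)
```

The key design choice this sketch illustrates: the splats live in one fixed "canonical" configuration, and a single shared network warps all of them to each moment in time, so they bend smoothly together instead of cracking apart.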

Why is this a big deal?

Most previous methods had two major flaws:

  1. They ignored the holes: They tried to build the 3D map even while the tools were blocking the view, leading to blurry, glitchy spots.
  2. They cared only about the picture, not the depth: They made the image look pretty (like a high-quality photo), but if you looked at the 3D shape from a different angle, the depth was wrong. It was like a flat painting that looked 3D from the front but collapsed when you walked to the side.

Diff2DGS fixes both:

  • It "erases" the tools first, so the 3D map has no holes.
  • It uses a special "depth loss" training method. Imagine a teacher who doesn't just grade your drawing on how colorful it is, but also checks if the mountains are the right height. This ensures the 3D shape is accurate, not just pretty.
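
A hedged sketch of what such a combined objective could look like: a photometric (color) term plus a weighted depth term, so training penalizes wrong geometry as well as wrong pixels. The exact terms and weighting here are assumptions, not the paper's actual loss.

```python
import numpy as np

def training_loss(rendered_rgb, target_rgb, rendered_depth, target_depth,
                  lambda_depth=0.1):
    """Sketch of photometric + depth supervision (L1 on both).
    lambda_depth balances "looks right" against "is the right shape";
    its value here is illustrative."""
    photo = np.abs(rendered_rgb - target_rgb).mean()      # color accuracy
    depth = np.abs(rendered_depth - target_depth).mean()  # geometric accuracy
    return photo + lambda_depth * depth
```

Without the depth term, the optimizer can "cheat" by painting convincing colors onto wrong geometry, which is exactly the flat-painting failure mode described above.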

The Bottom Line

The researchers tested this on real surgical robot videos. The results were impressive:

  • The 3D models were sharper and more accurate than those from prior methods on the datasets tested.
  • The "hidden" areas behind the tools were reconstructed convincingly, rather than left as holes or blur.
  • The system is fast enough to potentially work in real-time, which is crucial for helping surgeons navigate or for training robots to do surgery autonomously.

In short, Diff2DGS is like giving a surgeon a pair of X-ray glasses that can see through the tools, combined with a sculptor who can instantly mold a perfect, moving 3D map of the inside of the body, ensuring nothing is hidden and everything is measured correctly.
