4D Monocular Surgical Reconstruction under Arbitrary Camera Motions

This paper proposes Local-EndoGS, a novel 4D reconstruction framework for monocular surgical scenes under arbitrary camera motions. It combines a progressive window-based representation with a coarse-to-fine optimization strategy to achieve high-quality, scalable reconstruction without relying on stereo depth priors or an accurate structure-from-motion initialization.

Jiwei Shan, Zeyu Cai, Cheng-Tai Hsieh, Yirui Li, Hao Liu, Lijun Han, Hesheng Wang, Shing Shin Cheng

Published 2026-02-20

Imagine you are trying to build a perfect 3D model of a squishy, moving piece of fruit (like a grape) while someone is juggling it, spinning it, and poking it with a stick. Now, imagine you can only watch this happen through a tiny, single-lens camera that is itself darting around the fruit.

That is essentially the challenge of 4D Surgical Reconstruction. Surgeons use endoscopes (tiny cameras) to look inside the body. The body is full of soft tissues that breathe, pulse, and get pushed around by tools. The camera moves wildly. The goal is to turn that shaky, 2D video into a stable, high-quality 3D movie that doctors can use for training or planning surgery.

For a long time, computers struggled with this. If the camera moved too much, the 3D model would fall apart, looking like a melted wax figure.

Enter Local-EndoGS, a new method designed to solve this problem. Here is how it works, explained through simple analogies:

1. The Problem: The "One-Size-Fits-All" Trap

Previous methods tried to build the entire surgery scene using one giant, static blueprint (called a "canonical space"). They assumed the camera stayed mostly still.

  • The Analogy: Imagine trying to describe a whole movie using a single photograph. If the camera zooms in, pans left, or moves forward, that single photo can't possibly capture the new details or the changing perspective. The result is a blurry, broken mess.
  • The Reality: When the endoscope moves around inside the body, the "single blueprint" approach fails because the scene changes too drastically for one model to handle.

2. The Solution: The "Rolling Window" Approach

Local-EndoGS changes the strategy. Instead of trying to build the whole movie at once, it breaks the video into small, manageable chunks.

  • The Analogy: Think of a scrolling marquee or a film strip. Instead of looking at the whole reel of film, the computer looks at just 5 seconds at a time. It builds a perfect 3D model for that specific 5-second clip. Then, it slides the window forward, builds the next clip, and so on.
  • Why it works: By focusing on small windows where the camera doesn't move too wildly, the computer can create a highly accurate 3D model for that specific moment. It stitches these high-quality "snapshots" together to form the full 4D reconstruction.
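The rolling-window idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `reconstruct_window` is a hypothetical stand-in for the per-clip optimizer, and the window/overlap sizes are made-up toy values.

```python
# Split a frame sequence into short overlapping windows, "reconstruct"
# each one independently, then keep the per-window results as a timeline.

def make_windows(num_frames, window_size=20, overlap=5):
    """Return (start, end) frame ranges that tile the video with overlap."""
    step = window_size - overlap
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window_size, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += step
    return windows

def reconstruct_window(frames):
    # Placeholder: a real system would fit a local 3D model to this clip.
    return {"frames": frames, "model": f"local-model-{frames[0]}-{frames[-1]}"}

windows = make_windows(num_frames=50, window_size=20, overlap=5)
clips = [reconstruct_window(list(range(s, e))) for s, e in windows]
print(windows)  # [(0, 20), (15, 35), (30, 50)]
```

The overlap between consecutive windows is what lets the "snapshots" be stitched together: the shared frames give each new local model an anchor to the previous one.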

3. The "Coarse-to-Fine" Start-Up

Starting a 3D model from a single camera view is hard because the computer doesn't know how far away things are (it's like looking at a flat painting and not knowing if the tree is 1 meter away or 100 meters away).

  • The Analogy: Imagine trying to build a sandcastle without a bucket. You start by dumping a huge pile of sand (Coarse) to get the general shape. Then, you use a small trowel to carve out the details and fix the edges (Fine).
  • How they do it:
    1. Coarse: They use a smart AI (called Track-Any-Point) to follow pixels across the video frames, creating a rough, 3D "skeleton" of the tissue.
    2. Fine: They look at where the model looks wrong (like a blurry edge) and use a depth-sensing AI to fix just those specific spots, refining the shape until it's perfect.
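The two steps above can be sketched as a tiny toy pipeline. Everything here is illustrative: the tracker and depth model are stand-ins (`coarse_init`, `fine_refine`, and the `depth_fix` callback are hypothetical names, not the paper's API), and the numbers are made up.

```python
# Coarse: lift tracked 2D points to rough 3D using a single guessed depth.
# Fine: re-estimate depth only where the rendered model "looks wrong",
# i.e. where the per-point error exceeds a threshold.

def coarse_init(tracks, rough_depth):
    """Lift each tracked 2D point (u, v) to a rough 3D point (u, v, d)."""
    return [(u, v, rough_depth) for (u, v) in tracks]

def fine_refine(points, errors, depth_fix, threshold=0.5):
    """Replace the depth of high-error points using a depth estimator."""
    refined = []
    for (u, v, d), err in zip(points, errors):
        refined.append((u, v, depth_fix(u, v) if err > threshold else d))
    return refined

tracks = [(10, 12), (40, 8), (25, 30)]       # 2D pixel tracks (toy values)
pts = coarse_init(tracks, rough_depth=1.0)
errors = [0.1, 0.9, 0.2]                     # per-point rendering error
pts = fine_refine(pts, errors, depth_fix=lambda u, v: 1.5)
print(pts)  # only the high-error point's depth changes
```

The key design choice is that the expensive depth estimator is only consulted for the "blurry edges", keeping the refinement cheap and targeted.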

4. The "Physics Police"

Even with a good start, the computer might make the tissue move in impossible ways (like a jellyfish turning inside out).

  • The Analogy: Imagine a puppet show. If the puppeteer pulls the strings too hard, the puppet's arm might snap backward. Local-EndoGS acts like a strict physics teacher. It tells the computer: "Hey, soft tissue stretches, but it doesn't teleport or twist into a knot. Keep it realistic."
  • The Result: The computer adds "rules" (priors) to ensure the tissue moves naturally, preserving the shape and structure of the organs.
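One common way to encode such a "no teleporting, no knots" rule is a local-rigidity penalty: neighboring points may move, but the distances between them should change slowly. The snippet below is a toy version of that idea, not the paper's exact loss.

```python
# An as-rigid-as-possible-style regularizer: penalize the squared change
# in distance between neighboring points from one frame to the next.
import math

def rigidity_loss(points_t0, points_t1, neighbor_pairs):
    """Sum of squared changes in neighbor distances between two frames."""
    loss = 0.0
    for i, j in neighbor_pairs:
        d0 = math.dist(points_t0[i], points_t0[j])
        d1 = math.dist(points_t1[i], points_t1[j])
        loss += (d1 - d0) ** 2
    return loss

# Two frames: the second point drifts, stretching its edge to the first.
frame0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
frame1 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pairs = [(0, 1), (0, 2)]
print(rigidity_loss(frame0, frame1, pairs))  # 1.0: only edge (0,1) stretched
```

Added to the optimization objective, a term like this pushes the reconstruction toward motions a soft but connected tissue could actually perform.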

Why This Matters

  • For Surgeons: It creates a "virtual twin" of a patient's anatomy. Surgeons can practice on this 3D model before touching the real patient, reducing risks.
  • For Training: Medical students can watch a high-quality 3D replay of a surgery, seeing exactly how the tissue deforms under different angles, rather than just watching a flat 2D screen.
  • The Big Win: Unlike previous methods that needed two cameras (stereo) or a perfectly still camera, this works with one moving camera, which is exactly how real surgeries happen.

In summary: Local-EndoGS is like a smart film editor that cuts a chaotic surgery video into tiny, manageable scenes, builds a perfect 3D model for each scene using smart guessing and physics rules, and then stitches them together to create a realistic, moving 3D map of the human body.
