Imagine you are a doctor holding a tiny camera on a stick (an endoscope) and sliding it inside a patient's body. You see a video of a beating heart, a twisting lung, or a squishy stomach lining. The problem? The camera is just a single eye, the tissues are constantly squishing and stretching like jelly, and the lighting changes wildly. It's incredibly hard to turn that 2D, wobbly video into a stable, 3D model that you can rotate and examine from any angle.
This paper introduces NeRFscopy, a clever AI tool designed to solve exactly that problem. Here is how it works, explained through simple analogies:
1. The Core Idea: The "Magic Clay" and the "Time Machine"
Think of the inside of the body as a lump of magic clay.
- The Old Way: Traditional 3D reconstruction tries to build this clay by taking photos and guessing where every pixel belongs. But because the clay is squishy and moving, the old methods often get confused, resulting in a blurry or broken model.
- The NeRFscopy Way: Instead of building the model piece by piece, NeRFscopy treats the scene as a digital cloud of invisible paint. It uses a neural network (a type of AI brain) to learn the "recipe" for this paint. It asks: "If I look at this spot from this angle, what color and density should I see?"
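The "recipe" query and the way answers get blended along a line of sight can be sketched in a few lines of NumPy. Note that `toy_radiance_field` below is a hypothetical stand-in for the trained neural network (just a smooth made-up function), not the paper's actual model; only the alpha-compositing step mirrors standard NeRF-style volume rendering:

```python
import numpy as np

def toy_radiance_field(points):
    """Hypothetical stand-in for the AI 'recipe': maps 3D points to color + density."""
    density = np.exp(-np.linalg.norm(points, axis=-1))  # densest near the origin
    rgb = 0.5 + 0.5 * np.tanh(points)                   # color varies smoothly, in [0, 1]
    return rgb, density

def render_ray(origin, direction, n_samples=64, near=0.0, far=4.0):
    """Standard NeRF-style volume rendering: sample along the ray, alpha-composite."""
    t = np.linspace(near, far, n_samples)
    delta = t[1] - t[0]                                  # spacing between samples
    points = origin + t[:, None] * direction             # (n_samples, 3) sample positions
    rgb, density = toy_radiance_field(points)
    alpha = 1.0 - np.exp(-density * delta)               # opacity of each sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # light surviving so far
    weights = trans * alpha                              # contribution of each sample
    return (weights[:, None] * rgb).sum(axis=0)          # final pixel color

color = render_ray(np.array([0.0, 0.0, -2.0]), np.array([0.0, 0.0, 1.0]))
```

The design point the analogy is making: the scene is never stored as a mesh or point cloud, only as this ask-a-question function, so any viewpoint can be rendered by shooting rays through the "paint cloud."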
2. Handling the Squishiness: The "Dance Instructor"
The biggest challenge is that tissues don't just move; they twist, stretch, and rotate all at once.
- The Problem: If you just tell the AI "move this point here," it might stretch the tissue like taffy in a way that doesn't make physical sense.
- The Solution: NeRFscopy uses a special mathematical tool called an SE(3) deformation field. Think of this as a dance instructor for the clay. Instead of telling every single grain of clay where to go individually, the instructor tells a whole neighborhood of them, "Rotate 10 degrees to the left and slide forward." Because each local patch rotates and slides as a unit, the tissue deforms in a physically plausible way rather than melting into a puddle.
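The "rotate and slide" instruction is exactly a rigid SE(3) motion: one rotation plus one translation. A minimal NumPy sketch (using Rodrigues' formula, with made-up example numbers) shows the property that matters: distances between points in a patch are preserved, so nothing stretches like taffy:

```python
import numpy as np

def se3_apply(rotvec, translation, points):
    """Apply one rigid SE(3) motion to a patch of 3D points.
    Rotation is given as an axis-angle vector (Rodrigues' formula)."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-8:
        return points + translation          # pure slide, no rotation
    k = rotvec / theta                       # unit rotation axis
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])         # cross-product matrix for k
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    return points @ R.T + translation

# "Rotate 10 degrees and slide forward" applied to a tiny two-point patch.
patch = np.array([[1.0, 0.0, 0.0], [1.1, 0.0, 0.0]])
moved = se3_apply(np.array([0.0, 0.0, np.deg2rad(10.0)]),
                  np.array([0.0, 0.0, 0.1]), patch)
```

In the actual method a neural network predicts a different small `rotvec` and `translation` for each point and time, so the field is only *locally* rigid; this toy applies one shared motion to the whole patch.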
3. Learning Without a Map: The "Self-Taught Artist"
Usually, to build a 3D model, you need a pre-made map or a 3D scanner. NeRFscopy is self-supervised, meaning it teaches itself.
- The Analogy: Imagine an artist trying to paint a 3D sculpture of a moving dancer, but they only have a flat video of the dancer. They don't have a blueprint.
- How it works: The AI looks at the video and makes a guess about the 3D shape. It then tries to "re-render" the video from that guess. If the re-rendered video looks different from the real video, the AI knows it made a mistake and adjusts its internal "recipe." It repeats this millions of times until the 3D model perfectly matches the 2D video.
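The guess-render-compare-adjust loop described above is ordinary gradient descent on a photometric loss. Here is a deliberately tiny toy version: instead of a full radiance field, it fits a single hypothetical brightness parameter until the "re-rendered" frame matches the observed one:

```python
import numpy as np

def photometric_loss(rendered, observed):
    """Mean squared error between the re-rendered frame and the real frame."""
    return np.mean((rendered - observed) ** 2)

observed = np.full((4, 4, 3), 0.7)        # the "real video frame"
param = 0.0                               # the model's initial guess (its "recipe")

for _ in range(200):
    rendered = np.full((4, 4, 3), param)  # re-render from the current guess
    grad = 2.0 * np.mean(rendered - observed)   # d(loss)/d(param)
    param -= 0.1 * grad                   # adjust the recipe to shrink the mismatch
```

The real system does the same thing with millions of network weights and automatic differentiation, but the logic is identical: the only supervision signal is "does my render match the video?"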
4. The Secret Sauce: "Depth Hints" and "Smoothness"
To make sure the AI doesn't get lost, the authors added a few "training wheels":
- Depth Hints: They use a pre-trained AI to guess how far away things are (like a rough sketch of the terrain). This gives NeRFscopy a head start, so it doesn't have to guess blindly.
- The "Smoothness" Rule: Real tissues don't jump around erratically from one frame to the next. The AI is taught to penalize "jumpy" movements, forcing the 3D model to flow smoothly over time, just like real flesh.
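Both "training wheels" are simply extra penalty terms added to the training loss. A hedged sketch with made-up numbers, assuming per-frame deformation offsets as the motion representation (the exact weighting and form in the paper may differ):

```python
import numpy as np

def depth_loss(pred_depth, hint_depth):
    """Penalize disagreement with the pre-trained depth estimator's rough 'sketch'."""
    return np.mean(np.abs(pred_depth - hint_depth))

def temporal_smoothness(deformations):
    """Penalize jumpy motion: size of the change in deformation between frames."""
    return np.mean(np.abs(np.diff(deformations, axis=0)))

# Hypothetical deformation offsets of one tissue point over 5 frames (T, 3).
smooth_motion = np.cumsum(np.full((5, 3), 0.01), axis=0)   # steady, flesh-like drift
jumpy_motion = np.array([[0, 0, 0], [0.5, 0, 0], [0, 0, 0],
                         [0.5, 0, 0], [0, 0, 0]], dtype=float)
```

A total loss would then look like `photometric + w1 * depth_loss + w2 * temporal_smoothness`, so the optimizer prefers 3D motion that both matches the depth sketch and flows smoothly over time.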
Why Does This Matter?
Currently, doctors can only see what's directly in front of the camera. With NeRFscopy, they could:
- Stop the video and rotate the view to see a polyp (a small growth) from the "back" or "side" without moving the camera.
- Measure things accurately (like the size of a tumor) in 3D space.
- Plan surgeries by visualizing the anatomy in a virtual 3D space before cutting.
The Bottom Line
NeRFscopy is like a time-traveling 3D scanner that works on squishy, moving body parts using only a standard video camera. It takes a messy, 2D video of a beating heart or a twisting lung and reconstructs a clean, rotatable, 3D model that doctors can explore, helping them make better decisions for their patients.
While it's not quite fast enough to run in real-time on a phone yet (it takes a little time to process), it proves that we can finally turn "squishy video" into "solid 3D understanding."