FoundationPose-Initialized 3D-2D Liver Registration for Surgical Augmented Reality

Imagine you are a surgeon performing a delicate operation on a patient's liver. The problem? You are looking through a tiny camera (a laparoscope) that only sees the surface of the liver. You cannot see the tumors or blood vessels hidden underneath. It's like trying to find a specific pebble inside a giant, shifting sandcastle when you can only peek at its surface through a keyhole.

To solve this, surgeons use Augmented Reality (AR). They want to project a "ghost map" of the patient's pre-surgery CT scan onto the live video feed, so the hidden tumors light up like neon signs.

But here's the catch: The liver is a jelly. It squishes, stretches, and twists when the surgeon touches it, when air is pumped into the belly to make room, or when gravity pulls on it. If the "ghost map" doesn't move perfectly with the squishy liver, the surgeon might cut the wrong spot.

This paper presents a new, smarter way to make that ghost map stick to the real liver. Here is how they did it, broken down into simple steps:

1. The "GPS" Problem: Getting Started

Before you can fix a wobbly map, you need to know roughly where you are.

The Old Way: Previous systems tried to match the outline of the liver in the video to the 3D model. It's like trying to navigate a city using only the silhouette of the buildings against the sky. If the lighting changes or the building is partially hidden, you get lost.
The New Way (FoundationPose): The authors used a powerful AI called FoundationPose. Think of this AI as a super-smart GPS that doesn't just look at the building's outline; it also looks at depth (how far away things are).
- They fed the AI three things: the liver's contours (outlines and ridges), a mask (the filled-in region the liver occupies in the image), and a depth map (an estimate of how far each pixel is from the camera).
- The Analogy: Imagine trying to guess the shape of a hidden object. If you only see its shadow (outline), it's hard. But if you also know how far away the shadow is from the wall (depth), you can guess the 3D shape much better. This gave them a much more accurate starting point.

2. The "Jelly" Problem: Fixing the Squish

Once the GPS gets the liver roughly in the right spot, the map still has to follow the liver as it deforms (squishes) from the surgeon's touch, the air pumped into the belly, and gravity.

The Old Way: To fix the squish, older systems used Finite Element Analysis (FEA). This simulates the physics of the liver by splitting it into many tiny elements and calculating the forces on each one. It can be very accurate, but it is computationally heavy and needs patient-specific tissue properties such as stiffness. It's like trying to predict exactly how a specific piece of jelly will wiggle by doing complex physics homework.
The New Way (NICP + PCA): The authors took a shortcut.
- Step A (The Library of Shapes): They looked at hundreds of different liver shapes from other patients and used math (PCA) to find the "top 10 ways a liver usually squishes." Think of this as having a library of 10 pre-made "squishy liver costumes."
- Step B (The Magic Search): Instead of calculating physics, they used a smart search algorithm (CMA-ES) to try on these costumes. They asked: "Which combination of these 10 squishy costumes makes the ghost map look most like the real liver right now?"
- The Analogy: Instead of calculating the physics of a wiggling jelly, they just tried on different pre-made wiggly outfits until one fit perfectly. It's much faster and doesn't require a physics degree.

3. The Results: A Better Map

They tested this on real patients.

The Score: Their new system reduced the error to about 8.5 mm (less than a centimeter).
The Comparison: Previous methods were often off by 15 mm or more, or they were too slow to be useful in the operating room.
The Surprise: Interestingly, just adding the "depth" info to the starting GPS step made a huge difference. And while the "squishy outfit" method wasn't perfect for tracking deep internal tumors (it's better at the surface), it was good enough to be clinically useful and much faster than the old physics-heavy methods.

Why This Matters

This paper is like upgrading from a paper map that gets wet and rips to a smartphone GPS that updates in real-time.

It's lighter: It doesn't need a heavy physics simulation or patient-specific tissue measurements.
It's smarter: It uses depth information to avoid getting confused by shadows and lighting.
It's faster: It can update the map quickly enough to help a surgeon while they are working.

In short, they built a system that helps surgeons see the "invisible" parts of the liver by using AI to guess the 3D shape and a clever "squishy outfit" trick to keep the map aligned as the liver moves. It's a step toward making liver surgery safer and more precise.

1. Problem Statement

In laparoscopic liver surgery, accurate localization of tumors and major vessels is challenging because internal structures are not directly visible. While Augmented Reality (AR) can overlay preoperative CT data onto intraoperative endoscopic views to improve safety, achieving accurate registration is difficult due to:

Non-rigid Deformation: The liver deforms significantly due to pneumoperitoneum, gravity, and surgical contact.
Partial Visibility: Only a small portion of the liver surface is visible at any given time.
Depth Ambiguity: Monocular cameras lack inherent depth sensing, making 3D pose estimation from 2D contours ambiguous.
Complexity of Existing Methods: Current state-of-the-art non-rigid registration often relies on Finite Element Analysis (FEA), which requires heavy computation, patient-specific material parameters (stiffness, elasticity), and significant engineering expertise.

The paper aims to develop a registration pipeline that achieves clinically relevant accuracy (<10 mm error) while reducing modeling complexity and computational burden compared to FEA-based approaches.

2. Methodology

The proposed pipeline consists of two main stages: Rigid Initialization and Non-Rigid Refinement.

A. Rigid Initialization: FoundationPose with Depth Augmentation

Instead of relying solely on organ contours, the authors adapt FoundationPose, a foundation model for 6D object pose estimation, to the laparoscopic domain.

Input Data: The network (RefineNet architecture) takes three inputs:
1. Contour Maps: Skeletonized and augmented contours of liver ridges, ligaments, and silhouettes.
2. Liver Masks: Full segmentation masks (including occluded regions).
3. Monocular Depth Maps: Estimated using Depth Anything V2, excluding instrument occlusions.
Training Strategy: To bridge the domain gap between synthetic training data and real intraoperative images, the authors employ extensive data augmentation:
- Contours: Random dilation, pixel deletion, and skeletonization.
- Masks: Random erosion/dilation.
- Depth: Simulated instrument occlusions (zero-value blocks), scaling, and normalization.
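The depth augmentation listed above can be illustrated with a minimal NumPy sketch. The function name `augment_depth`, the block sizes, the scale range, and the order of operations (normalize, then scale, then occlude) are all assumptions made for illustration; the summary does not specify them.

```python
import numpy as np

def augment_depth(depth, rng, max_blocks=3, max_block_frac=0.3, scale_range=(0.8, 1.2)):
    """Hypothetical depth augmentation: normalize, rescale, add zero-valued blocks."""
    out = depth.astype(np.float64).copy()
    h, w = out.shape

    # Min-max normalize to [0, 1]; relative depth has no fixed metric scale.
    lo, hi = out.min(), out.max()
    out = (out - lo) / (hi - lo) if hi > lo else np.zeros_like(out)

    # Random global scaling (assumed range) applied after normalization.
    out *= rng.uniform(*scale_range)

    # Zero-valued rectangles imitate pixels removed where instruments occlude the liver.
    n_blocks = int(rng.integers(1, max_blocks + 1))
    for _ in range(n_blocks):
        bh = int(rng.integers(1, int(h * max_block_frac) + 1))
        bw = int(rng.integers(1, int(w * max_block_frac) + 1))
        top = int(rng.integers(0, h - bh + 1))
        left = int(rng.integers(0, w - bw + 1))
        out[top:top + bh, left:left + bw] = 0.0
    return out

# Usage on a synthetic 64x64 depth ramp.
rng = np.random.default_rng(0)
depth = np.linspace(1.0, 5.0, 64 * 64).reshape(64, 64)
augmented = augment_depth(depth, rng)
```

Zeroing blocks after normalization keeps the occluded pixels at exactly zero, which matches the "zero-value blocks" description.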
Loss Function: A surface-based Mean Squared Error (MSE) loss is used, comparing the transformed 3D mesh of the predicted pose against a reference pose, rather than separate rotation/translation losses.
Inference: An iterative refinement process is used where the predicted pose updates the input for the next iteration until convergence (max 10 iterations).
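A surface-based loss of the kind described above can be sketched in a few lines of NumPy. This is an illustrative version, not the authors' implementation: the helper names (`rotation_z`, `transform`, `surface_mse`) and the random stand-in mesh are assumptions.

```python
import numpy as np

def rotation_z(angle_rad):
    """Rotation matrix about the z-axis (used only to build example poses)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def transform(vertices, rotation, translation):
    """Apply a rigid pose to an (n, 3) array of mesh vertices."""
    return vertices @ rotation.T + translation

def surface_mse(vertices, r_pred, t_pred, r_ref, t_ref):
    """Mean squared distance between the mesh under the predicted and reference poses."""
    diff = transform(vertices, r_pred, t_pred) - transform(vertices, r_ref, t_ref)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# Usage with a random point cloud standing in for the liver mesh.
rng = np.random.default_rng(0)
mesh = rng.normal(size=(500, 3))
r_ref, t_ref = rotation_z(0.3), np.array([0.0, 0.0, 100.0])

same = surface_mse(mesh, r_ref, t_ref, r_ref, t_ref)
shifted = surface_mse(mesh, r_ref, t_ref + np.array([3.0, 0.0, 0.0]), r_ref, t_ref)
rotated = surface_mse(mesh, rotation_z(0.4), t_ref, r_ref, t_ref)
```

Because the error is measured on transformed vertices, rotation and translation mistakes are penalized in the same units (squared distance), so no hand-tuned weighting between the two is needed.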

B. Non-Rigid Registration: NICP + CMA-ES

To handle deformation without FEA, the authors propose a statistical shape model approach combined with a gradient-free optimizer.

Deformation Modeling:
- A dataset of 398 liver meshes is aligned to a reference patient mesh using Non-Rigid Iterative Closest Point (NICP).
- Principal Component Analysis (PCA) is applied to the displacement vectors to extract the first 10 principal deformation modes. This creates a low-dimensional subspace of plausible liver shapes.
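The PCA step can be sketched as follows, assuming the NICP alignment has already produced per-vertex displacement vectors for every mesh. The function names and the choice to express coefficients in units of each mode's standard deviation are assumptions for illustration; the random data merely stands in for real displacements.

```python
import numpy as np

def fit_deformation_modes(displacements, n_modes=10):
    """PCA over displacements of shape (n_shapes, n_vertices, 3)."""
    n_shapes = displacements.shape[0]
    flat = displacements.reshape(n_shapes, -1)
    mean = flat.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(flat - mean, full_matrices=False)
    modes = vt[:n_modes]
    std = singular_values[:n_modes] / np.sqrt(n_shapes - 1)
    return mean, modes, std

def deform(reference, mean, modes, std, coeffs):
    """Reference mesh plus mean displacement plus a weighted sum of modes."""
    offset = mean + (coeffs * std) @ modes
    return reference + offset.reshape(-1, 3)

# Usage with synthetic displacements: 398 shapes, 50 vertices each.
rng = np.random.default_rng(0)
reference = rng.normal(size=(50, 3))
displacements = rng.normal(scale=0.1, size=(398, 50, 3))
mean, modes, std = fit_deformation_modes(displacements)
neutral = deform(reference, mean, modes, std, np.zeros(10))
bent = deform(reference, mean, modes, std, np.full(10, 0.5))
```

With ten coefficients, the optimizer searches a 10-dimensional shape space instead of moving every vertex independently.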
Optimization Algorithm:
- Since the objective function (Hausdorff distance between rendered contours and the input label masks) is non-differentiable, gradient-based optimizers (like SGD) cannot be applied directly.
- The authors employ Covariance Matrix Adaptation Evolution Strategy (CMA-ES), a stochastic, derivative-free algorithm.
- Objective: Minimize a weighted Hausdorff distance between the rendered model contours (based on current pose and shape coefficients) and the input label masks.
- Constraints: Shape coefficients are bounded within $[-1, 1]$ to prevent unrealistic deformations; pose parameters are constrained near the initialization.
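The optimization loop can be illustrated with a toy example. Note the substitutions: the sketch uses an unweighted symmetric Hausdorff distance, a 2D contour with two made-up modes instead of a rendered liver mesh, and a simplified elite-averaging evolution strategy as a stand-in for CMA-ES (it does not adapt a covariance matrix). All names and settings are hypothetical.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between point sets a (n, 2) and b (m, 2)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def search(objective, dim, rng, iters=60, pop=16, elite=4, sigma=0.5):
    """Simplified derivative-free evolution strategy with box bounds [-1, 1]."""
    mean = np.zeros(dim)
    best_x, best_f = mean.copy(), objective(mean)
    for _ in range(iters):
        # Sample candidates around the current mean and clip to the bounds.
        cand = np.clip(mean + sigma * rng.standard_normal((pop, dim)), -1.0, 1.0)
        scores = np.array([objective(c) for c in cand])
        order = np.argsort(scores)
        if scores[order[0]] < best_f:
            best_x, best_f = cand[order[0]].copy(), scores[order[0]]
        # Move the mean to the average of the best candidates; shrink the step.
        mean = cand[order[:elite]].mean(axis=0)
        sigma *= 0.95
    return best_x, float(best_f)

# Toy contour: a circle deformed by two hypothetical modes.
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
base = 10.0 * np.stack([np.cos(theta), np.sin(theta)], axis=1)
modes = np.stack([
    3.0 * np.stack([np.cos(theta), np.zeros_like(theta)], axis=1),
    3.0 * np.stack([np.zeros_like(theta), np.sin(theta)], axis=1),
])

def render(coeffs):
    """Contour points for the given shape coefficients."""
    return base + np.tensordot(coeffs, modes, axes=1)

target = render(np.array([0.6, -0.3]))
objective = lambda c: hausdorff(render(c), target)
initial_f = float(objective(np.zeros(2)))
best_coeffs, best_f = search(objective, dim=2, rng=np.random.default_rng(0))
```

The optimizer only ever evaluates the objective, never its gradient, which is why a max-min distance such as Hausdorff poses no problem here.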

3. Key Contributions

Depth-Augmented FoundationPose: Demonstrated that integrating monocular depth maps (estimated via Depth Anything V2) with contour/mask inputs significantly improves rigid initialization accuracy compared to contour-only methods in laparoscopic scenes.
FEA-Free Non-Rigid Pipeline: Proposed a novel pipeline replacing complex Finite Element Analysis with a NICP-based statistical shape model optimized via CMA-ES. This eliminates the need for patient-specific biomechanical parameters and reduces computational complexity.
Robust Data Augmentation: Developed a comprehensive augmentation strategy (including depth occlusion simulation and contour skeletonization) to train foundation models on synthetic data for real-world surgical application.

4. Experimental Results

The method was evaluated on real patient data from the Rabbani et al. dataset (4 patients, 8–21 frames each).

Performance Metrics:
- Mean Target Registration Error (TRE): The full pipeline (FoundationPose + Non-Rigid Optimization) achieved a mean error of 8.52 mm (excluding Patient 2).
- Comparison: This outperformed rigid-only registration (11.43 mm) and previous state-of-the-art methods like LMR [5] (17.33 mm) and NM [23] (15.87 mm).
- Depth Impact: The FoundationPose model with depth achieved 9.91 mm mean error, significantly better than the contour-only version (15.86 mm).
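To put these numbers in proportion, the relative error reductions can be computed directly from the figures quoted above. The dictionary keys and the helper `relative_reduction` are just labels chosen for this sketch.

```python
def relative_reduction(baseline_mm, new_mm):
    """Percentage by which new_mm improves on baseline_mm."""
    return 100.0 * (baseline_mm - new_mm) / baseline_mm

FULL_PIPELINE_MM = 8.52

# Mean errors in mm, as reported in the list above.
baselines_mm = {
    "rigid-only": 11.43,
    "LMR [5]": 17.33,
    "NM [23]": 15.87,
}

reductions = {
    name: relative_reduction(err, FULL_PIPELINE_MM)
    for name, err in baselines_mm.items()
}

# Effect of adding depth to the rigid initialization alone.
depth_gain = relative_reduction(15.86, 9.91)

for name, pct in reductions.items():
    print(f"full pipeline vs {name}: {pct:.1f}% lower mean error")
print(f"depth vs contour-only initialization: {depth_gain:.1f}% lower mean error")
```

This works out to roughly 25% lower error than rigid-only registration, about 51% and 46% lower than LMR and NM, and about 38% lower for depth-augmented versus contour-only initialization.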
Patient 2 Anomaly: Patient 2 was excluded from the final average because the liver underwent extreme torsion during surgery, making registration nearly impossible for all methods (even manual registration yielded ~35 mm error).
Efficiency:
- Rigid initialization converges in a few seconds.
- Non-rigid refinement takes ~30–60 seconds per frame.
- Once the first frame is registered, subsequent frames can be updated in real-time using camera tracking, re-invoking non-rigid optimization only when significant deformation is detected.

5. Significance and Conclusion

Clinical Relevance: The pipeline achieves sub-10 mm accuracy, which is considered clinically relevant for surgical guidance, while offering a lightweight alternative to heavy FEA simulations.
Engineering Simplicity: By replacing FEA with NICP and CMA-ES, the method lowers the barrier to entry for surgical AR systems, removing the need for complex material property estimation.
Depth Utility: The study validates that even relative depth maps from monocular estimation networks provide crucial geometric cues that resolve ambiguities inherent in contour-only registration.
Limitations & Future Work:
- The method relies on surface alignment; it does not perfectly track internal tumor shifts if the surface deformation does not correlate with internal changes.
- Evaluation is limited by the scarcity of public datasets with tumor-level ground truth. Future work will involve phantom-based experiments for controlled validation.

In summary, this paper presents a hybrid approach that leverages modern foundation models for robust initialization and statistical shape modeling for efficient deformation correction, offering a practical and accurate solution for AR-guided liver surgery.
