4D Monocular Surgical Reconstruction under Arbitrary Camera Motions

This paper proposes Local-EndoGS, a novel 4D reconstruction framework for monocular surgical scenes under arbitrary camera motions. It combines a progressive window-based representation with a coarse-to-fine optimization strategy to achieve high-quality, scalable reconstruction without relying on stereo depth priors or an accurate structure-from-motion initialization.

Jiwei Shan, Zeyu Cai, Cheng-Tai Hsieh, Yirui Li, Hao Liu, Lijun Han, Hesheng Wang, Shing Shin Cheng

Published 2026-02-20

Imagine you are trying to build a perfect 3D model of a squishy, moving piece of fruit (like a grape) while someone is juggling it, spinning it, and poking it with a stick. Now, imagine you can only watch this happen through a tiny, single-lens camera that is itself darting around the fruit.

That is essentially the challenge of 4D Surgical Reconstruction. Surgeons use endoscopes (tiny cameras) to look inside the body. The body is full of soft tissues that breathe, pulse, and get pushed around by tools. The camera moves wildly. The goal is to turn that shaky, 2D video into a stable, high-quality 3D movie that doctors can use for training or planning surgery.

For a long time, computers struggled with this. If the camera moved too much, the 3D model would fall apart, looking like a melted wax figure.

Enter Local-EndoGS, a new method designed to solve this problem. Here is how it works, explained through simple analogies:

1. The Problem: The "One-Size-Fits-All" Trap

Previous methods tried to build the entire surgery scene using one giant, static blueprint (called a "canonical space"). They assumed the camera stayed mostly still.

  • The Analogy: Imagine trying to describe a whole movie using a single photograph. If the camera zooms in, pans left, or moves forward, that single photo can't possibly capture the new details or the changing perspective. The result is a blurry, broken mess.
  • The Reality: When the endoscope moves around inside the body, the "single blueprint" approach fails because the scene changes too drastically for one model to handle.

2. The Solution: The "Rolling Window" Approach

Local-EndoGS changes the strategy. Instead of trying to build the whole movie at once, it breaks the video into small, manageable chunks.

  • The Analogy: Think of a scrolling marquee or a film strip. Instead of looking at the whole reel of film, the computer looks at just 5 seconds at a time. It builds a perfect 3D model for that specific 5-second clip. Then, it slides the window forward, builds the next clip, and so on.
  • Why it works: By focusing on small windows where the camera doesn't move too wildly, the computer can create a highly accurate 3D model for that specific moment. It stitches these high-quality "snapshots" together to form the full 4D reconstruction.
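The rolling-window idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `reconstruct_window` is a hypothetical stand-in for the per-clip optimizer, and the window/overlap sizes are made-up toy values.

```python
# Split a frame sequence into short overlapping windows, "reconstruct"
# each one independently, then keep the per-window results as a timeline.

def make_windows(num_frames, window_size=20, overlap=5):
    """Return (start, end) frame ranges that tile the video with overlap."""
    step = window_size - overlap
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window_size, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += step
    return windows

def reconstruct_window(frames):
    # Placeholder: a real system would fit a local 3D model to this clip.
    return {"frames": frames, "model": f"local-model-{frames[0]}-{frames[-1]}"}

windows = make_windows(num_frames=50, window_size=20, overlap=5)
clips = [reconstruct_window(list(range(s, e))) for s, e in windows]
print(windows)  # [(0, 20), (15, 35), (30, 50)]
```

The overlap between consecutive windows is what lets the "snapshots" be stitched together: the shared frames give each new local model an anchor to the previous one.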

3. The "Coarse-to-Fine" Start-Up

Starting a 3D model from a single camera view is hard because the computer doesn't know how far away things are (it's like looking at a flat painting and not knowing if the tree is 1 meter away or 100 meters away).

  • The Analogy: Imagine trying to build a sandcastle without a bucket. You start by dumping a huge pile of sand (Coarse) to get the general shape. Then, you use a small trowel to carve out the details and fix the edges (Fine).
  • How they do it:
    1. Coarse: They use a smart AI (called Track-Any-Point) to follow pixels across the video frames, creating a rough, 3D "skeleton" of the tissue.
    2. Fine: They look at where the model looks wrong (like a blurry edge) and use a depth-sensing AI to fix just those specific spots, refining the shape until it's perfect.
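The two steps above can be sketched as a tiny toy pipeline. Everything here is illustrative: the tracker and depth model are stand-ins (`coarse_init`, `fine_refine`, and the `depth_fix` callback are hypothetical names, not the paper's API), and the numbers are made up.

```python
# Coarse: lift tracked 2D points to rough 3D using a single guessed depth.
# Fine: re-estimate depth only where the rendered model "looks wrong",
# i.e. where the per-point error exceeds a threshold.

def coarse_init(tracks, rough_depth):
    """Lift each tracked 2D point (u, v) to a rough 3D point (u, v, d)."""
    return [(u, v, rough_depth) for (u, v) in tracks]

def fine_refine(points, errors, depth_fix, threshold=0.5):
    """Replace the depth of high-error points using a depth estimator."""
    refined = []
    for (u, v, d), err in zip(points, errors):
        refined.append((u, v, depth_fix(u, v) if err > threshold else d))
    return refined

tracks = [(10, 12), (40, 8), (25, 30)]       # 2D pixel tracks (toy values)
pts = coarse_init(tracks, rough_depth=1.0)
errors = [0.1, 0.9, 0.2]                     # per-point rendering error
pts = fine_refine(pts, errors, depth_fix=lambda u, v: 1.5)
print(pts)  # only the high-error point's depth changes
```

The key design choice is that the expensive depth estimator is only consulted for the "blurry edges", keeping the refinement cheap and targeted.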

4. The "Physics Police"

Even with a good start, the computer might make the tissue move in impossible ways (like a jellyfish turning inside out).

  • The Analogy: Imagine a puppet show. If the puppeteer pulls the strings too hard, the puppet's arm might snap backward. Local-EndoGS acts like a strict physics teacher. It tells the computer: "Hey, soft tissue stretches, but it doesn't teleport or twist into a knot. Keep it realistic."
  • The Result: The computer adds "rules" (priors) to ensure the tissue moves naturally, preserving the shape and structure of the organs.
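One common way to encode such a "no teleporting, no knots" rule is a local-rigidity penalty: neighboring points may move, but the distances between them should change slowly. The snippet below is a toy version of that idea, not the paper's exact loss.

```python
# An as-rigid-as-possible-style regularizer: penalize the squared change
# in distance between neighboring points from one frame to the next.
import math

def rigidity_loss(points_t0, points_t1, neighbor_pairs):
    """Sum of squared changes in neighbor distances between two frames."""
    loss = 0.0
    for i, j in neighbor_pairs:
        d0 = math.dist(points_t0[i], points_t0[j])
        d1 = math.dist(points_t1[i], points_t1[j])
        loss += (d1 - d0) ** 2
    return loss

# Two frames: the second point drifts, stretching its edge to the first.
frame0 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
frame1 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
pairs = [(0, 1), (0, 2)]
print(rigidity_loss(frame0, frame1, pairs))  # 1.0: only edge (0,1) stretched
```

Added to the optimization objective, a term like this pushes the reconstruction toward motions a soft but connected tissue could actually perform.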

Why This Matters

  • For Surgeons: It creates a "virtual twin" of a patient's anatomy. Surgeons can practice on this 3D model before touching the real patient, reducing risks.
  • For Training: Medical students can watch a high-quality 3D replay of a surgery, seeing exactly how the tissue deforms under different angles, rather than just watching a flat 2D screen.
  • The Big Win: Unlike previous methods that needed two cameras (stereo) or a perfectly still camera, this works with one moving camera, which is exactly how real surgeries happen.

In summary: Local-EndoGS is like a smart film editor that cuts a chaotic surgery video into tiny, manageable scenes, builds a perfect 3D model for each scene using smart guessing and physics rules, and then stitches them together to create a realistic, moving 3D map of the human body.
