ReMeDI: Refined Memory for Disambiguation of Identities with SAM3 in Surgical Segmentation

Imagine you are watching a very complex, fast-paced surgery video. The camera is inside a patient's body, and the view is often blocked by blood, tissue, or other tools. The goal is for a computer to keep track of every single surgical tool, knowing exactly which one is which, even when they disappear behind an organ and then pop back out.

This is the problem the paper ReMeDI-SAM3 tries to solve.

Here is the story of the paper, told through simple analogies:

The Problem: The Computer Gets "Amnesia"

The researchers started with a powerful AI tool called SAM3. Think of SAM3 as a very smart, but slightly naive, security guard watching a busy hallway.

The Issue: When a tool (like a pair of forceps) gets hidden behind a piece of tissue (occlusion) and then reappears, the security guard gets confused.
The Mistake: Because the guard is trying to remember everything it sees, it sometimes remembers the "bad" blurry frames where the tool was half-hidden. When a tool comes into view again, the guard might say, "Oh, that's the yellow tool I saw earlier," even though it's actually a blue tool that just entered the hallway. This is "identity drift": the guard loses track of who is who.
The Memory Limit: Also, the guard has a tiny notepad. If the surgery is long, the notepad fills up, and the guard has to erase the early notes to write new ones. If the tool was hidden for a long time, the guard might have erased the only clue that would help identify it when it returns.

The Solution: ReMeDI-SAM3

The authors built a "Refined Memory" system (ReMeDI) to upgrade this security guard. They didn't retrain the AI from scratch (which would be like hiring a whole new guard); instead, they gave the existing guard three new super-tools.

1. The "Two-Drawer" Filing System (Dual Memory)

Instead of one messy notepad, the new system uses two specific drawers:

The "High-Confidence" Drawer: This drawer only accepts clear, sharp photos of the tools. If the image is blurry or the tool is half-hidden, it doesn't go here. This keeps the main memory clean and prevents the guard from getting confused by bad data.
The "Emergency Backup" Drawer: This is the clever part. The system keeps the "last known good" photos taken just before a tool was hidden and, when the tool reappears, puts them in this special drawer, even if their quality isn't perfect.
- Analogy: Imagine you are taking a photo of a friend. Just as they are about to walk behind a wall, you quickly snap a backup photo of their back. When they pop out the other side, you check that backup photo to make sure it's still your friend and not a stranger.

2. The "Identity Detective" (Re-Identification)

When a tool pops out from behind an obstacle, the system doesn't just guess. It acts like a detective.

It looks at the tool's "face" (its visual features) and compares it to a database of all the tools it has seen before.
It uses a voting system: It checks the tool over several consecutive frames rather than trusting a single glance. If, for example, the tool looks 80% like the "Blue Forceps" and only 20% like the "Yellow Forceps," the system votes to confirm it is the Blue one.
This stops the guard from mixing up two different tools that look similar.

3. The "Expandable Notepad" (Memory Expansion)

Surgery videos can be very long. The original AI had a fixed memory size (like a notepad with only 7 pages). If the surgery lasted longer, the AI would forget the beginning.

The authors invented a way to stretch the notepad. They didn't just add random blank pages; they used a smart mathematical trick (piecewise interpolation) to fill in the gaps between the existing pages.
Analogy: Imagine you have a timeline of 7 dots. Instead of tacking extra dots onto the end, they keep the first and last dots exactly where they are and place new dots evenly between the existing ones, so the same timeline can hold 15 or 20 dots. This allows the AI to remember tools from much earlier in the surgery.

The Results: A Super-Guard

When they tested this new system on real surgical videos (EndoVis and CholecSeg8k datasets):

It was more accurate: Compared with the original AI, its accuracy score (mean class IoU) rose by about 6% on EndoVis17, 8% on EndoVis18, and 2% on CholecSeg8k.
It handled confusion better: In cases where a tool disappeared and a different tool appeared, the new system correctly identified the new tool, whereas the old system kept calling it the old tool.
It worked without extra training: The best part is that they didn't need to teach the AI new things with thousands of hours of data. They just gave it better rules for how to use its memory.

Summary

ReMeDI-SAM3 is like upgrading a forgetful security guard into a sharp, organized detective. By separating "good" memories from "emergency" memories, using a voting system to verify identities, and stretching its memory to remember longer stories, it makes the computer far less likely to lose track of the tools in the chaotic world of surgery. This can help surgeons and robots work together more safely and effectively.

1. Problem Statement

Accurate segmentation of surgical instruments in endoscopic videos is critical for computer-assisted interventions (e.g., tracking, workflow analysis). However, this task faces significant challenges:

Frequent Occlusions and Re-entry: Instruments are often hidden by tissue or other tools and re-enter the field of view later.
Identity Drift: Standard Video Object Segmentation (VOS) models often fail to distinguish between a re-entering instrument and a new one, or they lose the identity of an instrument after prolonged occlusion.
Limitations of SAM3: While SAM3 (Segment Anything Model 3) provides a strong spatio-temporal framework, it suffers from:
- Indiscriminate Memory Updates: It writes low-quality predictions (e.g., during occlusion) into memory, causing error accumulation.
- Fixed Memory Capacity: It uses a fixed set of temporal positional encodings (size 7), limiting long-term context retention in long surgical procedures.
- Weak Identity Recovery: It struggles to recover the correct identity of an instrument after it reappears, often confusing it with previously occluded objects.

2. Methodology: ReMeDI-SAM3

The authors propose ReMeDI-SAM3, a training-free extension of SAM3 designed specifically for surgical videos. It introduces three core components to address the limitations above:

A. Dual-Partitioned Memory Design

Instead of a single memory bank, ReMeDI splits the memory (size $M$) into two distinct partitions ($M/2$ each):

Relevance-Aware Memory:
- Function: Stores high-confidence frames for stable, long-term tracking.
- Mechanism: Only frames with a reliability score ($r_t = \text{objectness} \times \text{confidence}$) exceeding a strict threshold ($\tau_{rel}$) are stored. This prevents noisy, low-quality predictions from contaminating the memory.
Occlusion-Aware Memory:
- Function: Preserves critical identity cues before an occlusion occurs to aid recovery.
- Mechanism: An "Unconditional Buffer" stores all past frames. When an occlusion recovery event is detected (the object reappears), this partition is populated with pre-occlusion frames using a relaxed threshold ($\tau_{occ} < \tau_{rel}$). This ensures that even slightly degraded frames containing identity information are retained for recovery.
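
The two partitions described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DualMemory`, the threshold values, the first-in-first-out eviction, and the way a recovery event is signalled (`on_recovery` with the frame index where the occlusion began) are all hypothetical choices made for the example.

```python
from collections import deque


class DualMemory:
    """Hypothetical sketch: a memory of size M split into two M/2 partitions."""

    def __init__(self, size=8, tau_rel=0.8, tau_occ=0.5):
        if not tau_occ < tau_rel:
            raise ValueError("the relaxed threshold must satisfy tau_occ < tau_rel")
        half = size // 2
        self.tau_rel = tau_rel                # strict threshold (relevance-aware)
        self.tau_occ = tau_occ                # relaxed threshold (occlusion-aware)
        self.relevance = deque(maxlen=half)   # high-confidence frames only
        self.occlusion = deque(maxlen=half)   # pre-occlusion identity cues
        self.buffer = []                      # unconditional buffer: (frame_id, r_t)

    def observe(self, frame_id, objectness, confidence):
        """Score a frame and admit it to the relevance partition if reliable."""
        r_t = objectness * confidence
        self.buffer.append((frame_id, r_t))   # every frame goes into the buffer
        if r_t > self.tau_rel:
            self.relevance.append(frame_id)   # oldest entry is evicted when full
        return r_t

    def on_recovery(self, occlusion_start):
        """On reappearance, refill the occlusion partition from the buffer."""
        candidates = [
            frame_id
            for frame_id, r_t in self.buffer
            if frame_id < occlusion_start and r_t > self.tau_occ
        ]
        self.occlusion.clear()
        # deque(maxlen=...) keeps only the most recent qualifying frames.
        self.occlusion.extend(candidates)


mem = DualMemory(size=4, tau_rel=0.8, tau_occ=0.5)
for frame_id, objectness, confidence in [
    (0, 0.95, 0.95), (1, 0.9, 0.8), (2, 0.9, 0.7), (3, 0.3, 0.5), (4, 0.2, 0.4),
]:
    mem.observe(frame_id, objectness, confidence)
mem.on_recovery(occlusion_start=3)
print(list(mem.relevance), list(mem.occlusion))
```

In this toy run only frame 0 clears the strict threshold, while frames 1 and 2, which are slightly degraded but still informative, are recovered into the occlusion-aware partition once the object reappears.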

B. Memory Capacity Expansion via Piecewise Interpolation

SAM3 is limited by fixed temporal positional encodings (only 7 slots), which causes early informative frames to be overwritten in long videos.

Strategy: The authors propose a novel expansion scheme that increases the memory size ($M > 7$) without retraining.
Technique: They use piecewise interpolation of the temporal positional encodings.
- Boundary encodings (start and end of the sequence) are kept fixed to preserve strong temporal priors.
- The interior encodings are linearly resampled to fill the new slots.
- This allows the model to retain a larger temporal context while maintaining the semantic integrity of the temporal boundaries.
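
One way to picture the expansion is the sketch below. It assumes that one encoding at each end counts as a "boundary" (the `n_keep` parameter) and that the interior is resampled at evenly spaced positions; the summary does not specify either detail, so both are assumptions, and the function names are hypothetical. Encodings are represented as plain lists of floats to keep the example dependency-free.

```python
def lerp(a, b, w):
    """Linear interpolation between two equal-length vectors."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def resample_linear(vectors, n_out):
    """Linearly resample a sequence of vectors to n_out evenly spaced positions."""
    n_in = len(vectors)
    if n_in == 1 or n_out == 1:
        return [list(vectors[0]) for _ in range(n_out)]
    out = []
    for j in range(n_out):
        pos = j * (n_in - 1) / (n_out - 1)   # fractional index into the input
        lo = min(int(pos), n_in - 2)         # left neighbour, clamped at the end
        out.append(lerp(vectors[lo], vectors[lo + 1], pos - lo))
    return out


def expand_encodings(enc, new_size, n_keep=1):
    """Keep n_keep encodings fixed at each end and resample the interior.

    Assumption: n_keep=1, i.e. only the first and last encodings are boundaries.
    """
    if len(enc) <= 2 * n_keep:
        raise ValueError("need at least one interior encoding")
    if new_size < len(enc):
        raise ValueError("new_size must not shrink the memory")
    head = [list(v) for v in enc[:n_keep]]
    tail = [list(v) for v in enc[-n_keep:]]
    interior = resample_linear(enc[n_keep:-n_keep], new_size - 2 * n_keep)
    return head + interior + tail


# Toy 1-D "encodings" for the 7 original slots, expanded to 15 slots.
original = [[float(i)] for i in range(7)]
expanded = expand_encodings(original, 15)
print(len(expanded), expanded[0], expanded[-1])
```

The first and last encodings come through unchanged, while the five interior encodings are stretched over thirteen slots.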

C. Feature-Based Re-Identification (ReID) with Temporal Voting

To resolve identity ambiguity when an instrument reappears after occlusion:

Feature Bank: A multi-scale appearance descriptor bank is maintained for each instrument class, constructed from high-reliability frames.
Verification Process: Upon recovery, the system computes cosine similarity between the current frame's features and the feature banks of all classes over a temporal window ($K$ frames).
Decision: If the similarity to the bank of the currently assigned class (self-similarity) is higher than the similarity to every other class's bank (cross-class similarity), the identity is accepted. Otherwise, the label is reassigned to the class with the highest similarity. This prevents the model from persisting with the wrong identity after a long occlusion.
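
The verification step can be illustrated with a small sketch. The aggregation rules here are assumptions rather than details from the paper: each frame votes for the class whose bank contains its most similar descriptor, the votes are tallied over the $K$-frame window, and ties favour the current label. The function names are hypothetical, and descriptors are plain lists of floats rather than multi-scale features.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_class(feature, banks):
    """Class whose feature bank holds the descriptor most similar to `feature`."""
    scores = {
        cls: max(cosine(feature, ref) for ref in refs)
        for cls, refs in banks.items()
    }
    return max(scores, key=scores.get)


def verify_identity(current_label, window_features, banks):
    """Vote over a window of K frames; reassign only if another class wins."""
    votes = Counter(best_class(f, banks) for f in window_features)
    winner, count = votes.most_common(1)[0]
    if votes[current_label] >= count:   # self-similarity wins or ties
        return current_label
    return winner


banks = {"blue": [[1.0, 0.0]], "yellow": [[0.0, 1.0]]}
window = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9]]   # K = 3 frames after recovery
print(verify_identity("yellow", window, banks))
```

With two of the three frames closer to the blue bank, a track currently labelled "yellow" is reassigned to "blue", while a track already labelled "blue" would keep its label.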

3. Key Contributions

Dual-Memory Architecture: A novel design combining relevance-aware propagation (for stability) and occlusion-aware storage (for recovery), explicitly targeting the trade-off between temporal stability and identity preservation.
Training-Free Identity Correction: A feature-based ReID module with temporal voting that verifies and corrects identities post-occlusion without requiring model fine-tuning.
Scalable Memory Strategy: A piecewise interpolation method that expands effective memory capacity for long-horizon surgical videos without retraining the backbone.
State-of-the-Art Zero-Shot Performance: The approach achieves superior results in a zero-shot setting, outperforming both vanilla SAM3 and prior training-based surgical segmentation methods.

4. Experimental Results

The method was evaluated on three benchmarks: EndoVis17, EndoVis18, and CholecSeg8k.

Quantitative Improvements (Zero-Shot):
- EndoVis17: +5.8% mean Class IoU (mcIoU) over vanilla SAM3.
- EndoVis18: +8.0% mcIoU over vanilla SAM3.
- CholecSeg8k: +2.0% mcIoU over vanilla SAM3.
Comparison: ReMeDI-SAM3 outperformed specialized training-based models (e.g., SurgicalSAM, SP-SAM) and other SAM-based zero-shot approaches (e.g., PerSAM, TrackAnything).
Ablation Studies:
- Removing the ReID module caused a significant drop in performance, confirming its necessity for identity disambiguation.
- Piecewise interpolation outperformed uniform interpolation, indicating that preserving the boundary temporal priors matters.
- Increasing memory size to 15 frames improved performance, but further expansion yielded diminishing returns.

Qualitative Evidence: In cases where a yellow instrument was occluded and a blue instrument then entered the scene, vanilla SAM3 mistook the new blue instrument for the old yellow one. ReMeDI-SAM3 correctly identified the new instrument and suppressed the false positive, demonstrating robust identity recovery.

5. Significance

ReMeDI-SAM3 advances surgical video analysis by addressing the "long-term identity" problem in a training-free manner.

Clinical Relevance: By tracking instruments more reliably through occlusions and re-entries, it can improve the safety and reliability of computer-assisted surgery systems.
Efficiency: It eliminates the need for expensive, data-hungry retraining of foundation models for specific surgical domains, making advanced segmentation accessible with minimal computational overhead.
Generalizability: The dual-memory and interpolation strategies offer a blueprint for improving long-term video object segmentation in other domains characterized by frequent occlusions and long sequences.