RPG-SAM: Reliability-Weighted Prototypes and Geometric Adaptive Threshold Selection for Training-Free One-Shot Polyp Segmentation

Here is an explanation of the RPG-SAM paper, translated into simple, everyday language with creative analogies.

🩺 The Big Picture: Finding Polyps Without a PhD

Imagine you are a doctor trying to find polyps (small growths that can turn into cancer) inside a patient's colon using a camera. Usually, to teach a computer to do this, you need to show it thousands of examples where humans have carefully drawn lines around every single polyp. This takes forever and is expensive.

RPG-SAM is a new "smart assistant" that doesn't need thousands of examples. It only needs one picture of a polyp (the "Support Image") to find similar polyps in a new video or photo (the "Query Image"). It's like showing a detective one photo of a suspect and saying, "Find this person in this crowd," without needing a database of millions of faces.

However, the old way of doing this had three big problems. RPG-SAM fixes them like a master mechanic tuning a car.

🚧 The Three Problems (And How RPG-SAM Fixes Them)

1. The "Bad Photo" Problem (Regional Heterogeneity)

The Issue: Imagine you show the detective a photo of a suspect, but the photo is blurry, has a glare from a flash, or is covered in mud. If the detective tries to match every pixel in that bad photo to the crowd, they will get confused and point at innocent people.

The Old Way: The computer treated every part of the reference photo as equally important, even the blurry or shiny parts.
The RPG-SAM Fix (Reliability-Weighted Prototype Mining):
- Analogy: Think of this as a "Trust Score" system. RPG-SAM looks at the reference photo and asks, "Is this part of the image clear and useful?"
- It gives a high "Trust Score" to the clear parts of the polyp and a low score to the blurry or shiny parts.
- The Secret Weapon: It also looks at the background (the healthy colon tissue) and uses it as a "Negative Anchor." It's like telling the detective, "Also, make sure you don't pick people who look like the background wall." This helps filter out false alarms.

2. The "One-Size-Fits-All" Problem (Intensity Heterogeneity)

The Issue: In some photos, the polyp is bright red; in others, it's dark purple. In some, the lighting is harsh; in others, it's dim. The old computers used a fixed rule (e.g., "If the pixel is brighter than 50%, it's a polyp").

The Problem: A rule that works for a bright photo fails miserably in a dark one. It's like trying to use the same volume setting on a radio whether you are in a quiet library or a loud rock concert.
The RPG-SAM Fix (Geometric Adaptive Selection):
- Analogy: Instead of a fixed rule, RPG-SAM acts like a smart shape-shifter.
- It tries out many different "volume settings" (thresholds) to see which one creates a shape that looks most like a real polyp.
- It checks: "Does this shape look round and solid? Or is it just a jagged speck of noise?" It picks the setting that creates the most "polyp-like" shape, adapting to the specific lighting of the new image.

3. The "Rough Draft" Problem (Iterative Refinement)

The Issue: Even with the best guess, the computer's first outline of the polyp might be a little jagged or miss a tiny corner.

The Old Way: The computer would just accept the rough draft.
The RPG-SAM Fix (Prior-guided Iterative Refinement):
- Analogy: Think of this as an editor polishing a manuscript.
- RPG-SAM takes its first guess and runs it through a loop. It asks, "Did I miss any parts of the polyp? Did I accidentally include too much background?"
- If it missed a spot, it adds a "positive prompt" (a nudge to include more). If it included too much background, it adds a "negative prompt" (a nudge to cut it out).
- It repeats this process a few times, stopping once the outline is good enough or it runs out of tries, and keeps the best version it produced along the way.

🏆 Why Does This Matter?

The researchers tested RPG-SAM on four public datasets, including a widely used one called Kvasir.

The Result: On Kvasir, it improved mIoU (a standard score for how well the predicted outline overlaps the true one) by 5.56% over the previous best method, ProtoSAM.
The Real-World Impact: In medical terms, a gain of that size should mean fewer missed polyps and fewer false alarms. It suggests the tool can be useful even when only one labeled example is available, which could make early cancer detection faster and cheaper.

📝 Summary in One Sentence

RPG-SAM is a smart, training-free tool that finds colon polyps by ignoring bad parts of reference photos, adapting to different lighting conditions like a chameleon, and repeatedly polishing its own work until the result is good enough, all without needing to be retrained on new data.

Here is a detailed technical summary of the paper "RPG-SAM: Reliability-Weighted Prototypes and Geometric Adaptive Threshold Selection for Training-Free One-Shot Polyp Segmentation."

1. Problem Statement

The paper addresses the limitations of training-free one-shot segmentation in medical imaging, specifically for polyp detection in colonoscopy. While existing methods leverage foundation models like SAM (Segment Anything Model) to transfer knowledge from a single labeled support image to query images without retraining, they suffer from a "uniformity bias" across three critical dimensions:

Regional Heterogeneity in Support Images: Existing methods treat all pixels in the support foreground equally. However, endoscopic images often contain degraded regions (e.g., reflections, mucus) that introduce misleading features, leading to false positives.
Contextual Heterogeneity (Foreground vs. Background): Many approaches neglect the support background as a distinct information layer. Failing to use background features as negative anchors makes it difficult to distinguish polyps from visually similar intestinal folds.
Intensity Heterogeneity in Query Responses: Current pipelines rely on fixed thresholds to convert similarity heatmaps into binary masks. However, response intensities vary stochastically across different query scenarios and clinical conditions, making rigid sampling rules inadequate for maintaining both fidelity and diversity.

2. Methodology: RPG-SAM Framework

RPG-SAM is a SAM2-based framework designed to tackle these heterogeneity gaps through three core components:

A. Reliability-Weighted Prototype Mining (RWPM)

Goal: Address regional and contextual heterogeneity by filtering noise and prioritizing high-fidelity features.

Feature Extraction: Uses DINOv2 to extract deep features from support ( $x_s$ ) and query ( $x_q$ ) images.
Superpixel Clustering: Applies SLIC to the support image to generate $K$ superpixel clusters, creating foreground prototypes ( $P_{fg}$ ) and background prototypes ( $P_{bg}$ ).
Reliability Scoring: Each foreground prototype is weighted based on two metrics:
1. Contrast Factor ( $C_k$ ): Measures the discriminative power of a prototype within the support image (distinguishing it from background).
2. Reverse Purity Factor ( $R_k$ ): Evaluates cross-image matching stability by checking if the prototype's top matches in the query map project back to the correct foreground region.
Noise Suppression: The initial heatmap ( $H_{init}$ ) is generated by aggregating weighted foreground similarities while explicitly subtracting background similarities (using $P_{bg}$ as negative anchors) to suppress false positives.
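
The aggregation step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the reliability factors $C_k$ and $R_k$ are taken as given inputs, they are combined by a simple product and normalized, similarity is cosine similarity, the background term is the maximum over background prototypes, and `beta` is a hypothetical scaling parameter. None of these specifics are confirmed by the summary.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def initial_heatmap(query_feats, fg_protos, bg_protos, contrast, purity, beta=1.0):
    """Sketch of H_init: weighted foreground similarity minus background similarity.

    query_feats: (H, W, D) query feature map
    fg_protos:   (K, D) foreground prototypes, bg_protos: (J, D) background prototypes
    contrast, purity: (K,) non-negative reliability factors (C_k and R_k)
    beta: hypothetical strength of the background subtraction
    """
    q = l2_normalize(query_feats)
    fg = l2_normalize(fg_protos)
    bg = l2_normalize(bg_protos)

    # Assumed combination rule: product of both factors, normalized to sum to 1.
    weights = contrast * purity
    weights = weights / (weights.sum() + 1e-8)

    fg_score = (q @ fg.T) @ weights      # (H, W): reliability-weighted foreground response
    bg_score = (q @ bg.T).max(axis=-1)   # (H, W): strongest background (negative anchor) response
    return fg_score - beta * bg_score
```

With this form, a foreground prototype whose reliability is zero (for example one sitting on a specular reflection) contributes nothing to the heatmap, and query pixels that resemble the support background are pushed below zero.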

B. Geometric Adaptive Threshold Selection (GAS)

Goal: Address intensity heterogeneity by replacing static thresholds with a dynamic, geometry-aware selection mechanism.

Candidate Generation: Instead of a single fixed threshold, GAS generates a pool of candidate binary masks by scanning a range of thresholds ( $\tau \in [\tau_{min}, \tau_{max}]$ ).
Morphological Filtering: Candidates are refined by filling internal holes and filtering out small connected components (retaining only those whose area is $\ge 20\%$ of the largest component's area).
Geometric Scoring ( $S_{geo}$ ): Each candidate is scored based on:
1. Weighted Solidity: The area-weighted average compactness, favoring convex, anatomical shapes.
2. Scale Consensus: A penalty for candidates significantly smaller than a reference polyp area ( $A_{ref}$ ).
Selection: The candidate with the highest $S_{geo}$ is selected as the optimal prior mask ( $M_{prior}$ ) to generate prompts for SAM2.
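
The threshold sweep above can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's code: solidity is computed as area divided by convex-hull area, the scale term is a hypothetical linear penalty `min(1, area / a_ref)`, and the two terms are multiplied to form the score. The helpers are written in plain NumPy to keep the example self-contained; in practice library routines such as `scipy.ndimage.label` and `scipy.ndimage.binary_fill_holes` would normally replace them.

```python
from collections import deque
import numpy as np

def components(mask):
    """Split a boolean mask into 4-connected components (list of boolean masks)."""
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    parts = []
    for sy, sx in zip(*np.nonzero(mask)):
        if seen[sy, sx]:
            continue
        part = np.zeros((h, w), dtype=bool)
        queue = deque([(sy, sx)])
        seen[sy, sx] = True
        while queue:
            y, x = queue.popleft()
            part[y, x] = True
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        parts.append(part)
    return parts

def fill_holes(mask):
    """Fill background regions that do not touch the image border."""
    filled = mask.copy()
    for part in components(~mask):
        if not (part[0].any() or part[-1].any() or part[:, 0].any() or part[:, -1].any()):
            filled |= part
    return filled

def solidity(part):
    """Area divided by convex-hull area, with the hull built over pixel corners."""
    ys, xs = np.nonzero(part)
    pts = sorted({(x + dx, y + dy) for x, y in zip(xs.tolist(), ys.tolist())
                  for dx in (0, 1) for dy in (0, 1)})

    def chain(points):
        # One half of Andrew's monotone-chain convex hull.
        out = []
        for p in points:
            while len(out) >= 2 and (
                (out[-1][0] - out[-2][0]) * (p[1] - out[-2][1])
                - (out[-1][1] - out[-2][1]) * (p[0] - out[-2][0])
            ) <= 0:
                out.pop()
            out.append(p)
        return out[:-1]

    hull = chain(pts) + chain(pts[::-1])
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]))
    return part.sum() / (abs(twice_area) / 2)

def select_prior(heatmap, thresholds, a_ref, keep_ratio=0.2):
    """Return (best_tau, best_mask, best_score) over the candidate thresholds."""
    best = (None, None, -1.0)
    for tau in thresholds:
        parts = components(fill_holes(heatmap >= tau))
        if not parts:
            continue
        areas = np.array([p.sum() for p in parts], dtype=float)
        keep = areas >= keep_ratio * areas.max()   # drop small components
        parts = [p for p, k in zip(parts, keep) if k]
        areas = areas[keep]
        weighted_solidity = sum(a * solidity(p) for a, p in zip(areas, parts)) / areas.sum()
        scale = min(1.0, areas.sum() / a_ref)      # assumed form of the scale penalty
        score = weighted_solidity * scale
        if score > best[2]:
            best = (tau, np.any(parts, axis=0), score)
    return best
```

On a heatmap where a low threshold attaches a thin tail to an otherwise compact blob, the tail lowers solidity, so a higher threshold that isolates the blob scores better. When several thresholds tie, this sketch keeps the first one encountered.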

C. Prior-guided Iterative Refinement (PIR)

Goal: Progressively polish anatomical boundaries using SAM2's capabilities.

Iterative Loop: The framework iteratively refines the segmentation mask ( $M_t$ ) using $M_{prior}$ as a structural reference.
Error Correction Logic:
- False Negatives: If coverage is low, the geometric center of the missing region (identified via Euclidean Distance Transform) is sampled as a positive prompt to expand the mask.
- False Positives: If coverage is high but IoU is low, the region where the current mask overlaps the background (relative to $M_{prior}$ ) is sampled as a negative prompt to suppress noise.
Termination: The loop stops when coverage and IoU thresholds are met or a maximum iteration count is reached. The mask with the highest historical IoU is selected as the final output.
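
The control flow of this loop can be sketched as below. It is a hedged outline only: `segment_fn` is a hypothetical callable standing in for a SAM2 point-prompted prediction, coverage is assumed to be the fraction of $M_{prior}$ covered by the current mask, IoU is measured against $M_{prior}$, and the threshold values and iteration cap are placeholders. A brute-force distance computation stands in for a library Euclidean Distance Transform so the example needs only NumPy.

```python
import numpy as np

def most_interior_point(region):
    """Pixel of `region` farthest from its complement (brute-force distance transform)."""
    inside = np.argwhere(region)
    outside = np.argwhere(~region)
    if len(outside) == 0:
        return tuple(inside[len(inside) // 2])
    d2 = ((inside[:, None, :] - outside[None, :, :]) ** 2).sum(-1)
    return tuple(inside[d2.min(axis=1).argmax()])

def iou(a, b):
    """Intersection over union of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def refine(segment_fn, m_prior, init_point, cov_thr=0.9, iou_thr=0.8, max_iters=5):
    """Iteratively add point prompts; return the mask with the best IoU seen so far.

    segment_fn(points, labels) -> boolean mask is a hypothetical stand-in for SAM2.
    Points are (row, col) tuples; label 1 is a positive prompt, 0 a negative one.
    """
    points, labels = [init_point], [1]
    best_mask, best_iou = None, -1.0
    for _ in range(max_iters):
        m_t = segment_fn(points, labels)
        cur_iou = iou(m_t, m_prior)
        if cur_iou > best_iou:
            best_mask, best_iou = m_t, cur_iou
        coverage = np.logical_and(m_t, m_prior).sum() / max(m_prior.sum(), 1)
        if coverage >= cov_thr and cur_iou >= iou_thr:
            break                                  # both criteria met
        missed = m_prior & ~m_t                    # likely false negatives
        spill = m_t & ~m_prior                     # likely false positives
        if coverage < cov_thr and missed.any():
            points.append(most_interior_point(missed))
            labels.append(1)                       # positive prompt: expand the mask
        elif spill.any():
            points.append(most_interior_point(spill))
            labels.append(0)                       # negative prompt: suppress spill
        else:
            break
    return best_mask, best_iou
```

Keeping the best mask seen so far, rather than the last one, guards against a late prompt making the result worse, which matches the termination rule described above.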

3. Key Contributions

Identification of Heterogeneity: The paper systematically identifies and addresses three types of information heterogeneity (regional, contextual, and intensity) that limit current one-shot segmentation methods.
RWPM Module: Introduces a reliability-weighted mechanism that filters out degraded support features (e.g., reflections) and utilizes background anchors for contrastive noise suppression.
GAS Module: Proposes a dynamic thresholding strategy based on morphological priors (solidity and scale) rather than fixed thresholds, adapting to stochastic response intensities.
PIR Loop: Designs an automated, iterative refinement process that leverages SAM2 to correct errors and polish boundaries without manual intervention.
Training-Free Efficiency: The entire framework operates without model training or fine-tuning, making it highly scalable for label-scarce clinical environments.

4. Experimental Results

The method was evaluated on four public datasets: Kvasir, CVC-ClinicDB, CVC-ColonDB, and PolypGen (multi-center).

Performance on Kvasir: RPG-SAM achieved 78.65% mIoU and 85.65% mDice, outperforming the previous state-of-the-art (ProtoSAM) by 5.56% mIoU and 4.11% mDice.
Robustness: On the multi-center PolypGen dataset, RPG-SAM demonstrated superior robustness against domain shifts, significantly reducing false-positive activations compared to fixed-threshold methods.
Ablation Studies:
- Removing Background Suppression caused a significant drop in performance (3.78% mDice loss).
- RWPM improved spatial granularity by relieving the "uniform representation fallacy."
- GAS outperformed the best fixed threshold ( $\tau=0.7$ ) by 2.59% mDice, which supports the case for adaptive thresholding.
- PIR provided the final refinement, boosting overall accuracy.
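
For readers unfamiliar with the metrics quoted in these results, the snippet below shows how IoU and Dice are usually computed for a pair of binary masks and then averaged. It assumes per-image averaging and scores an empty prediction against an empty ground truth as 1.0; the summary does not state which conventions the paper uses.

```python
import numpy as np

def iou_and_dice(pred, gt):
    """Return (IoU, Dice) for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    # Assumed convention: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)

def mean_scores(pairs):
    """Average IoU and Dice over (prediction, ground truth) pairs: returns [mIoU, mDice]."""
    return np.array([iou_and_dice(p, g) for p, g in pairs]).mean(axis=0)
```

Dice is never lower than IoU for the same pair of masks, which is why the reported mDice values sit above the corresponding mIoU values.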

5. Significance

RPG-SAM represents a significant advancement in training-free medical image analysis. By explicitly modeling the heterogeneity of endoscopic data, it overcomes the limitations of "one-size-fits-all" prompt sampling strategies.

Clinical Impact: It offers a scalable, robust alternative to data-intensive supervised models, crucial for early colorectal cancer screening where expert annotations are scarce and image quality varies.
Generalizability: The framework's ability to adapt to different centers and image conditions without retraining makes it a practical solution for real-world deployment.
Future Work: The authors plan to extend this framework to exploit temporal consistency in endoscopic video streams.