Imagine you have a team of incredibly smart, super-fast robots designed to draw outlines around bones in medical CT scans. These robots are "Foundation Models" (FMs). They are like master painters who have seen millions of images and can outline almost anything just from where you point or the box you draw around it.
But here's the catch: In the lab, scientists usually test these robots by giving them perfect, computer-generated pointers. It's like telling the robot, "Draw a line exactly here," with mathematical precision. The robots ace these tests, getting 99% accuracy.
The Big Question: What happens when a real human doctor or student tries to use them? Humans aren't perfect. We might click a little to the left, draw a box that's slightly too big, or pick a spot that's a bit off. Does the robot fall apart because our "pointers" aren't perfect?
This paper is a massive experiment to find out. The researchers treated these AI models like new cars on a test track, but instead of a perfect driver, they used 20 medical students to "drive" (or rather, point) the cars.
The Cast of Characters
The researchers tested 11 different AI models. Think of them as different car brands:
- Some were trained on natural images (like photos of cats and cars) and just learned to do medical stuff later (like a sports car trying to drive on a farm).
- Some were trained specifically on medical images (like a tractor built for the farm).
- Some worked 2D (looking at one slice of bread at a time).
- Some worked 3D (looking at the whole loaf of bread at once).
The Experiment: The "Human Touch" Test
The researchers set up a massive study with four body parts: the wrist, shoulder, hip, and lower leg.
- The "Ideal" Test: First, they let the AI run with perfect, computer-generated pointers. This is the "Gold Standard."
- The "Human" Test: Then, they asked 20 medical students to draw the boxes and points themselves.
- The Comparison: They compared the AI's drawings when guided by the computer vs. when guided by the humans.
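The summary above doesn't name the paper's exact scoring metric, but interactive-segmentation studies almost always measure overlap with the Dice coefficient, so here is a minimal sketch of the comparison under that assumption. The masks and the one-pixel-off "human" result are toy data, not the paper's numbers:

```python
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice overlap between two binary masks (1.0 = identical)."""
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return 2.0 * intersection / total if total > 0 else 1.0

# Toy "ground truth" bone mask on a 10x10 slice:
truth = np.zeros((10, 10), dtype=bool)
truth[3:7, 3:7] = True                 # a 4x4 "bone"

pred_ideal = truth.copy()              # output guided by a perfect prompt
pred_human = np.zeros_like(truth)
pred_human[3:7, 4:8] = True            # output guided by a prompt one pixel off

print(dice(pred_ideal, truth))         # 1.0
print(dice(pred_human, truth))         # 2*12 / (16+16) = 0.75
```

Running the same scoring once with simulated prompts and once with each student's prompts, then comparing the averages, is the core of the experiment in miniature.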
The Big Discoveries (The Plot Twists)
1. The "Perfect" Score is a Lie
When the AI was tested with perfect pointers, it looked like a genius. But when real humans used it, the performance dropped.
- Analogy: Imagine a chef who can make a perfect cake if you give them pre-measured ingredients. But if you ask them to guess the amount of sugar by eye, the cake might be a bit too sweet or dry. The paper warns us: Don't trust the "perfect" lab scores too much. Real-world human use is messier.
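Why can a tiny misclick cause a big drop? A toy stand-in makes it concrete. The flood-fill "model" below is not any of the actual foundation models, just a hypothetical segmenter that grows a region from the clicked pixel; it shows how a click that lands one pixel outside the bone can produce an empty result:

```python
import numpy as np
from collections import deque

def segment_from_click(image: np.ndarray, click: tuple[int, int]) -> np.ndarray:
    """Toy 'model': flood-fill the bright region containing the click."""
    mask = np.zeros_like(image, dtype=bool)
    if not image[click]:
        return mask                            # clicked on background: empty mask
    queue = deque([click])
    mask[click] = True
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < image.shape[0] and 0 <= nc < image.shape[1]
                    and image[nr, nc] and not mask[nr, nc]):
                mask[nr, nc] = True
                queue.append((nr, nc))
    return mask

image = np.zeros((12, 12), dtype=bool)
image[4:8, 4:8] = True                         # the "bone"

perfect = segment_from_click(image, (5, 5))    # click in the centre: full bone
sloppy = segment_from_click(image, (8, 5))     # one pixel past the edge: nothing

print(perfect.sum(), sloppy.sum())             # 16 0
```

Real foundation models degrade more gracefully than this, but the cliff-edge behaviour is the same failure mode the "human test" exposes.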
2. Not All Robots Are Created Equal
Some models handled human mistakes much better than others.
- The Winners: In 2D, SAM2.1 was the champion. In 3D, Med-SAM2 and nnInteractive did the best.
- The Losers: Some models were so sensitive that if a human moved their finger just a tiny bit, the AI would draw the bone in the wrong place or miss it entirely.
3. The "Simple vs. Complex" Rule
The robots were great at simple shapes but struggled with complex ones.
- Analogy: If you ask a robot to outline a wrist bone (which is small and round), it's easy. But if you ask it to outline a hip with a metal implant (which has weird shapes and metal that confuses the scanner), the robot gets confused. The humans also struggled more with these complex parts, leading to more errors.
4. The "Human Consistency" Problem
The study found that even humans aren't consistent.
- If the same student drew a box twice, they were pretty close.
- But if two different students drew a box, they often disagreed.
- The Result: The AI models were sensitive to these differences. If Student A pointed slightly left and Student B pointed slightly right, the AI produced two very different results. This is a problem for doctors who need reliable, repeatable results.
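Intra- vs inter-rater agreement can be quantified the same way the segmentations are scored: overlap between the two prompts (or the two resulting masks). A sketch with made-up box coordinates, assuming Dice as the agreement measure:

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice overlap between two binary masks (1.0 = identical)."""
    inter = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    return 2.0 * inter / total if total > 0 else 1.0

def box_mask(shape, r0, r1, c0, c1):
    """Binary mask for a box prompt with the given corners."""
    m = np.zeros(shape, dtype=bool)
    m[r0:r1, c0:c1] = True
    return m

shape = (20, 20)
a_try1 = box_mask(shape, 5, 15, 5, 15)   # Student A, first attempt
a_try2 = box_mask(shape, 5, 15, 6, 16)   # Student A, repeat: shifted 1 pixel
b_try1 = box_mask(shape, 3, 13, 8, 18)   # Student B: a noticeably different box

print(round(dice(a_try1, a_try2), 2))    # intra-rater agreement: 0.9
print(round(dice(a_try1, b_try1), 2))    # inter-rater agreement: 0.56
```

If the model's output swings as much as the prompts do, two doctors get two different answers, which is exactly the repeatability problem the paper flags.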
The Takeaway: Why This Matters
This paper is a reality check for the medical AI world.
For a long time, researchers have been saying, "Look how well our AI works!" based on tests with perfect, computer-generated inputs. This paper says, "Wait a minute. Real humans aren't computers."
If you buy a medical AI tool for a hospital, you need to know:
- Will it work if the doctor is tired and clicks slightly off?
- Will it give the same result if Dr. Smith uses it vs. Dr. Jones?
The Conclusion:
The best models for real-world use aren't necessarily the ones with the highest "perfect" scores. They are the ones that are robust—meaning they can handle a human being a little sloppy without falling apart.
The researchers found that while some models are getting better, none of them are perfect yet. They are all sensitive to how humans point. So, before we let AI take over our CT scans, we need to build models that are more forgiving of human error, or we need to train humans to be more consistent.
In short: The robots are smart, but they are currently very picky about how we talk to them. We need to teach them to understand our "human" way of pointing before we trust them with our health.