GroundedSurg: A Multi-Procedure Benchmark for Language-Conditioned Surgical Tool Segmentation

This paper introduces GroundedSurg, the first multi-procedure benchmark for evaluating language-conditioned, instance-level surgical tool segmentation. It pairs surgical images with natural language descriptions and precise spatial annotations, addressing the limitations of existing category-level evaluation paradigms in clinical AI.

Tajamul Ashraf, Abrar Ul Riyaz, Wasif Tak, Tavaheed Tariq, Sonia Yadav, Moloud Abdar, Janibul Bashir

Published 2026-03-03
📖 4 min read · ☕ Coffee break read

Imagine you are walking into a busy, high-stakes kitchen where a team of chefs is preparing a complex meal. The kitchen is crowded, steam is rising, and there are dozens of identical-looking knives, spoons, and tongs scattered across the counter.

Now, imagine you are the head chef, and you need to give a quick instruction to a new assistant (an AI robot) to help you.

The Old Way (Current AI):
You say, "Pick up a knife."
The robot looks around, sees ten knives, and picks one up at random. It doesn't know which knife you meant. Maybe you wanted the one cutting the steak, not the one lying idle next to a plate. In the real world of surgery, this mistake could be disastrous. The robot might grab the wrong tool, causing a collision or a delay.

The New Way (GroundedSurg):
This paper introduces a new "test" called GroundedSurg that checks whether robots can understand exactly which tool you mean, even when there are many similar ones.

Here is how it works, using simple analogies:

1. The Problem: "Which One?"

In surgery, a surgeon might say, "Pass me the Harmonic Ace that is currently cutting the tissue."

  • The Challenge: There might be three "Harmonic Aces" in the view. One is cutting, one is idle, and one is being held by a nurse.
  • Old AI: Only knows the name of the tool. It sees "Harmonic Ace" and grabs the first one it finds.
  • The Goal: The AI needs to understand the story (it's cutting) and the location (the specific one in the middle of the action). A minimal code sketch of this task follows below.
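
To make the task concrete, here is a minimal sketch of what it looks like as a programming interface. The function name, the `Grounding` record, and its fields are illustrative assumptions for this post, not the paper's actual API.

```python
from dataclasses import dataclass

@dataclass
class Grounding:
    """One grounded answer: which specific instance a sentence refers to."""
    tool_class: str  # e.g. "Harmonic Ace"
    box: tuple       # pixel box (x_min, y_min, x_max, y_max) around that tool
    center: tuple    # (x, y) point on the tool itself

def ground_expression(image, expression: str) -> Grounding:
    """Hypothetical interface for the task GroundedSurg evaluates.

    A category-level model only reads the class name ("Harmonic Ace")
    and may return any matching instance. A grounded model must also
    use the relational cues ("currently cutting the tissue") to pick
    the single correct instance out of several look-alikes.
    """
    raise NotImplementedError  # each benchmarked model supplies this
```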

2. The Solution: A New "Training Gym"

The authors built a massive training gym (a dataset) for these AI robots.

  • The Images: They took over 600 photos from real surgeries (eye surgery, stomach surgery, robotic surgery, etc.).
  • The Instructions: Instead of just labeling "This is a knife," they wrote natural sentences like: "Find the scissors that are holding the stomach wall open."
  • The Answer Key: For every sentence, they drew a precise box around the exact tool and marked its center point. It's like giving the robot a treasure map with an "X" on the specific item, not just the general area. (A rough sketch of one such record follows this list.)
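
As a rough illustration of what "sentence plus answer key" means in practice, one annotation record might look like the following. The file name, field names, and numbers are invented for this example; the real dataset's schema may differ.

```python
# One hypothetical GroundedSurg-style annotation record.
# All field names and values below are illustrative, not the real schema.
annotation = {
    "image": "stomach_surgery/frame_01342.png",
    "expression": "Find the scissors that are holding the stomach wall open.",
    "instance": {
        "category": "scissors",
        "box": [412, 218, 598, 367],  # pixel box: x_min, y_min, x_max, y_max
        "center": [505, 292],         # the "X" on the treasure map
    },
}
```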

3. The Test: Can the Robot Listen?

They tested the most capable AI models available today (the same kinds that power chatbots and image generators) using this new gym.

  • The Result: The robots struggled. Even the "smartest" ones got it wrong about 80% of the time when asked to find the specific tool based on a sentence.
  • The Analogy: It's like asking students, "Find the red car that is driving away," in a parking lot full of red cars. The students (the AI models) often pointed at a parked red car or a blue car, failing to understand the "driving away" part. (A sketch of how such answers can be scored follows below.)
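
How do you score "got it wrong"? A common way to grade grounding benchmarks (an assumption on my part; the paper's exact protocol may differ) is to count a prediction as correct only when its box overlaps the annotated box enough, typically an intersection-over-union (IoU) of at least 0.5:

```python
def iou(a, b):
    """Intersection-over-union of two pixel boxes (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # top-left of the overlap
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])  # bottom-right of the overlap
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_accuracy(pred_boxes, true_boxes, thresh=0.5):
    """Fraction of sentences for which the model found the right instance."""
    hits = sum(iou(p, t) >= thresh for p, t in zip(pred_boxes, true_boxes))
    return hits / len(true_boxes)
```

Under a rule like this, "wrong about 80% of the time" corresponds to a grounding accuracy near 0.2: the model draws a box, but usually around the wrong look-alike.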

4. Why This Matters

The paper shows that for AI to be truly helpful in the operating room, it can't just be a "labeler" (identifying objects). It needs to be a "context-aware assistant."

  • Current AI: "I see a scalpel."
  • Future AI (what GroundedSurg wants): "I see three scalpels. The one you are talking about is the one touching the liver, not the one on the tray. I will guide the robotic arm to that specific one."

The Big Takeaway

Think of GroundedSurg as a new driving test for self-driving cars.

  • Old Test: "Can you stop at a red light?" (Yes, easy.)
  • New Test: "Can you stop at the red light specifically because a pedestrian is stepping off the curb, even though the light is green for the cross-traffic?"

The paper concludes that while our current AI is good at seeing, it is terrible at understanding the context of what it sees. GroundedSurg provides the first real-world "exam" to fix this, ensuring that future surgical robots won't just see tools, but will understand what the surgeon is actually trying to do.