ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments

Imagine you are trying to teach a robot to perform surgery. Before the robot can cut or stitch, it needs to know exactly where its "hands" (the surgical tools) are in the video feed. This is the core problem the paper ROBUST-MIPS tries to solve.

Here is a breakdown of the paper using simple analogies:

1. The Problem: The "Too Much Detail" Trap

In the past, to teach computers to see surgical tools, researchers used Instance Segmentation.

The Analogy: Imagine trying to teach a child to draw a cat. With segmentation, you have to trace the exact outline of every single whisker, the curve of the tail, and the shape of the ear. It's incredibly precise, but it takes hours to draw one picture.
The Issue: In surgery, tools are long, thin, and often hidden behind tissue or smoke. Drawing perfect outlines for every tool in thousands of video frames is too slow and expensive.

2. The Solution: The "Stick Figure" Approach

The authors argue that instead of tracing the whole outline, we should just draw Skeletal Poses (stick figures).

The Analogy: Instead of drawing the whole cat, you just draw a stick figure: a line for the body, a dot for the head, and a dot for the tail.
Why it works:
- Speed: It's much faster to draw a few dots and lines than a complex outline.
- Clarity: Even if the tool is partially hidden, you can often guess where the "elbow" (hinge) or "fingertip" (tool tip) is based on the straight line of the shaft.
- Structure: It tells the computer exactly how the tool is bending and where the important parts are.

3. The Dataset: ROBUST-MIPS

The team created a large new library of data called ROBUST-MIPS.

The Source: They took an existing dataset of 10,040 surgical video frames (ROBUST-MIS) that already had the "cat outlines" (segmentation masks).
The Upgrade: They went through and added "stick figures" (skeletal poses) to every single frame.
The Result: Now, researchers have a dataset that has both the detailed outline and the simple stick figure. This allows them to compare: "Is the stick figure just as good as the detailed outline for teaching the robot?"

4. The Rules of the Game (Annotation)

Drawing stick figures on surgical tools is tricky because tools move in and out of the camera view. The authors created a strict rulebook:

The "Entry Point": Where the tool enters the body (like a doorframe).
The "Hinge": The joint where the tool bends (like an elbow).
The "Tips": The working ends of the tool (like fingers).
The "Invisible" Rules:
- Visible: You can see it.
- Occluded: It's hidden behind tissue, but you can guess where it is (like a hand behind a back).
- Missing: It's completely gone or doesn't exist (like a second finger on a rigid tool).
The "Zoom-Out" Trick: Sometimes a tool extends outside the video frame. The custom software they built lets annotators draw points outside the picture so the computer knows the tool is still connected, even if it's off-screen.

5. The "Scorecard" (Evaluation)

To see if their method works, they tested popular AI models (like RTMPose and ViTPose) on this new data.

The Twist: Standard scoring systems (like COCO) are designed for humans. If a human has two hands, the left hand is always the left hand. But surgical tools like scissors have two tips that are identical. If the AI swaps them, it's still correct!
The Fix: They tweaked the scoring system to say, "If the AI gets the tips swapped, it still gets full points."
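The Scale Problem: Surgical tools are long and skinny. Standard scoring measures an object's size by the area of the box around it, and that box shrinks to a sliver when a thin tool lies exactly vertical or horizontal, so the same tool looks "small" or "large" depending on its angle. The authors instead measure size from the box's width and height in a way that tracks the tool's length (much like its diagonal), so the score stays fair however the tool is rotated.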

6. The Results

The models trained on this "stick figure" data performed very well.

They achieved high accuracy in finding where the tools are.
This suggests that you may not need time-consuming, detailed outlines to teach a computer where surgical tools are: the simple "stick figure" approach is faster to annotate and still localizes the tools accurately.

Summary

Think of this paper as the team that said, "Stop trying to paint a masterpiece of every surgical tool. Let's just draw stick figures." They built a large library of these stick figures, taught the AI how to read them, and showed that this simpler method localizes tools accurately, a step toward making computer-assisted surgery faster and more reliable. They also gave away their drawing tools and the library for free so other scientists can use them.

1. Problem Statement

The localization of surgical tools in intraoperative endoscopic video is critical for computer-assisted intervention (CAI) systems, enabling features like safety analysis and automated endoscope control. However, current research is predominantly limited by the scarcity of diverse, annotated data.

Annotation Bottleneck: Traditional semantic or instance segmentation requires tracing a complex polygon around each instrument to produce a pixel-accurate mask, which is time-consuming and labor-intensive.
Limitations of Bounding Boxes: While bounding boxes are efficient in general computer vision, they are ineffective for surgical tools due to the tools' elongated, articulated structures. Bounding boxes often cover large image areas, overlap significantly, and fail to capture precise structural details (e.g., tip vs. shaft).
Need for Pose Estimation: Skeletal pose annotations offer a better balance between semantic richness and annotation efficiency. They capture structural information (tip, shaft, hinge) and can distinguish between tool instances. However, existing pose datasets (e.g., RMIT, EndoVis, SurgPose) suffer from small sizes, redundancy, or lack of complex in-vivo scenarios (occlusions, mutual interactions).

2. Methodology

The authors present ROBUST-MIPS, a new dataset derived from the existing ROBUST-MIS dataset, enriched with skeletal pose annotations.

A. Data Source and Composition

Origin: Derived from 10,040 laparoscopic frames extracted from 30 colorectal surgeries (10 rectal resections, 10 proctocolectomies, 10 sigmoid resections) performed at Heidelberg University Hospital.
Conditions: Includes challenging real-world scenarios: bleeding, smoke, illumination changes, overlapping instruments, and partial visibility.
Split Strategy: The data is divided into Training, Validation (Stage 1: same patients as training), and Testing (Stage 2 & 3: new patients and different surgery types) to evaluate domain shift and generalization.

B. Keypoint Definition and Annotation Protocol

The dataset defines a unified skeletal representation for both rigid and articulated instruments using four keypoint categories:

EntryPoint: The intersection of the instrument shaft and the circular endoscopic field of view (FoV).
HingePoint: The junction between the shaft and the tip (or the joint for articulated tools).
Tip1 / Tip2: The endpoints of the instrument.
- Note: For articulated tools (e.g., graspers), Tip1 and Tip2 are an unordered set due to symmetry and rotation ambiguity.
- Rigid tools: Only Tip1 exists; Tip2 is marked as "missing."

Visibility States:

Visible: Clearly seen.
Occluded: Physically present but hidden (e.g., by tissue) or outside the circular FoV but within the image frame; position is inferred.
Missing: Physically absent (e.g., second tip of a rigid tool) or completely out of view with no cues for inference.
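As an illustration only, the four keypoint categories and three visibility states could be modeled as below. This is a minimal sketch, not the dataset's own code: the names `Visibility`, `Keypoint`, and `make_rigid_tool` are hypothetical, and the rule that a missing keypoint carries no coordinates is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Visibility(Enum):
    MISSING = "missing"    # absent, or out of view with no cue for inference
    OCCLUDED = "occluded"  # present but hidden; position is inferred
    VISIBLE = "visible"    # clearly seen


KEYPOINT_NAMES = ("EntryPoint", "HingePoint", "Tip1", "Tip2")


@dataclass
class Keypoint:
    visibility: Visibility
    xy: Optional[Tuple[float, float]] = None  # None when missing (assumed rule)

    def __post_init__(self):
        if self.visibility is Visibility.MISSING and self.xy is not None:
            raise ValueError("a missing keypoint carries no coordinates")
        if self.visibility is not Visibility.MISSING and self.xy is None:
            raise ValueError("visible/occluded keypoints need coordinates")


def make_rigid_tool(entry, hinge, tip) -> Dict[str, Keypoint]:
    """Rigid instrument with all points visible: Tip2 is marked missing."""
    return {
        "EntryPoint": Keypoint(Visibility.VISIBLE, entry),
        "HingePoint": Keypoint(Visibility.VISIBLE, hinge),
        "Tip1": Keypoint(Visibility.VISIBLE, tip),
        "Tip2": Keypoint(Visibility.MISSING),
    }


tool = make_rigid_tool((12.0, 240.5), (310.0, 198.0), (352.5, 181.0))
```

Keeping all four slots for every instrument, with Tip2 explicitly missing on rigid tools, is what lets one skeleton definition serve both rigid and articulated instruments.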

Special Handling:

Trocar Cannulas: Removed from instance segmentation masks to reduce noise. The distal end of the cannula is defined as the EntryPoint for pose annotation.
Out-of-Bounds: If a shaft extends beyond the image, annotators can mark points in a padded area to maintain skeletal connectivity, though these are filtered during training.
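The summary does not say how out-of-bounds points are filtered during training. One simple possibility, sketched here under that caveat, is to mark any keypoint that falls outside the image as unlabeled before it reaches the model. The function name `mask_out_of_bounds` and the `(x, y, v)` triple layout are assumptions of this sketch.

```python
def mask_out_of_bounds(keypoints, width, height):
    """Mark keypoints outside [0, width) x [0, height) as unlabeled.

    keypoints: list of (x, y, v) triples in image coordinates. Points that
    were annotated in the padded area have x or y outside the image.
    Returns a new list; out-of-bounds entries become (0.0, 0.0, 0).
    """
    filtered = []
    for x, y, v in keypoints:
        inside = 0 <= x < width and 0 <= y < height
        filtered.append((x, y, v) if inside else (0.0, 0.0, 0))
    return filtered


# A shaft point annotated 5 px left of the image is dropped for training.
example = mask_out_of_bounds([(10, 10, 2), (-5, 20, 2)], width=100, height=100)
```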

C. Annotation Software

The authors released open-source software (tool-pose-annotation-gui) featuring:

Zoom-out capabilities for annotating points outside the visible frame.
Semantic shortcuts (e.g., 'E' for EntryPoint).
Logic to handle transitions between visible and occluded states.
Tools to refine instance masks by removing trocar cannulas.

D. Data Format and Metrics

Format: Annotations are stored in JSON files compatible with the Microsoft COCO schema, facilitating the use of standard human pose estimation frameworks.
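For orientation, a COCO-style keypoint record for one instrument might look like the sketch below. COCO stores keypoints as a flat `[x, y, v]` list with `v = 0` (not labeled), `1` (labeled but not visible), or `2` (visible). Mapping missing/occluded/visible onto those flags, the category name, the skeleton edges, and all coordinate values are assumptions made for illustration, not taken from the released files.

```python
import json

# Assumed mapping of the three visibility states onto COCO's v flag.
VIS_FLAG = {"missing": 0, "occluded": 1, "visible": 2}

category = {
    "id": 1,
    "name": "instrument",  # hypothetical category name
    "keypoints": ["EntryPoint", "HingePoint", "Tip1", "Tip2"],
    # 1-indexed edges, as in COCO; this connectivity is assumed.
    "skeleton": [[1, 2], [2, 3], [2, 4]],
}


def to_coco_keypoints(points):
    """Flatten {name: (x, y, state)} into COCO's [x1, y1, v1, ...] list."""
    flat = []
    for name in category["keypoints"]:
        x, y, state = points[name]
        v = VIS_FLAG[state]
        flat += [x, y, v] if v else [0, 0, 0]
    return flat


# Made-up coordinates for a rigid tool whose entry point is occluded.
points = {
    "EntryPoint": (12.0, 240.5, "occluded"),
    "HingePoint": (310.0, 198.0, "visible"),
    "Tip1": (352.5, 181.0, "visible"),
    "Tip2": (0.0, 0.0, "missing"),
}
keypoints = to_coco_keypoints(points)

annotation = {
    "id": 1,
    "image_id": 1,
    "category_id": category["id"],
    "iscrowd": 0,
    "bbox": [12.0, 181.0, 340.5, 59.5],  # [x, y, width, height]
    "area": 5200.0,                      # made-up mask area in pixels
    "keypoints": keypoints,
    "num_keypoints": sum(1 for v in keypoints[2::3] if v > 0),
}

print(json.dumps({"categories": [category], "annotations": [annotation]}, indent=2))
```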
Metric Adaptation (COCO OKS):
- Tip Equivalence: Because symmetric tool tips are unordered, the metric was modified so that a prediction with Tip1 and Tip2 swapped is not penalized during evaluation.
- Scale Invariance: The standard COCO scale factor ( $s = \sqrt{wh}$ ), the square root of the box area, is unsuitable for slender, high-aspect-ratio tools: when such a tool is axis-aligned its box becomes very thin and $s$ collapses. The authors redefined $s$ as the root mean square of the box width and height ( $s = \sqrt{(w^2+h^2)/2}$ ), which stays nearly constant as a slender tool rotates.
- Variance: A single, conservative per-keypoint standard deviation ( $\sigma = 0.107$ , the value COCO assigns to human hips) was applied to all keypoints to account for the high ambiguity in surgical tool annotation.
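The three adaptations can be combined in a few lines. The sketch below follows the usual COCO OKS form, $\exp(-d_i^2 / (2 s^2 k_i^2))$ with $k_i = 2\sigma_i$, and plugs in the redefined scale and the shared $\sigma$. Scoring both tip assignments and keeping the higher value is one plausible reading of the tip-swap rule, not a confirmed detail of the authors' evaluation code; the function names are hypothetical.

```python
import math

SIGMA = 0.107       # single value for all keypoints, as stated above
TIP1, TIP2 = 2, 3   # indices in (EntryPoint, HingePoint, Tip1, Tip2)


def tool_scale(w, h):
    """Root mean square of box width and height: s = sqrt((w^2 + h^2) / 2)."""
    return math.sqrt((w * w + h * h) / 2.0)


def oks(pred, gt, vis, w, h, sigma=SIGMA):
    """COCO-style OKS averaged over labeled ground-truth keypoints (v > 0)."""
    s2 = tool_scale(w, h) ** 2
    k2 = (2.0 * sigma) ** 2
    sims = [
        math.exp(-((px - gx) ** 2 + (py - gy) ** 2) / (2.0 * s2 * k2))
        for (px, py), (gx, gy), v in zip(pred, gt, vis)
        if v > 0
    ]
    return sum(sims) / len(sims) if sims else 0.0


def oks_tip_invariant(pred, gt, vis, w, h):
    """Score both tip orderings of the prediction and keep the better one."""
    swapped = list(pred)
    swapped[TIP1], swapped[TIP2] = swapped[TIP2], swapped[TIP1]
    return max(oks(pred, gt, vis, w, h), oks(swapped, gt, vis, w, h))


# Made-up example: the prediction is perfect except that the tips are swapped.
gt = [(0, 0), (100, 0), (120, 10), (120, -10)]
pred = [(0, 0), (100, 0), (120, -10), (120, 10)]
vis = [2, 2, 2, 2]
plain = oks(pred, gt, vis, w=120, h=20)
invariant = oks_tip_invariant(pred, gt, vis, w=120, h=20)
```

In this example `plain` is penalized for the swap while `invariant` is not. The scale change matters just as much: for an axis-aligned box of 400 x 10 pixels, $\sqrt{wh}$ is about 63 pixels, whereas the redefined $s$ is about 283 pixels, close to the tool's actual length.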

3. Key Contributions

ROBUST-MIPS Dataset: The largest and most varied dataset for surgical tool pose estimation, containing 10,040 frames with both skeletal pose and instance segmentation annotations.
Unified Annotation Scheme: A robust protocol handling articulated vs. rigid tools, visibility states (visible/occluded/missing), and out-of-bounds scenarios.
Open-Source Tools: Release of custom annotation software and benchmark training code.
Metric Innovation: A modified COCO OKS metric that accounts for tool symmetry and rotation-invariant scaling, providing a fairer evaluation for surgical instruments.
Benchmarking: Establishment of baseline performance using state-of-the-art models (RTMPose, SimpleBaseline, ViTPose) adapted for surgical tools.

4. Results

The authors trained three baseline models (SimpleBaseline, RTMPose, ViTPose) on the ROBUST-MIPS dataset.

Performance: The best-performing model, ViTPose-L, achieved an Average Precision (AP) of 0.754 on the testing set.
Robustness: The models are reported to generalize well across different surgery types and domain shifts (Stage 2 and Stage 3 testing).
Qualitative Analysis: Visualizations showed that the models could accurately localize tips and hinges even under occlusion and varying lighting, though performance varied slightly based on backbone architecture and resolution.

5. Significance and Future Work

Accelerated Research: By providing a large-scale, pose-annotated dataset, ROBUST-MIPS lowers the barrier for developing robust surgical tool tracking and localization systems.
Task Interplay: The dataset allows researchers to study the relationship between instance segmentation and pose estimation, potentially leading to multi-task learning approaches.
Limitations:
- Curved instruments (e.g., hooks) are not perfectly represented by straight line segments between keypoints.
- All tools are currently categorized as a single class; finer-grained classification (e.g., distinguishing specific grasper types) is not included.
- The unordered tip annotation requires specific handling in model architecture (e.g., permutation invariance), which current baselines do not explicitly encode.
Impact: This work paves the way for more advanced CAI features, such as automated safety monitoring and robotic assistance, by providing a reliable foundation for tool localization in complex, real-world surgical environments.
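The unordered-tip limitation listed above is commonly addressed with a permutation-invariant loss. The sketch below shows the idea for the two tips only: compute the error under both assignments and train on the smaller one. It is a generic illustration with hypothetical function names, not something the baselines in the paper implement.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def tip_loss(pred_tips, gt_tips):
    """Minimum over the two tip assignments of the summed squared error."""
    (p1, p2), (g1, g2) = pred_tips, gt_tips
    direct = sq_dist(p1, g1) + sq_dist(p2, g2)
    swapped = sq_dist(p1, g2) + sq_dist(p2, g1)
    return min(direct, swapped)


# A prediction with the tips in the opposite order incurs no loss.
loss_swapped = tip_loss(((1, 0), (0, 1)), ((0, 1), (1, 0)))
```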