ReCAP: Recursive Cross Attention Network for Pseudo-Label Generation in Robotic Surgical Skill Assessment

Imagine you are learning to drive a high-tech, robotic car. To become a master driver, you need a coach to watch you and tell you how you're doing. Usually, this coach is a senior driver who watches you the whole time and gives you a single score at the end, like "8 out of 10."

The problem is, that single score doesn't tell you what you did wrong. Did you brake too hard? Did you forget to check your blind spot? Did you drive too fast on the curves? You just get a number, and you have to guess where you went wrong.

This paper introduces a new "AI Coach" called ReCAP that solves this problem for robotic surgery. Here is how it works, broken down into simple concepts:

1. The Problem: The "Black Box" Score

In the past, researchers tried to use computers to grade surgeons. They would feed the computer data about how the robot moved (kinematics) or video of the surgery. The computer would then spit out a final score (like the "8 out of 10").

The Flaw: It was like a teacher grading a math test and only telling you the final score without showing you which specific steps were wrong. It missed the little mistakes happening during the surgery.

2. The Solution: The "Segment-by-Segment" Coach

The authors built a new AI model (ReCAP) that acts like a coach who doesn't just wait until the end. Instead, it watches the surgeon step-by-step.

The Analogy: Imagine a marathon runner. A traditional coach waits until the finish line and says, "You ran well." ReCAP is like a coach running alongside the athlete, checking their form every 100 meters. It says, "Your arm swing is great here," then "Watch your breathing there," then "Great pace now."
How it works: The AI breaks the surgery into tiny chunks (segments). For every chunk, it guesses a score based on six different skills (like "respect for tissue," "handling tools," and "speed"). It doesn't have a teacher telling it the score for every single chunk; it has to figure it out on its own (this is called "weakly supervised").

3. The Magic Trick: "Pseudo-Labels"

This is the cleverest part. The AI doesn't know the exact score for every tiny second of the surgery. It only knows the total score for the whole surgery.

The Metaphor: Imagine you are baking a cake and you only know the final taste is "Delicious." You don't know if the sugar was perfect or if the flour was perfect.
ReCAP's Strategy: The AI guesses the quality of the sugar, the flour, and the eggs individually for every step of the baking process. Then, it adds up all those guesses to see if they equal "Delicious." If the total matches the real score, the AI knows its guesses were probably right. It creates its own "fake labels" (pseudo-labels) to teach itself how to grade the small steps.

4. The Results: Better Than the Old Way

The researchers tested this on a famous dataset of robotic surgeries (JIGSAWS).

The Score: When predicting the final grade, ReCAP was more accurate than any previous method that used movement data. It was just as good as methods that used expensive video cameras, but it used simple movement data (which is cheaper and easier to get).
The Detail: It also did a great job at grading the specific skills (like "handling the needle").
The Human Check: They showed the AI's step-by-step feedback to a real senior surgeon. The surgeon agreed with the AI's "good" or "bad" assessments 77% of the time, compared with 69% for randomly generated feedback.

5. Why This Matters

Currently, if a junior surgeon wants to improve, they have to wait for a busy senior surgeon to watch them and give feedback. This is slow and expensive.

The Future: With ReCAP, a robot could watch a surgeon in real-time and say, "Hey, your hand was shaking a bit during that stitch," or "Great job on the knot tying!"
The Benefit: This gives surgeons instant, detailed feedback without needing a human to watch every second. It turns a vague "You did okay" into a specific "Here is exactly how to get better."

In a Nutshell

ReCAP is a smart AI that learns to grade robotic surgeries by breaking them down into tiny moments. Instead of just giving a final grade, it acts like a personal trainer, pointing out exactly which moves were good and which need work, all by learning from the final score alone. It's a big step toward making surgical training faster, fairer, and more automated.

1. Problem Statement

Surgical skill assessment is critical for training, traditionally relying on the Objective Structured Assessment of Technical Skills (OSATS) and the Global Rating Scale (GRS). While effective, these methods are subjective, time-consuming, and dependent on senior surgeons, creating a bottleneck for feedback.

Recent machine learning approaches have attempted to automate this using the JIGSAWS dataset (which contains kinematic and video data). However, existing methods face two main limitations:

Over-reliance on GRS Regression: Most models regress a single aggregate GRS score. This loses clinically meaningful variations and granular feedback regarding specific skills (e.g., tissue handling vs. time and motion).
Lack of Granular Feedback: Existing attempts to model intermediate scores often require dense ground-truth labels for every segment (which are expensive to obtain) or fail to translate predictions into actionable clinical feedback.

The authors argue that regressing GRS alone is limiting and propose a method to generate segment-level pseudo-labels for the six individual OSATS metrics in a weakly-supervised manner (using only the final trial-level GRS/OSATS labels).

2. Methodology: ReCAP

The authors propose ReCAP (Recursive Cross-Attention for Pseudo-label generation), a recurrent transformer-based model designed to map kinematic data to intermediate OSATS scores.

Input and Problem Formulation

Input: Kinematic signals ( $X_i$ ) are divided into equal segments ( $x_s$ ) of length $L$ .
Goal: Map the sequence of segments to a vector of OSATS scores ( $Y$ ).
Weakly-Supervised Constraint: The model is trained only on the final trial-level OSATS/GRS labels ( $Y$ ), not on segment-level ground truth. The model must learn to predict intermediate segment scores ( $\hat{y}_s$ ) such that their average equals the ground truth label.
- Equation: $y_n = \frac{1}{S} \sum_{s=1}^{S} f_n(x_s)$ , where $f_n$ maps a segment to the $n$ -th OSATS score.
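
To make the formulation concrete, here is a minimal Python sketch of the segmentation and averaging steps. It is an illustration only: `split_into_segments`, `trial_score`, and `toy_scorer` are hypothetical names, the toy scorer merely stands in for $f_n$, and dropping leftover frames is an assumption, since the summary does not say how remainders are handled.

```python
from statistics import mean


def split_into_segments(signal, segment_len):
    """Split a list of frames into equal, non-overlapping segments of length L.

    Assumption: trailing frames that do not fill a whole segment are dropped.
    """
    n_segments = len(signal) // segment_len
    return [signal[i * segment_len:(i + 1) * segment_len] for i in range(n_segments)]


def trial_score(segments, segment_scorer):
    """Trial-level score as the mean of per-segment scores: (1/S) * sum_s f_n(x_s)."""
    return mean(segment_scorer(seg) for seg in segments)


def toy_scorer(segment):
    """Hypothetical stand-in for f_n: scores a segment by its mean value."""
    return sum(segment) / len(segment)


# Toy single-channel signal with 10 frames and segment length L = 4.
frames = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
segments = split_into_segments(frames, segment_len=4)
score = trial_score(segments, toy_scorer)
```

With these toy values the last two frames are discarded, leaving two segments whose scores are averaged into one trial-level value.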

Architecture

Recurrent Backbone ( $h$ ):
- Processes kinematic segments sequentially.
- Takes two inputs: the current segment ( $x_s$ ) and the previous hidden state ( $z_{s-1}$ ).
- Fusion Module: Uses Multi-Head Self-Attention and Cross-Attention blocks to fuse temporal context (previous state) with current input to produce the new hidden state ( $z_s$ ).
Classification Heads ( $c_n$ ):
- Six parallel Multi-Layer Perceptrons (MLPs), one for each OSATS category.
- Each head takes the hidden state $z_s$ and outputs a segment-level prediction ( $\hat{y}_{s}^n$ ).
Aggregation:
- The final predicted OSATS score for the trial is the average of all segment-level predictions: $\hat{y}^n = \frac{1}{S} \sum_{s=1}^{S} \hat{y}_{s}^n$ .
- The final GRS is the sum of the six predicted OSATS scores.
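
The data flow described above (recurrent state update, six heads, averaging, summing to a GRS) can be sketched in plain Python. This is not the authors' network: `toy_backbone` and `toy_head` are hypothetical placeholders for the attention-based backbone $h$ and the MLP heads $c_n$, and the 1 to 5 score range per category is an assumption based on the usual OSATS scale.

```python
import math

NUM_OSATS = 6    # six OSATS categories, as stated in the summary
NUM_CLASSES = 5  # assumed score range 1..5 per category


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_backbone(segment, prev_state):
    """Hypothetical stand-in for h: blends the segment mean into the previous state."""
    seg_mean = sum(segment) / len(segment)
    return [0.5 * p + 0.5 * seg_mean for p in prev_state]


def toy_head(state, n):
    """Hypothetical stand-in for c_n: class probabilities from one state component."""
    target = state[n] * (NUM_CLASSES - 1)
    return softmax([-abs(target - k) for k in range(NUM_CLASSES)])


def predict_trial(segments):
    """Return per-segment pseudo-labels, trial-level OSATS scores, and the GRS."""
    state = [0.0] * NUM_OSATS  # initial hidden state z_0
    per_segment = []           # per_segment[s][n] = class probabilities
    for seg in segments:
        state = toy_backbone(seg, state)  # z_s = h(x_s, z_{s-1})
        per_segment.append([toy_head(state, n) for n in range(NUM_OSATS)])
    n_seg = len(per_segment)
    # Average the class probabilities over segments for each category.
    averaged = [
        [sum(per_segment[s][n][k] for s in range(n_seg)) / n_seg
         for k in range(NUM_CLASSES)]
        for n in range(NUM_OSATS)
    ]
    osats = [probs.index(max(probs)) + 1 for probs in averaged]  # scores 1..5
    grs = sum(osats)  # GRS as the sum of the six OSATS scores
    return per_segment, osats, grs


pseudo_labels, osats_scores, grs = predict_trial([[0.2, 0.4], [0.6, 0.8], [1.0, 1.0]])
```

The per-segment outputs in `pseudo_labels` play the role of the segment-level pseudo-labels, while `osats_scores` and `grs` are the trial-level quantities that training can actually supervise.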

Training Strategy

Loss Function: Cross-Entropy (CE) loss applied to the averaged segment predictions against the ground truth trial-level labels.
Regularization: An L2 penalty term is added.
Data Augmentation: Gaussian noise (based on signal std dev) and signal flipping (time reversal) to improve generalization.
Optimization: Adam optimizer, trained for 5000 epochs.
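
The training ingredients listed above can be illustrated with a short standard-library sketch. The function names, the noise scale of 0.1, and the penalty weight `lam` are all hypothetical; the summary only states that noise is based on the signal's standard deviation, that signals are time-reversed, and that an L2 term is added.

```python
import math
import random
from statistics import pstdev


def add_gaussian_noise(signal, scale=0.1, rng=random):
    """Add zero-mean Gaussian noise whose std is a fraction of the signal's std."""
    sigma = scale * pstdev(signal)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def time_reverse(signal):
    """Flip the signal along the time axis."""
    return signal[::-1]


def cross_entropy_on_average(segment_probs, true_class):
    """Cross-entropy between the segment-averaged distribution and a trial label.

    segment_probs[s][k] is the predicted probability of class k for segment s;
    true_class is the 0-based index of the trial-level label.
    """
    n_seg = len(segment_probs)
    n_classes = len(segment_probs[0])
    avg = [sum(p[k] for p in segment_probs) / n_seg for k in range(n_classes)]
    return -math.log(avg[true_class] + 1e-12)  # small epsilon avoids log(0)


def l2_penalty(weights, lam=1e-4):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)


# Toy example: two segments, two classes, trial label is class 1.
loss = cross_entropy_on_average([[0.2, 0.8], [0.6, 0.4]], true_class=1)
total = loss + l2_penalty([0.5, -0.5])
```

Because the loss only sees the averaged distribution, individual segments remain free to disagree with the trial label, which is what lets segment-level pseudo-labels emerge.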

3. Key Contributions

Weakly-Supervised Pseudo-Label Generation: The first method to predict granular, segment-level OSATS scores using only trial-level labels, eliminating the need for expensive frame-by-frame annotation.
Recurrent Cross-Attention Architecture: A novel backbone that effectively fuses temporal history with current kinematic inputs to capture the evolution of surgical skill within a trial.
Actionable Feedback: By generating intermediate scores, the model provides qualitative, step-by-step feedback (e.g., identifying exactly where a surgeon struggled with "Time and Motion") rather than just a final score.
Task-Agnostic Modeling: The model is designed to work across different surgical tasks (Needle Passing, Suturing, Knot Tying) without task-specific retraining.

4. Experimental Results

The model was evaluated on the JIGSAWS dataset using a Leave-One-Supertrial-Out (LOSO) cross-validation scheme.
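
As a rough sketch of how such a split can be built, the snippet below assumes the common JIGSAWS convention that supertrial i groups the i-th trial of every subject. The metadata layout (dictionaries with `subject` and `trial` keys) and the subject labels are hypothetical.

```python
def loso_folds(trials):
    """Yield (held_out_index, train, test), holding out one supertrial per fold.

    Assumption: supertrial i is the i-th trial of every subject.
    """
    for i in sorted({t["trial"] for t in trials}):
        test = [t for t in trials if t["trial"] == i]
        train = [t for t in trials if t["trial"] != i]
        yield i, train, test


# Toy metadata: two hypothetical subjects with three trials each.
trials = [{"subject": s, "trial": k} for s in ("B", "C") for k in (1, 2, 3)]
folds = list(loso_folds(trials))
```

Each fold tests on trials from all subjects at once, so the evaluation measures generalization to unseen repetitions rather than to unseen surgeons.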

Performance Metrics

Global Rating Score (GRS):
- ReCAP achieved Spearman's Correlation Coefficients (SCC) of 0.88 (KT), 0.85 (NP), and 0.83 (SU).
- Comparison: It outperformed all previous kinematic-based methods (e.g., SMT-DCT-DFT, DCT-DFT-ApEn) and matched the performance of state-of-the-art video-based models (e.g., ViSA, Contra-Sformer), despite using only kinematic data.
OSATS Prediction (Granular):
- The model achieved SCCs ranging from 0.46 to 0.70 for average OSATS predictions across tasks.
- Specific OSATS metrics reached SCCs up to 0.95 (Time and Motion in Knot Tying).
- It outperformed the previous best kinematic method (CNN+BiLSTM) in most categories, particularly in "Respect for Tissue" and "Time and Motion."
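
For readers unfamiliar with the metric, Spearman's correlation is the Pearson correlation of the ranks of the two score lists. The sketch below is a plain-Python illustration with average ranks for ties; the function names and the toy scores are hypothetical and unrelated to the reported results.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(pred, truth):
    """Spearman's rank correlation between predicted and reference scores."""
    return pearson(average_ranks(pred), average_ranks(truth))


# Toy GRS values: predictions preserve the reference ordering exactly.
scc = spearman([12, 18, 25, 29], [10, 15, 20, 30])
```

Since only the ordering matters, a model can reach a high SCC even when its absolute scores are offset from the reference ratings.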

Validation

Surgeon Validation: A consultant surgeon reviewed 9 videos with model-generated pseudo-labels.
- The surgeon agreed with the model's predictions 77% of the time.
- Agreement with random-noise predictions was lower (69%), and the difference was statistically significant ( $p=0.006$ ).
Ablation Studies:
- Removing pseudo-label generation (training only on final GRS without intermediate constraints) caused a drastic performance drop, especially in Needle Passing (SCC dropped from 0.85 to 0.54).
- Data augmentation had a minimal effect on GRS but helped with time invariance.

5. Significance and Limitations

Significance:

Scalability: By removing the need for dense segment-level annotations, ReCAP makes automated skill assessment scalable to large datasets.
Clinical Utility: The ability to pinpoint specific moments of poor performance (via segment-level pseudo-labels) transforms automated assessment from a grading tool into a teaching tool.
Efficiency: Kinematic data is computationally cheaper and system-agnostic compared to video, making this approach viable for integration into various robotic surgical systems.

Limitations:

Visual Nuances: The model struggles where visual context matters most, such as the "Quality of Final Product" metric and the Needle Passing task, because kinematic data cannot capture depth misjudgments or tissue appearance.
Dataset Bias: The JIGSAWS dataset is imbalanced (e.g., score 5 only appears in suturing), and the small sample size for certain folds introduces bias in correlation metrics.
Validation Difficulty: Fine-grained OSATS scores are inherently subjective; even experts disagree on segment-level ratings, making ground-truth validation for pseudo-labels challenging.

Conclusion

ReCAP demonstrates that weakly-supervised recurrent cross-attention networks can effectively bridge the gap between aggregate surgical scores and granular, actionable feedback. It achieves state-of-the-art performance on kinematic data, rivaling video-based models, and offers a promising pathway for automated, scalable surgical training systems.