SurgFed: Language-guided Multi-Task Federated Learning for Surgical Video Understanding

The paper proposes SurgFed, a language-guided multi-task federated learning framework that uses Language-guided Channel Selection (LCS) and Language-guided Hyper Aggregation (LHA) to overcome tissue and task diversity, improving surgical video segmentation and depth estimation across heterogeneous clinical environments.

Zheng Fang, Ziwei Niu, Ziyue Wang, Zhu Zhuo, Haofeng Liu, Shuyang Qian, Jun Xia, Yueming Jin

Published Wed, 11 Ma

Imagine a world where robot surgeons are getting smarter every day, helping doctors perform delicate, minimally invasive operations. But for these robots to be truly autonomous and safe, they need to "see" and "understand" the surgical scene perfectly. They need to know exactly where the tools are, what the tissues look like, and how deep everything is.

The problem? Data is scattered and private.

Hospitals in different cities (or even different countries) have their own unique surgical videos. They can't just share these videos because patient privacy laws forbid it. So, how do we teach a single AI to be smart enough to handle all these different surgeries without ever seeing the raw data?

Enter SurgFed, a new "teamwork" system for AI. Here is how it works, explained through simple analogies.

The Problem: The "One-Size-Fits-All" Failure

Imagine you are trying to teach a group of chefs to cook a perfect meal.

  • Hospital A uses only fresh, local vegetables.
  • Hospital B uses frozen, imported ingredients.
  • Hospital C uses exotic spices no one else has.

If you force all these chefs to use the exact same recipe (a standard AI model), the results will be terrible. The chef with fresh veggies will ruin the dish by adding too much salt (because the recipe was written for frozen veggies), and the chef with spices will burn the food.

In the world of surgery, this is called Tissue Diversity (different body parts look different) and Task Diversity (some hospitals want to find tools, others want to measure depth). Standard AI methods try to average everyone's learning, which leads to a "compromise" that is good at nothing.
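The "averaging" that standard methods use is federated averaging (FedAvg): each round, the server replaces every hospital's model with a data-size-weighted mean of all of them. A minimal numpy sketch (illustrative, not the paper's code) shows why two very different clients end up with a compromise that may suit neither:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Standard federated averaging: a size-weighted mean of client models.

    client_weights: list of 1-D numpy arrays (flattened model parameters)
    client_sizes: number of training samples at each client
    """
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()        # each client's share of the data
    stacked = np.stack(client_weights)  # shape (num_clients, num_params)
    return (coeffs[:, None] * stacked).sum(axis=0)

# Two clients that learned opposite things: the global model lands in
# the middle, matching neither local optimum.
client_a = np.array([1.0, 0.0])
client_b = np.array([-1.0, 2.0])
global_model = fedavg([client_a, client_b], [100, 100])
print(global_model)  # [0. 1.]
```

Every hospital then starts the next round from that same midpoint, which is exactly the "compromise that is good at nothing" problem SurgFed targets.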

The Solution: SurgFed (The "Smart Team Captain")

SurgFed is a new way for these hospitals to learn together without sharing their secret recipes (data). It uses two clever tricks to make sure every hospital gets a personalized chef's hat that fits them perfectly.

1. The "Language Guide" for Local Chefs (LCS)

  • The Metaphor: Imagine every local chef is given a magic instruction card written in plain English before they start cooking.
  • How it works: Instead of just looking at the video, the AI at each hospital reads a text prompt like: "We are doing a kidney surgery at Hospital A; focus on the shiny metal tools and the red tissue."
  • The Magic: This text acts as a spotlight. It tells the AI, "Hey, ignore the background noise; look only at the specific channels (features) that matter for your specific surgery." It helps the local model adapt instantly to its unique environment without needing to see other people's data.
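One plausible way to read "text as a spotlight on channels" is a gating layer: the prompt's embedding is projected to one weight per feature channel, and the feature map is scaled channel-wise. A minimal numpy sketch under that assumption (the projection `W`, bias `b`, and sigmoid gate are illustrative choices, not the paper's exact LCS design):

```python
import numpy as np

def language_guided_channel_gate(features, text_emb, W, b):
    """Scale each feature channel by a text-derived gate in (0, 1).

    features: (C, H, W) feature map from the vision backbone
    text_emb: (D,) embedding of the hospital's text prompt
    W, b:     projection from text space to one logit per channel, (C, D) and (C,)
    """
    logits = W @ text_emb + b               # one logit per channel
    gates = 1.0 / (1.0 + np.exp(-logits))   # sigmoid -> per-channel weight
    return features * gates[:, None, None]  # emphasize or suppress channels

rng = np.random.default_rng(0)
C, D = 8, 4
features = rng.standard_normal((C, 5, 5))
text_emb = rng.standard_normal(D)   # stand-in for an encoded prompt
W, b = rng.standard_normal((C, D)), np.zeros(C)
out = language_guided_channel_gate(features, text_emb, W, b)
print(out.shape)  # (8, 5, 5)
```

The key property is that the same backbone features get re-weighted differently at each hospital, driven only by that hospital's prompt, so no extra local data is needed for the adaptation.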

2. The "Team Captain" with a Translation Book (LHA)

  • The Metaphor: Now, imagine the chefs send their "learning notes" (gradients) to a central Team Captain. Usually, the Captain just averages the notes. But if Chef A is learning to bake bread and Chef B is learning to grill steak, averaging them makes no sense.
  • How it works: The SurgFed Captain also has the magic instruction cards. When the Captain receives notes from Hospital A, it reads the card: "Ah, Hospital A is doing kidney surgery." When it gets notes from Hospital B, it reads: "Hospital B is doing heart surgery."
  • The Magic: The Captain uses a special "cross-attention" mechanism (like a translator) to understand how these different tasks relate. It doesn't just mash them together; it figures out, "Okay, the way Hospital A learned to spot a scalpel is actually very similar to how Hospital B learned to spot a needle, so let's share that specific insight." It creates a personalized update for each hospital, ensuring the learning is relevant.
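The "translator" can be sketched as attention over client updates keyed by their prompt embeddings: each client's embedding queries all the others', and the softmax similarities become that client's personal mixing weights. This toy version is single-head with no learned projections, an illustrative simplification of the paper's LHA, not its implementation:

```python
import numpy as np

def personalized_aggregation(updates, text_embs, temp=1.0):
    """Cross-attention-style aggregation of federated updates.

    updates:   (K, P) flattened model updates from K clients
    text_embs: (K, D) prompt embeddings, one per client
    Returns:   (K, P) one personalized update per client, each a convex
               combination of all client updates.
    """
    sims = text_embs @ text_embs.T / temp    # (K, K) query-key scores
    sims -= sims.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(sims)
    attn /= attn.sum(axis=1, keepdims=True)  # each row sums to 1
    return attn @ updates                    # weighted mix per client

# Clients 0 and 1 have similar prompts (similar surgeries); client 2 differs.
updates = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 1.0]])
embs = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]])
pers = personalized_aggregation(updates, embs)
print(pers.shape)  # (3, 2)
```

Because the mixing weights come from prompt similarity, clients doing related surgeries borrow heavily from each other, while unrelated ones barely mix, which is the "personalized update for each hospital" described above.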

Why This is a Big Deal

Before SurgFed, trying to train one AI on all these different surgeries was like trying to teach a dog to fly, swim, and climb trees all at once using the same training manual. The dog would get confused and fail at everything.

SurgFed changes the game by:

  1. Respecting Privacy: No hospital ever shares a single pixel of patient video.
  2. Personalization: It gives every hospital a model that is fine-tuned to their specific tools and tissues.
  3. Collaboration: It still lets them learn from each other's successes, just in a smart, guided way.

The Results

The researchers tested this on five different public datasets (like five different cooking competitions). The result? SurgFed beat every other existing method. It didn't just raise the average score; it improved every single hospital's performance significantly, whether the task was segmenting (outlining) surgical tools or estimating the depth of the surgical scene.

In short: SurgFed is like a global masterclass for robot surgeons where everyone learns together, but everyone gets a personalized cheat sheet based on their specific needs, ensuring the robots become safer and smarter for patients everywhere.