Imagine you are trying to teach a robot how to perform surgery. To do this, the robot needs to "see" and understand what's happening inside a human body during an operation. But here's the problem: teaching a robot is like teaching a child to read, but you only have a few pages of a very blurry, confusing book.
Most existing medical video datasets are tiny, like a single chapter of a textbook. They have fewer than 100 videos, often less than 30 hours of footage. If you try to train a smart robot on such a small sample, it gets confused. It might think a specific tool is a scalpel when it's actually a pair of scissors, or it might not recognize a specific type of surgery because it's never seen it before.
Enter LEMON: The "Encyclopedia" of Surgery
The researchers behind this paper decided to build a massive library instead of a pamphlet. They created LEMON (Large Endoscopic MONocular Dataset).
- The Scale: Instead of 30 hours, they gathered 938 hours of high-definition surgical videos. That's like watching a movie marathon for 39 days straight!
- The Variety: They didn't just look at one type of surgery. They collected videos for 35 different procedures, from removing a gallbladder to transplanting a kidney, covering both robotic and traditional hand-held surgeries.
- The Source: They found these videos on YouTube. But wait, YouTube has cat videos and cooking shows too. How did they get only surgery?
The "Smart Filter" Pipeline
Imagine you have a mountain of mixed-up DVDs: some are surgery, some are documentaries, some are just people talking about surgery without showing it. You can't watch them all manually.
The team built a digital assembly line (a data curation pipeline) to clean this up:
- The Storyboard Check: They took a quick "snapshot" of every video (like a comic strip of 16 frames) and asked a computer, "Is this a surgery?" If the computer said "No," the video was tossed.
- The Frame Scrub: Even in a surgery video, the beginning might have the surgeon introducing themselves, and the end might have credits. The team trained a computer to spot the actual surgery and cut out the "fluff" at the start and end.
- The "Out-of-Body" Eraser: Sometimes, the camera shows the surgeon's face or the room outside the patient. The computer learned to "blur out" or remove those parts so the robot only sees what's happening inside the body.
- The Human Safety Net: Finally, real human experts (surgeons and researchers) double-checked the work to make sure no mistakes slipped through.
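The four steps above can be sketched as a simple filter chain. This is a toy illustration, not the paper's actual code: the function names (`is_surgery`, `find_surgery_span`, `is_out_of_body`) stand in for the trained classifiers the team used, and a "video" here is just a list of frames.

```python
# Hypothetical sketch of the LEMON curation pipeline described above.
# The classifier/detector functions are stand-ins for trained models.

def sample_storyboard(video, n_frames=16):
    """Pick ~n_frames evenly spaced frames as a quick 'comic strip'."""
    step = max(len(video) // n_frames, 1)
    return video[::step][:n_frames]

def curate(videos, is_surgery, find_surgery_span, is_out_of_body):
    kept = []
    for video in videos:
        # 1. Storyboard check: classify a 16-frame summary of the video.
        if not is_surgery(sample_storyboard(video)):
            continue  # toss non-surgical videos entirely
        # 2. Frame scrub: trim the intro/credits around the real surgery.
        start, end = find_surgery_span(video)
        clip = video[start:end]
        # 3. Out-of-body eraser: drop frames showing the room or faces.
        clip = [frame for frame in clip if not is_out_of_body(frame)]
        kept.append(clip)
    # 4. Human safety net: experts review `kept` before release.
    return kept
```

The key design idea is cheap checks first: a 16-frame storyboard rejects whole videos before any frame-by-frame work is spent on them.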
LemonFM: The "Super-Student" Robot Brain
Once they had this massive library (LEMON), they needed to teach a robot how to learn from it. They created a model called LemonFM (a foundation model: a general-purpose AI brain that can later be adapted to many different tasks).
Think of traditional AI training like giving a student a specific test and saying, "Memorize these answers." If the test changes slightly, the student fails.
LemonFM is different. It's like a medical student who has read every surgery book in the library and watched every surgery video.
- Self-Taught: They used a special technique called "augmented knowledge distillation." Imagine showing the student two slightly different photos of the same surgery (maybe the lighting is different, or the patient is different). The student learns that, "Hey, even though the colors look a bit different, this is still the same tool doing the same job." This teaches the robot to be flexible and not get confused by small changes.
- The Result: When they tested LemonFM on standard medical exams (downstream tasks), it didn't just pass; it crushed it.
- It got better at recognizing surgical phases (knowing if the surgery is just starting or finishing).
- It got better at spotting tools (knowing exactly which instrument is being used).
- It got better at action recognition (understanding what the surgeon is actually doing).
- It got better at segmentation (drawing a perfect outline around organs and tools).
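The "two slightly different photos" idea behind augmented knowledge distillation can be shown with a toy example. Everything here is illustrative, not LemonFM's training code: the "networks" are tiny linear maps, `augment` mimics lighting jitter, and the teacher is a slow moving average of the student (a common self-supervised setup).

```python
# Toy sketch of augmented knowledge distillation: a student learns to
# produce the same features for two augmented views of one image,
# matching a slowly-updated (EMA) teacher. Names/shapes are made up.
import numpy as np

rng = np.random.default_rng(0)
W_student = rng.normal(size=(8, 4))  # tiny linear "student network"
W_teacher = W_student.copy()         # teacher starts as a copy

def augment(x):
    """Mimic lighting/color jitter with a small random perturbation."""
    return x + rng.normal(scale=0.05, size=x.shape)

def embed(W, x):
    return x @ W  # toy feature extractor

for step in range(200):
    x = rng.normal(size=(1, 8))              # one "surgical frame"
    view_a, view_b = augment(x), augment(x)  # two augmented views
    target = embed(W_teacher, view_a)        # teacher sees view A
    pred = embed(W_student, view_b)          # student sees view B
    grad = 2 * view_b.T @ (pred - target)    # MSE gradient for student
    W_student -= 0.01 * grad                 # student gradient step
    # Teacher drifts slowly toward the student (exponential moving avg).
    W_teacher = 0.99 * W_teacher + 0.01 * W_student
```

The point of the two-view trick is exactly the analogy in the text: because the student is rewarded for giving the same answer under different lighting or coloring, it learns features that ignore those surface changes.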
Why Does This Matter?
Think of autonomous surgery as the "self-driving car" of the medical world. Just as self-driving cars need millions of miles of driving data to be safe, surgical robots need millions of hours of surgical data to be safe.
- Safety: A robot trained on LEMON is less likely to make a mistake because it has "seen" almost everything before.
- Efficiency: It can help surgeons work faster and with less fatigue.
- Accessibility: Eventually, this technology could help bring high-quality surgical care to places where expert surgeons are scarce.
The Bottom Line
The paper's message, in plain terms: "We built the biggest, cleanest library of surgery videos ever (LEMON) and used it to train the smartest surgical AI brain yet (LemonFM). This AI is so good that even when it learns from only half the usual amount of labeled training data, it still beats all the other experts."
They are essentially handing the medical community the keys to a massive, high-quality training ground, accelerating the journey toward robots that can one day help perform surgeries with superhuman precision and safety.