MB-DSMIL-CL-PL: Scalable Weakly Supervised Ovarian Cancer Subtype Classification and Localisation Using Contrastive and Prototype Learning with Frozen Patch Features

Imagine you are a detective trying to solve a massive mystery, but instead of a crime scene, you are looking at a gigantic, high-resolution map of a city (a whole slide image of ovarian tissue). Your job is to figure out what kind of "criminal" (cancer subtype) is hiding in this city and exactly where they are living.

The problem? The map is so huge that looking at every single brick (a small image patch) one by one would take a human lifetime. Also, you don't have a list of exactly which bricks are bad; you only have a note saying, "There is a criminal somewhere in this whole city." This is called weak supervision.

Here is how the authors of this paper, Marcus Jenkins and his team, solved this puzzle using their new method, MB-DSMIL-CL-PL.

1. The Old Way: The "Frozen Library" vs. The "Exhaustive Search"

In the past, AI detectives had two main ways to work:

The Frozen Library (Traditional Method): They used a pre-written encyclopedia (frozen features) to describe every brick in the city. It was fast and cheap, but the encyclopedia was a bit outdated. It couldn't tell the difference between very similar-looking criminals, so the detective often made mistakes.
The Exhaustive Search (End-to-End Method): The detective tried to learn everything from scratch by looking at every single brick in real-time. This was very accurate but required a supercomputer and took forever. It wasn't scalable for real hospitals.

The Goal: The team wanted the speed and low cost of the "Frozen Library" but the sharp eyes of the "Exhaustive Search."

2. The New Solution: MB-DSMIL-CL-PL

The team built a new detective system that acts like a smart team of specialists working together. Here is the breakdown using a simple analogy:

A. The "Frozen Library" is still the base (Scalability)

They still use the pre-written encyclopedia (frozen features) to describe the bricks. This keeps the system fast and cheap, so hospitals can actually use it.

B. The "Multi-Branch" Detective (MB-DSMIL)

Imagine the detective has a team of specialists, one for each type of criminal (Serous, Mucinous, etc.).

Old Way: One general detective looked at the whole city and guessed.
New Way: Each specialist focuses only on their specific criminal type. If the "Serous Specialist" sees a clue, they shout, "That looks like a Serous criminal!" This prevents confusion between different types of cancer.

C. The "Contrastive Learning" Gym (CL)

This is the secret sauce. Imagine the detective is training in a gym.

The Workout: The detective takes a picture of a "criminal brick" and creates two slightly different versions of it (like adding a little blur or noise).
The Lesson: The detective is taught: "Even though these two pictures look slightly different, they are the same criminal. But this other picture? That's a totally different criminal."
The Result: The detective learns to recognize the essence of the criminal, not just the specific lighting or angle. This makes them much better at spotting the bad guys even when they are hiding or look slightly different.

D. The "Prototype" Memory Board (PL)

The detective keeps a Wanted Poster Board for each type of criminal.

How it works: As the detective looks at bricks, they update these posters. If they see a brick that looks like a "Mucinous Criminal," they stick a note on the Mucinous poster.
The Magic: Over time, these posters become the perfect "average" of what that criminal looks like. When the detective sees a new brick, they just ask, "Does this look more like the Mucinous poster or the Serous poster?" This stops the detective from getting confused by weird, one-off examples.

3. The Results: Why This Matters

The team tested their new detective against the old methods using real ovarian cancer slides.

Accuracy: Compared with the DSMIL baseline, the new method improved macro F1 by about 70% (relative) when labelling individual bricks (instances) and by about 15% (relative) when labelling the whole slide.
Precision: It didn't just guess; it could point a finger and say, "The cancer is right here," with much higher accuracy.
Efficiency: It did all this without needing a supercomputer. It kept the "Frozen Library" speed but got the "Exhaustive Search" brainpower.

The Big Picture

Think of this paper as upgrading a GPS navigation system.

Old GPS: It knew the general roads but got lost in complex neighborhoods and couldn't tell you exactly which house was the destination.
New GPS: It uses the same map data (to stay fast) but adds a smart AI that learns from traffic patterns (contrastive learning) and remembers the specific look of every street (prototypes). Now, it can tell you exactly which house is the target and what kind of neighborhood it is in, instantly.

Why is this a big deal?
Ovarian cancer is deadly, in part because it's often found too late, and the right treatment depends on the subtype. Pathologists (the human doctors) are overwhelmed with work. A tool like this could act as an efficient assistant that sorts through thousands of slides, flags the likely subtype, and shows the doctor where to look, which may speed up diagnosis and treatment planning.

1. Problem Statement

The paper addresses the critical challenge of ovarian cancer subtype classification and localisation using Whole Slide Images (WSIs).

Clinical Context: Ovarian cancer has a high mortality rate, often due to late-stage diagnosis. Early detection and precise histological subtyping (e.g., High-grade serous carcinoma, Clear cell carcinoma, Borderline tumors) are essential for personalized treatment.
Computational Bottleneck: Traditional deep learning approaches for WSIs often rely on Multiple Instance Learning (MIL).
- Frozen Feature Approaches: Use pre-computed patch features (e.g., from ResNet or Vision Transformers) to ensure scalability. However, these are limited by the discriminative power of the frozen feature space.
- End-to-End Approaches: Train feature extractors alongside the classifier for better accuracy but suffer from massive computational costs and memory usage, making them unscalable for large WSI datasets.
Specific Gap: Existing MIL methods (like DSMIL and CLAM) often struggle with multi-class subtype classification (beyond binary cancer vs. normal) and rely on unstable pseudo-labeling strategies that can lead to confirmation bias or error propagation. Furthermore, few methods effectively combine the scalability of frozen features with the performance gains of contrastive learning.

2. Methodology: MB-DSMIL-CL-PL

The authors propose MB-DSMIL-CL-PL, a novel weakly supervised MIL framework that maintains the scalability of frozen features while integrating Contrastive Learning (CL) and Prototype Learning (PL).

A. Data Preprocessing & Feature Extraction

Patch Extraction: WSIs are segmented to remove background, then divided into non-overlapping $224 \times 224$ patches at $10\times$ magnification.
Frozen Encoder: All patches are embedded using UNI (a Vision Transformer pretrained with DINOv2 on more than 100 million histopathology image patches). These embeddings are pre-computed once and kept frozen during training, which keeps memory overhead low.
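The tiling step described above can be illustrated with a minimal sketch. It assumes a boolean tissue mask is already available at the working magnification; the function name `patch_grid` and the `min_tissue_fraction` threshold are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

PATCH_SIZE = 224  # patch side length in pixels, as stated in the text


def patch_grid(tissue_mask: np.ndarray, patch_size: int = PATCH_SIZE,
               min_tissue_fraction: float = 0.5) -> list:
    """Return (top, left) pixel offsets of non-overlapping patches with enough tissue.

    tissue_mask is a 2D boolean array (True = tissue, False = background).
    min_tissue_fraction is an assumed threshold, not a value from the paper.
    """
    height, width = tissue_mask.shape
    coords = []
    # Stepping by patch_size guarantees no overlap; partial edge patches are dropped.
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            window = tissue_mask[top:top + patch_size, left:left + patch_size]
            if window.mean() >= min_tissue_fraction:
                coords.append((top, left))
    return coords


# Toy mask of 448 x 672 pixels whose left third is tissue.
mask = np.zeros((448, 672), dtype=bool)
mask[:, :224] = True
coords = patch_grid(mask)
```

Each retained coordinate would then be cropped and passed through the frozen encoder once, with the resulting embedding cached on disk so that training never touches the pixels again.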

B. Core Architectural Innovations

The method builds upon DSMIL (Dual-Stream MIL) but introduces three key modifications:

Multi-Branch DSMIL (MB-DSMIL):
- Standard DSMIL uses a shared query projection for all classes. The authors replace this with class-specific query projection functions ( $\phi_c$ ).
- This allows the attention mechanism to compute class-specific relevance scores, enhancing the model's ability to distinguish between multiple subtypes (e.g., distinguishing Mucinous Adenocarcinoma from Endometrioid Adenocarcinoma).
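A rough sketch of class-specific query projections, under several stated assumptions: each $\phi_c$ is modelled as a single linear map, the "critical" instance for class $c$ is the one with the highest instance score for that class (as in DSMIL), and the raw features are used as values. The function and argument names are illustrative, not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_branch_attention(features, instance_scores, query_weights):
    """Per-class attention over the instances of one slide.

    features:        (N, D) frozen patch embeddings.
    instance_scores: (N, C) per-instance class scores.
    query_weights:   (C, D, Q) one linear query projection per class (assumed form of phi_c).
    Returns attention of shape (N, C) and per-class bag embeddings of shape (C, D).
    """
    n_instances, n_classes = instance_scores.shape
    attention = np.empty((n_instances, n_classes))
    bag_embeddings = np.empty((n_classes, features.shape[1]))
    for c in range(n_classes):
        queries = features @ query_weights[c]             # (N, Q), class-specific
        critical = int(np.argmax(instance_scores[:, c]))  # top-scoring instance for class c
        # Similarity of every instance to the critical instance, scaled by sqrt(Q).
        logits = queries @ queries[critical] / np.sqrt(queries.shape[1])
        attention[:, c] = softmax(logits, axis=0)
        bag_embeddings[c] = attention[:, c] @ features    # attention-weighted sum
    return attention, bag_embeddings


rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))          # 6 patches, 8-dim toy embeddings
instance_scores = rng.normal(size=(6, 3))   # 3 toy classes
query_weights = rng.normal(size=(3, 8, 4))
attention, bag_embeddings = multi_branch_attention(features, instance_scores, query_weights)
```

With a shared projection, `query_weights[c]` would be the same matrix for every class; giving each class its own matrix lets the relevance of a patch differ per subtype.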
Contrastive Learning (CL) on Feature Space:
- Instead of training an image encoder end-to-end (which is memory-intensive), the authors apply SimCLR-style augmentations directly in the feature space of the frozen UNI embeddings.
- Augmentation: Normalized Gaussian noise is added to feature vectors while preserving the sign of dimensions to prevent semantic drift.
- Encoders: Simple Multi-Layer Perceptrons (MLPs) act as query and key encoders. The key encoder is updated via Exponential Moving Average (EMA).
- Loss: A warm-up phase uses unsupervised MoCo-style loss, followed by Supervised Contrastive Loss using pseudo-labels. This encourages clustering of instances belonging to the same subtype within the feature space.
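The feature-space contrastive step can be sketched as follows. This is only one plausible reading of the description: the sign-preservation rule (magnitude from the perturbed vector, sign from the original), the single linear layers standing in for the MLP encoders, the in-batch negatives (MoCo itself uses a queue), and all hyperparameter values are assumptions made for illustration.

```python
import numpy as np


def augment_features(features, noise_scale, rng):
    """Add unit-normalised Gaussian noise, then restore each dimension's original sign."""
    noise = rng.normal(size=features.shape)
    noise /= np.linalg.norm(noise, axis=1, keepdims=True)
    perturbed = features + noise_scale * noise
    # Magnitude comes from the perturbed vector, sign from the input.
    return np.sign(features) * np.abs(perturbed)


def info_nce(queries, keys, temperature=0.1):
    """InfoNCE loss; queries[i] and keys[i] are two views of the same instance."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    logits = q @ k.T / temperature
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal; all other entries act as negatives.
    return float(-np.mean(np.diag(log_prob)))


def ema_update(key_params, query_params, momentum=0.99):
    """Move the key encoder slowly towards the query encoder."""
    return momentum * key_params + (1.0 - momentum) * query_params


rng = np.random.default_rng(0)
features = rng.normal(size=(16, 32))       # toy stand-in for frozen embeddings
view_a = augment_features(features, 0.1, rng)
view_b = augment_features(features, 0.1, rng)
w_query = rng.normal(size=(32, 8))         # linear stand-in for the query MLP
w_key = w_query.copy()                     # key encoder starts as a copy
loss = info_nce(view_a @ w_query, view_b @ w_key)
w_key = ema_update(w_key, w_query)
```

In the supervised phase described above, the positive set for an instance would be widened from "the other view of the same instance" to "all instances sharing its pseudo-label"; that extension is omitted here to keep the sketch short.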
Prototype Learning (PL):
- To stabilize pseudo-labeling, the method introduces class prototypes ( $\mu_c$ ) for every subtype (including normal tissue).
- Prototypes are updated via EMA based on instance features.
- Pseudo-Labeling: Instead of relying solely on the "most confident instance" (as in DSMIL) or attention scores (as in CLAM), instances are assigned soft pseudo-labels based on their similarity to the class prototypes. This reduces confirmation bias and error propagation.
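The prototype mechanism can be sketched in a few lines. The sketch assumes cosine similarity followed by a temperature-scaled softmax for the soft labels, and an EMA update driven by the mean of the instances currently assigned to each class; the temperature, the momentum, and the assignment rule are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np


def l2_normalise(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def soft_pseudo_labels(instance_features, prototypes, temperature=0.1):
    """Soft labels (N, C) from cosine similarity between instances and class prototypes."""
    sims = l2_normalise(instance_features) @ l2_normalise(prototypes).T
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def update_prototypes(prototypes, instance_features, pseudo_labels, momentum=0.9):
    """EMA update of each prototype towards the mean of its assigned instances."""
    updated = prototypes.copy()
    hard = pseudo_labels.argmax(axis=1)
    for c in range(prototypes.shape[0]):
        members = instance_features[hard == c]
        if len(members) > 0:  # classes with no assigned instance stay unchanged
            updated[c] = momentum * prototypes[c] + (1.0 - momentum) * members.mean(axis=0)
    return updated


# Toy setup: 3 classes in a 4-dimensional feature space.
prototypes = np.eye(3, 4)
instances = np.array([[1.0, 0.0, 0.0, 0.0],
                      [0.0, 0.9, 0.1, 0.0],
                      [0.0, 0.0, 2.0, 0.0]])
labels = soft_pseudo_labels(instances, prototypes)
new_prototypes = update_prototypes(prototypes, instances, labels)
```

Because a prototype averages over many instances and moves slowly, a single misleading patch shifts it only slightly, which is the intuition behind the claimed reduction in confirmation bias.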

3. Key Contributions

Scalable Multi-Class MIL: Presented by the authors as the first approach to integrate contrastive learning and prototype learning into a MIL framework using pre-computed frozen features. The aim is to recover much of the performance gain associated with end-to-end training without the computational cost of re-training image encoders.
Architectural Improvements: Introduction of Multi-Branch DSMIL with class-specific attention projections, significantly improving multi-class discrimination compared to standard shared-projection DSMIL.
Robust Pseudo-Labeling: Replacing heuristic pseudo-labeling (confidence/attention) with prototype-based soft labeling, which provides superior stability for multi-class subtype classification.
Feature-Space Augmentation: Demonstrating that applying SimCLR-style augmentations to frozen embeddings effectively adapts general-purpose feature extractors (UNI) for specific downstream tasks.

4. Experimental Results

The method was evaluated on the DROV dataset (137 ovarian cancer slides with 8 distinct classes: 7 subtypes + normal).

Slide-Level Classification:
- MB-DSMIL-CL-PL achieved a Macro F1 score of 0.775, outperforming DSMIL (0.672) and CLAM (0.757).
- AUC also improved (0.974 vs. 0.952 for DSMIL).
Instance-Level Classification (Localisation):
- The proposed method achieved a 70.4% relative improvement in Macro F1 (0.513 vs. 0.301 for DSMIL) and a 16.9% relative gain in AUC (0.913 vs. 0.781).
- This points to considerably less confusion between subtypes and between cancerous and normal tissue.
Localisation Quality:
- Attention maps generated by MB-DSMIL-CL-PL were less sparse and more aligned with ground-truth pixel annotations compared to baselines.
- Qualitative analysis (Figures 3-5) showed reduced misclassification of normal tissue as cancerous and better separation of specific subtypes.
Class-Specific Performance:
- The model showed particular strength in minority classes (e.g., Mucinous Adenocarcinoma and Mucinous Borderline Tumors), where baseline models often failed.
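The percentage gains quoted for the results are relative improvements over the DSMIL baseline, not absolute differences in score. A short check using only the figures reported above (the helper name `relative_gain` is just for illustration):

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# (proposed method, DSMIL baseline) pairs as reported in the text.
reported = {
    "slide-level macro F1": (0.775, 0.672),
    "instance-level macro F1": (0.513, 0.301),
    "instance-level AUC": (0.913, 0.781),
}

gains = {name: round(relative_gain(new, base), 1)
         for name, (new, base) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}% relative to DSMIL")
```

For comparison, the absolute instance-level macro F1 difference is 0.212, so the 70.4% figure mainly reflects how low the baseline score was.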

5. Significance and Conclusion

Clinical Impact: The ability to accurately classify and localise multiple ovarian cancer subtypes using only slide-level labels (weak supervision) can significantly reduce the diagnostic workload for pathologists and improve treatment personalization.
Computational Efficiency: By maintaining the use of frozen patch features, the method offers a "best of both worlds" solution: the high accuracy of end-to-end contrastive learning with the scalability and low memory footprint of traditional MIL.
Future Directions: The authors suggest extending this framework to multi-resolution inputs and applying it to binary classification benchmarks (like CAMELYON) to further validate its generalizability.

In summary, MB-DSMIL-CL-PL is a notable step forward for computational pathology: it narrows the scalability vs. accuracy trade-off in weakly supervised learning for complex, multi-class histopathological tasks while still relying only on frozen patch features.