Imagine you are trying to teach a brilliant but slightly distracted student how to identify different types of fruit. You show them thousands of photos of apples, oranges, and bananas.
The Problem: The "Background Noise" Student
In the world of medical AI, these "students" are called Foundation Models. They are incredibly smart and have been trained on millions of medical images (slides of tissue) to spot cancer. However, just like our distracted student, they have a bad habit: they don't just learn to recognize the fruit (the biology); they also memorize the background.
If an apple is always photographed in a red kitchen in one hospital, and an orange is always photographed in a blue kitchen in another, the student might start thinking, "Red kitchen = Apple, Blue kitchen = Orange." They aren't actually learning what an apple looks like; they are learning the kitchen.
In real medical terms, this means the AI is getting confused by:
- The Scanner: Different machines take pictures with slightly different colors or sharpness.
- The Lab: How the tissue was cut, stained, or prepared.
If the AI relies on these "kitchen" clues, it will fail when it sees a patient's tissue scanned in a different hospital with a different machine. This is dangerous for real-world medicine.
The Experiment: The "Twin Slide" Test
The researchers in this paper wanted to fix this. They gathered a massive collection of tissue slides from over 6,000 patients. Here is the clever part: for many of these patients, they had two copies of the exact same tissue slice.
- One copy was scanned in the UK.
- The other copy was scanned in Norway.
- Even better, some were scanned on five different types of scanners in Norway.
It's like taking a photo of the same apple with a Canon camera, then an iPhone, then a Nikon. The apple is the same, but the photos look slightly different.
The Solution: The "Twin Teacher" Method
The researchers didn't want to retrain the super-smart Foundation Models (which would take years and huge computers). Instead, they built a new "coach" (a smaller, specific AI) that sits on top of the Foundation Model.
They introduced a special rule during training, which they call Robustness Loss. Think of it like this:
Imagine the student is looking at the "UK Apple" photo and the "Norway Apple" photo. The teacher yells, "Stop! These are the same apple! If you say the UK one is a 'Good Apple' and the Norway one is a 'Bad Apple' just because the lighting is different, you get a penalty!"
They added two specific penalties to the student's homework:
- The Feature Penalty: "If you see the same spot of tissue on two different scanners, your internal description of it must be identical."
- The Score Penalty: "Your final guess (the grade you give the apple) must be the same for both photos."
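The paper's exact loss formulation isn't spelled out in this summary, but the two penalties above can be sketched in a few lines. The sketch below is a minimal illustration, assuming mean-squared-error penalties and hypothetical names (`robustness_loss`, `lam_feat`, `lam_score`); the real method may weight or measure the mismatch differently.

```python
import numpy as np

def robustness_loss(feat_a, feat_b, score_a, score_b,
                    lam_feat=1.0, lam_score=1.0):
    """Penalize disagreement between two scans of the same tissue.

    feat_a/feat_b: the model's internal descriptions (embeddings)
    score_a/score_b: the model's final predictions
    """
    # Feature penalty: same tissue, two scanners -> descriptions should match.
    feature_penalty = np.mean((np.asarray(feat_a) - np.asarray(feat_b)) ** 2)
    # Score penalty: the final guesses should match too.
    score_penalty = np.mean((np.asarray(score_a) - np.asarray(score_b)) ** 2)
    return lam_feat * feature_penalty + lam_score * score_penalty

# Identical views incur no penalty; diverging views are penalized.
f = np.array([0.2, 0.5, 0.1])
s = np.array([0.9])
print(robustness_loss(f, f, s, s))               # -> 0.0
print(robustness_loss(f, f + 0.1, s, s - 0.2))   # -> positive penalty
```

During training, this penalty is simply added to the usual "did you get the diagnosis right?" loss, so the student is graded both on correctness and on consistency across scanners.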
The Results: A Smarter, More Reliable Doctor
When they tested this new method:
- The "Kitchen" Clues Disappeared: The AI stopped caring about which scanner took the picture. It finally learned to look at the actual tissue.
- Accuracy Went Up: By ignoring the noise (the scanner differences), the AI actually got better at spotting the disease. It was like the student finally stopped looking at the background and started focusing on the fruit.
- Consistency: Before, the AI might say "Cancer" when a slide was scanned at Hospital A and "No Cancer" when the very same slide was scanned at Hospital B. Now, it gives the same answer regardless of where the image was taken.
Why This Matters
Currently, if you build an AI in one hospital, it often fails when you try to use it in another hospital because the equipment is different. This paper provides a "plug-and-play" fix. You don't need to rebuild the whole AI; you just need to teach it this new rule about consistency.
The Bottom Line
The researchers found a way to make medical AI far less sensitive to technical quirks in how images are captured. They taught the AI to ignore the "camera" and focus on the "patient." This is a big step toward making AI a reliable tool that doctors can use every day, no matter what scanner they have in their lab.