Imagine you are looking at a city from a drone. From high up, you can see the morphology: the shape of the buildings, the layout of the streets, and the density of the population. This is what current "Pathology Foundation Models" do. They are like super-smart drones trained on millions of microscope images of human tissue (cells and organs). They are incredible at recognizing shapes, patterns, and structures to diagnose diseases.
But there's a problem: The drone can see the shape of a factory, but it can't tell you what the factory is making inside. Is it producing medicine? Is it making toxic waste? Is it running at full speed or shutting down? In biology, this "what's happening inside" is the molecular state (gene expression).
For a long time, AI could see the city (morphology) but couldn't read the factory's production logs (molecular data).
Enter MINT: The "Bilingual" Translator
The paper introduces a new system called MINT (Molecularly Informed Training). Think of MINT as giving our super-smart drone a bilingual translator and a specialized notebook.
Here is how it works, broken down into simple concepts:
1. The "Two-Notebook" System (The ST Token)
Usually, when an AI tries to learn something new (like reading gene logs), it might accidentally "forget" what it already knew (how to recognize building shapes). This is called "catastrophic forgetting." It's like a chef who learns to play the piano so well they forget how to cook.
MINT solves this by giving the AI two separate mental channels:
- The CLS Token (The Original Chef): This keeps the original knowledge of tissue shapes. It never stops doing what it was good at.
- The ST Token (The New Translator): This is a brand-new "notebook" added specifically to learn the molecular data (gene expression).
By keeping these separate, the AI can learn the new language of genes without overwriting its old knowledge of shapes.
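To make the two-channel idea concrete, here is a toy sketch in plain Python/NumPy. It is not the paper's actual code: the names (cls_head, st_head), dimensions, and learning rate are all made up for illustration. The point is simply that the new ST slot gets its own trainable read-out while the original CLS pathway is left alone.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # embedding dimension (toy size)
N_PATCHES = 4  # image patches per tissue tile
N_GENES = 3    # genes whose expression the ST token predicts

# Token sequence a ViT-style backbone would produce:
# [CLS] keeps the original morphology summary,
# [ST] is the brand-new slot that learns the molecular summary.
cls_token = rng.normal(size=D)
st_token = rng.normal(size=D)
patch_tokens = rng.normal(size=(N_PATCHES, D))  # the "building" tokens

# Two separate read-out heads ("two notebooks").
cls_head = rng.normal(size=(D, 2))   # frozen: original morphology classes
st_head = np.zeros((D, N_GENES))     # trainable: gene-expression predictor

def forward(cls_tok, st_tok):
    morphology_logits = cls_tok @ cls_head  # untouched by new training
    gene_prediction = st_tok @ st_head      # learned from molecular data
    return morphology_logits, gene_prediction

# One toy gradient step on the ST head only (mean-squared error).
target_expression = np.array([1.0, 0.5, 0.0])
_, pred = forward(cls_token, st_token)
grad = np.outer(st_token, pred - target_expression)  # dMSE/dW (up to a constant)
st_head -= 0.01 * grad                               # small illustrative step

# The CLS pathway never changed: no forgetting of morphology knowledge.
morph_before = cls_token @ cls_head
morph_after, _ = forward(cls_token, st_token)
```

After the update, the gene prediction moves toward the target while the morphology output is bit-for-bit identical, which is the whole trick: new knowledge lands in a new slot instead of overwriting the old one.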
2. The "Ghost Teacher" (Distillation)
To make sure the AI doesn't get confused, MINT uses a "Ghost Teacher." Imagine the original, pre-trained AI is a master chef who is frozen in time. The new AI (the student) is allowed to taste new ingredients (gene data), but the Ghost Teacher constantly whispers, "Hey, don't forget how to chop onions!"
This ensures that while the student learns about genes, it stays anchored to its original, high-quality understanding of tissue shapes.
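The "whispering" can be sketched as a feature-distillation penalty: a frozen copy of the pre-trained model produces reference features, and the student pays a cost for drifting away from them on top of its new gene-prediction loss. Again, this is a hypothetical illustration, not the paper's implementation; the weighting lam and all shapes are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_GENES = 8, 3

x = rng.normal(size=D)               # features of one tissue patch

teacher_w = rng.normal(size=(D, D))  # the "Ghost Teacher": frozen in time
student_w = teacher_w.copy()         # the student starts as a copy
gene_head = np.zeros((D, N_GENES))
target_genes = np.array([1.0, 0.5, 0.0])

def losses(student_w, gene_head):
    teacher_feat = x @ teacher_w     # frozen reference ("don't forget onions!")
    student_feat = x @ student_w
    distill = np.mean((student_feat - teacher_feat) ** 2)  # anchor to teacher
    gene_pred = student_feat @ gene_head
    task = np.mean((gene_pred - target_genes) ** 2)        # learn the genes
    lam = 1.0                                              # illustrative weight
    return task + lam * distill, distill

# At initialization the student matches the teacher, so the distillation
# term is zero; only genuine drift away from the old knowledge is punished.
total, distill = losses(student_w, gene_head)
```

Minimizing the combined loss lets the student learn the gene task while being pulled back toward the teacher's original features whenever it starts to forget.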
3. Two Different Magnifying Glasses (Spot vs. Patch)
The paper uses two types of molecular data, like looking at a city with two different lenses:
- The Wide Lens (Visium/Spot-level): This looks at a whole neighborhood (a "spot") and tells you the average activity of all the houses there.
- The Micro Lens (Xenium/Patch-level): This zooms in to see individual molecules inside a single house.
MINT learns from both. It understands the "neighborhood vibe" and the "individual house details" simultaneously, making it much smarter than models that only look at one scale.
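One way to picture how the two lenses relate, in a deliberately simplified sketch with made-up numbers: a wide-lens "spot" reading is roughly the average of the fine-grained patch readings it covers, so a model can be supervised at both scales at once.

```python
import numpy as np

rng = np.random.default_rng(2)
N_PATCHES, N_GENES = 4, 3

# Micro lens (patch-level): expression inside each "house".
patch_expr = rng.uniform(size=(N_PATCHES, N_GENES))

# Wide lens (spot-level): one neighborhood-average measurement.
spot_expr = patch_expr.mean(axis=0)

# Stand-in model predictions, scored at both scales:
patch_pred = rng.uniform(size=(N_PATCHES, N_GENES))
spot_pred = patch_pred.mean(axis=0)        # pool patches to compare with spots

patch_loss = np.mean((patch_pred - patch_expr) ** 2)  # "house details" signal
spot_loss = np.mean((spot_pred - spot_expr) ** 2)     # "neighborhood vibe" signal
total_loss = patch_loss + spot_loss
```

Training on the sum of both terms is what lets a single model stay consistent across scales instead of specializing in only one.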
The Result: A Super-Doctor AI
When the researchers tested MINT, the results were impressive:
- Better at reading the logs: It became much better at predicting gene expression (what the cells are actually doing) compared to previous models.
- Didn't forget the shapes: It didn't lose its ability to diagnose diseases based on tissue shape. In fact, it got slightly better at general tasks too!
The Big Picture:
Before MINT, AI pathologists were like detectives who could only look at the crime scene's layout. MINT gives them a way to read the suspect's diary as well. By combining the visual (what it looks like) with the molecular (what it's doing), MINT creates a more complete, powerful, and accurate understanding of human disease.
It proves that to build the ultimate medical AI, we don't just need more pictures; we need to teach the AI to understand the hidden language of life happening inside the pictures.