The Big Picture: Predicting the Invisible from the Visible
Imagine you are a detective trying to solve a crime. You have a high-resolution photo of a crime scene (a histology image stained with pink and blue dyes). You can see the layout of the room, the furniture, and the people standing there.
However, you can't see what the people are thinking or saying to each other. In biology, this is the difference between looking at a tissue slide under a microscope and knowing the gene expression (the chemical "conversation" happening inside the cells).
Usually, to hear that conversation, scientists have to use expensive, slow, and complex machines (Spatial Transcriptomics). HINGE is a new AI tool that says: "I can look at the photo of the room and predict what the people are saying, without needing the expensive machine."
The Problem: Why is this hard?
The authors identified three main hurdles in building this AI:
- The Language Barrier: The AI models that are really good at understanding genes (called Single-Cell Foundation Models) have only ever "read" text (gene data). They have never "seen" a picture. Trying to make them understand an image is like asking a blind poet to describe a sunset.
- The "One-Size-Fits-All" Trap: Most existing AI tries to guess the answer with a single, rigid calculation (like a calculator). But biology is messy. Two cells that look identical might be saying slightly different things. The AI needs to be flexible, like a jazz musician improvising, rather than a robot following a script.
- The "Forgetting" Problem: If you take a genius gene-expert AI and force it to learn from scratch using limited medical images, it might get confused and forget all the complex rules of biology it already knew. This is called "catastrophic forgetting."
The Solution: HINGE (The Smart Retrofit)
The authors built HINGE (HIstology-coNditioned GEneration). Think of HINGE as a high-tech translator and adapter that connects the "Gene Expert" AI to the "Image" world without breaking the expert's brain.
Here is how it works, step-by-step:
1. The "Ghost" Expert (The Frozen Backbone)
Imagine you have a world-famous chef (the CellFM model) who knows every recipe in the world (gene relationships). But this chef has never seen a kitchen; they only know the ingredients.
Instead of firing the chef and hiring a new one, HINGE keeps the chef exactly as they are. The chef's brain is frozen (frozen weights). We don't want to change their fundamental knowledge of how ingredients mix.
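The "keep the chef frozen" idea can be sketched in a few lines of plain Python. This is a hypothetical toy (the names `backbone_params`, `adapter_params`, and `train_step` are illustrative, not from the paper): gradients arrive for every parameter, but updates are applied only to the small new adapter, so the pretrained knowledge is never overwritten.

```python
# Toy sketch of parameter freezing: the pretrained backbone (the "chef")
# never changes; only the small adapter is trainable.
backbone_params = {"w": 2.0}      # pretrained knowledge (frozen)
adapter_params = {"gate": 0.0}    # new, trainable part

def train_step(grads, lr=0.1):
    # Gradients for frozen backbone parameters are simply ignored.
    for name, g in grads.items():
        if name in adapter_params:        # only the adapter learns
            adapter_params[name] -= lr * g

# Even though a gradient is computed for "w", it is never applied.
train_step({"w": 5.0, "gate": 1.0})
# backbone_params["w"] is still 2.0; adapter_params["gate"] is now -0.1
```

In real frameworks this is usually done by marking backbone parameters as non-trainable, but the effect is the same: the expert's "brain" stays intact.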
2. The "Headset" (SoftAdaLN)
To let the chef see the kitchen, we don't rebuild their brain. Instead, we put a special headset on them (called SoftAdaLN).
- This headset listens to the histology image (the kitchen layout).
- It whispers instructions to the chef: "Hey, this looks like a tumor area, so let's adjust the recipe slightly."
- Crucially, the headset is set to "zero volume" at the start. This ensures the chef starts by cooking exactly as they always have, then slowly learns to listen to the whispers. This prevents the chef from getting confused and forgetting their recipes.
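The "zero volume at the start" trick can be made concrete with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the image proposes a scale and shift for each normalized feature, and a single `gate` value (initialized to zero) controls how loudly that whisper is applied. With `gate = 0`, the output is exactly plain layer normalization, so the frozen model behaves as it always did.

```python
import math

def layer_norm(x, eps=1e-5):
    # Standard layer normalization over a list of features.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def soft_adaln(x, scale, shift, gate):
    # Hypothetical SoftAdaLN-style modulation: image-derived scale/shift,
    # gated by a scalar that starts at zero ("zero volume").
    h = layer_norm(x)
    return [(1 + gate * s) * v + gate * b for v, s, b in zip(h, scale, shift)]

features = [1.0, 2.0, 3.0, 4.0]
img_scale = [0.5, 0.5, 0.5, 0.5]   # "whispers" from the histology image
img_shift = [0.2, 0.2, 0.2, 0.2]

# gate = 0.0 at initialization: identical to the unconditioned model.
assert soft_adaln(features, img_scale, img_shift, 0.0) == layer_norm(features)
```

As training proceeds, the gate moves away from zero and the image's instructions gradually blend in, which is what prevents the "catastrophic forgetting" described above.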
3. The "Fill-in-the-Blanks" Game (Masked Diffusion)
How does the chef generate the prediction?
- Old Way: The AI tries to guess the whole sentence at once. If it gets one word wrong, the whole sentence makes no sense.
- HINGE's Way: The AI plays a game of "Fill in the Blanks."
- It starts with a blank page (no gene data).
- It looks at the image and the chef's knowledge.
- It fills in one gene at a time, then another, slowly revealing the full story.
- Because it fills in the blanks one by one, it can check its work constantly. If it makes a mistake, it can correct it in the next step. This ensures the final story is biologically logical (genes that usually go together, stay together).
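The fill-in-the-blanks loop can be sketched as follows. Everything here is a toy stand-in (the `predict` function is a fake placeholder for the frozen gene model plus adapter): the point is the sampling structure, where we start fully masked and reveal a few genes per step, conditioning each new guess on what is already revealed.

```python
def predict(image_ctx, revealed, position):
    # Toy stand-in for the frozen gene model + image adapter: guesses one
    # masked gene's value from the image and the already-revealed genes.
    return image_ctx + 0.1 * sum(revealed.values())

def masked_diffusion_sample(image_ctx, n_genes, steps):
    revealed = {}                      # start from a fully "blank page"
    order = list(range(n_genes))       # (real samplers pick positions smartly)
    per_step = max(1, n_genes // steps)
    while len(revealed) < n_genes:
        # Fill in a few more blanks, conditioning on what is already
        # written, so genes that belong together stay consistent.
        for pos in order[len(revealed):len(revealed) + per_step]:
            revealed[pos] = predict(image_ctx, revealed, pos)
    return [revealed[i] for i in range(n_genes)]

profile = masked_diffusion_sample(image_ctx=0.5, n_genes=6, steps=3)
```

Because each step sees all earlier answers, a mistake made early can be compensated for later, unlike a single one-shot regression.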
4. The "Warm-Up" (Curriculum Learning)
When you start teaching a new skill, you don't start with the hardest level. You start easy.
HINGE uses a Warm-Start Curriculum. At the beginning of training, it only asks the AI to fill in a few blanks (easy steps). As the AI gets better, it asks it to fill in more blanks at once. This stabilizes the learning process so the AI doesn't crash and burn early on.
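A warm-start curriculum often boils down to a schedule on the masking ratio. The function below is a hypothetical linear schedule (the start/end values and linear shape are illustrative assumptions, not numbers from the paper): early in training only a small fraction of genes is masked, and the fraction ramps up as the model improves.

```python
def mask_ratio(step, total_steps, start=0.15, end=0.9):
    # Hypothetical warm-start schedule: begin with easy problems (few
    # blanks to fill), then linearly ramp up the difficulty.
    t = min(step / total_steps, 1.0)
    return start + (end - start) * t

# Early training: mask only 15% of genes; by the end: mask 90%.
early = mask_ratio(0, 1000)     # 0.15
late = mask_ratio(1000, 1000)   # 0.9
```

Starting easy keeps early gradients well-behaved, which is the "doesn't crash and burn" part of the analogy.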
Why is this a Big Deal?
The paper tested HINGE on three different types of tissue (skin cancer, breast cancer, and kidney). Here is what happened:
- Better Accuracy: HINGE predicted gene expression more accurately than any previous method, whether they were "regression" (calculator) models or other "generative" (jazz musician) models.
- Biological Sense: Because HINGE kept the "Gene Expert" intact, the predictions weren't just random numbers. They respected the complex relationships between genes. For example, if Gene A usually turns on Gene B, HINGE predicted that relationship correctly. Other models often broke these links.
- Visual Clarity: When the researchers visualized the results, HINGE produced maps that closely matched real biological patterns, whereas other models produced blurry, smeared-out guesses.
The Takeaway
HINGE is like taking a brilliant, specialized translator (the gene model) and giving them a pair of glasses (the image adapter) so they can translate a photo into a biological story. By being careful not to change the translator's brain, but just giving them a way to see the new context, the AI can predict complex biological data from simple microscope images with high accuracy and biological sense.
This opens the door to using cheap, common microscope slides to get expensive, detailed genetic insights, potentially revolutionizing how doctors diagnose diseases.