CLM-X: A multimodal single-cell foundation model with flexible multi-way Transformer for unified scRNA-seq and scATAC-seq analysis

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine the human body as a massive, bustling city. For a long time, scientists could only take a census of this city in two very different ways:

The "What's Being Said" Census (scRNA-seq): This counts the active conversations in every building (cell). It tells us which genes are "speaking" loudly (high expression) and which are whispering or silent.
The "Blueprints" Census (scATAC-seq): This looks at the construction blueprints and open doors. It tells us which parts of the DNA are unlocked and ready to be read, even if they aren't being spoken about right now.

The Problem:
Until now, these two censuses were like two different languages spoken by two different groups of people. Scientists had to use complicated translators to try to understand how the "blueprints" (ATAC) relate to the "conversations" (RNA). Existing tools were like rigid translators: they could only handle specific pairs of sentences, they got confused by large crowds (millions of cells), and they often missed the subtle connections between the two languages.

The Solution: CLM-X
The authors of this paper built CLM-X, a new "Super-Translator" or a Multimodal Foundation Model. Think of it as a brilliant, all-knowing librarian who has read every single book in the city's library (millions of cells) and learned the deep, hidden rules of how the city works.

Here is how CLM-X works, using simple analogies:

1. Speaking a Common Language (Tokenization)

The biggest hurdle was that RNA and ATAC data look nothing alike. RNA is like a list of words; ATAC is like a massive grid of switches.

CLM-X's Trick: It forces both types of data into the same format. It chops the DNA blueprints into small "patches" (like cutting a map into puzzle pieces) and turns the gene conversations into "tokens" (like words in a sentence).
The Result: Now, the model can read a sentence that mixes "words" (RNA) and "puzzle pieces" (ATAC) all at once, understanding them as one continuous story.

2. The Three-Branch Brain (Multi-way Transformer)

Imagine a brain with three specialized departments that share a central library:

Department A (RNA Specialist): Reads only the "conversation" data.
Department B (ATAC Specialist): Reads only the "blueprint" data.
Department C (The Mixer): Reads both together to see how the blueprints cause the conversations.
The Magic: All three departments use one "Shared Attention" mechanism, so they can look at each other's notes instantly. If the Blueprint Department sees a door open, the Conversation Department immediately knows to expect a loud shout. This allows the model to learn from unpaired data (just blueprints or just conversations) and still understand the connection, which is a huge advantage because paired data is rare.

3. The Training Method (Stage-wise Masked Reconstruction)

How do you teach this librarian? You don't just give them the answers; you play a game of "Fill in the Blanks."

Phase 1: You show the librarian a page of text with 15% of the words covered up (masked) and ask them to guess the missing words based on the context.
Phase 2: You do the same with the blueprints.
Phase 3 (The Master Class): You show them a page with both text and blueprints, but you cover up the blueprints and ask, "Based on the text, what should the blueprints look like?" Then you flip it: cover the text and ask, "Based on the blueprints, what should the text say?"
Why this matters: Predicting each data type from the other forces the model to learn how the blueprints and the conversations are linked, rather than just memorizing patterns within one of them.

What Can CLM-X Do? (The Superpowers)

Once trained, CLM-X acts like a Swiss Army knife for biologists:

The Great Unifier (Batch Correction): Imagine photos of the same city taken at noon, at midnight, and during a storm. They look totally different. CLM-X can strip away the "lighting" (technical noise) and show you the same underlying city, regardless of when or how the photo was taken.
The Time Traveler (Cross-Modal Translation): If a scientist only has the "blueprints" (ATAC) for a new cell type but no "conversation" data (RNA), CLM-X can predict what the cell is likely saying. It can also go the other way: predict the blueprints from the conversation. It's like predicting the weather just by looking at the birds' behavior.
The Detective (Cell Type ID): It can look at a messy, mixed-up crowd of cells and sort them into their specific neighborhoods (cell types) with high accuracy, even if the data is noisy or incomplete.
The Predictor (Perturbation): If you knock out a specific gene (like removing a brick from a building), CLM-X can predict how the whole city is likely to react and change its behavior before the experiment is even run.

Why This Matters

Before CLM-X, scientists had to use different tools for different jobs, and those tools often struggled with the sheer size of modern data. CLM-X is a foundation model—meaning it's a single, massive brain that has learned the "grammar" of life. It can be adapted to almost any single-cell question, making biological discovery faster, more accurate, and capable of handling the massive datasets being generated today.

In short, CLM-X is the first model that truly speaks the combined language of our cells' blueprints and their conversations, allowing us to understand the city of life with unprecedented clarity.

1. Problem Statement

The field of single-cell biology is transitioning from unimodal to multimodal profiling (e.g., jointly measuring transcriptomics and epigenomics). However, current computational methods face four critical bottlenecks:

Modality Heterogeneity: scRNA-seq (gene expression counts) and scATAC-seq (chromatin accessibility peaks) have vastly different data structures, dimensions, and noise characteristics, making unified modeling difficult.
Data Scarcity & Bias: High-quality paired multi-omics data is scarce compared to the abundance of unpaired unimodal data. Existing methods often rely heavily on paired data or require specific preprocessing that introduces bias.
Lack of Flexibility: Most existing models are task-specific (e.g., designed only for batch correction or only for integration) or rely on contrastive learning objectives that struggle to capture complementary information across modalities or handle unpaired data effectively.
Scalability: Current methods struggle to scale to million-cell datasets required for large atlas initiatives (e.g., Human Cell Atlas).

2. Methodology: CLM-X Architecture

CLM-X is a foundation model built on the BEiT-3 architecture, adapted for single-cell biology using a Multi-way Transformer.

A. Unified Tokenization Strategy

To handle heterogeneous inputs within a single framework, CLM-X employs a harmonized tokenization scheme:

scRNA-seq: Genes are treated as tokens. Expression values are discretized into rank-preserving bins (50 bins) to normalize scale differences. Sequences are padded/truncated to a fixed length of 2,000 tokens.
scATAC-seq: Due to the ultra-high dimensionality (~1M peaks), peaks are grouped into contiguous genomic "patches" (approx. 575 peaks/patch). These patches are binarized (accessible/not accessible). Sequences are also fixed to 2,000 tokens.
Paired Input: For cells with both modalities, RNA and ATAC token sequences are concatenated into a single context window of up to 4,000 tokens, allowing direct cross-modal interaction via self-attention.
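
The tokenization steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the per-cell quantile binning, the rule that a patch counts as accessible if any of its peaks is open, the padding value, and the choice to truncate by position are all assumptions, and the gene-identity tokens are left out so that only the value channel is shown. At roughly 575 peaks per patch, about one million peaks give roughly 1,740 patches, which fits within the 2,000-token limit.

```python
import numpy as np

N_BINS = 50            # expression bins (from the text)
SEQ_LEN = 2000         # fixed tokens per modality (from the text)
PEAKS_PER_PATCH = 575  # approximate patch size (from the text)
PAD_ID = 0             # assumed padding value, not specified in the source


def bin_expression(counts, n_bins=N_BINS):
    """Map nonzero counts to rank-preserving bins 1..n_bins; zeros stay 0."""
    counts = np.asarray(counts, dtype=float)
    binned = np.zeros(counts.shape, dtype=int)
    nonzero = counts > 0
    if nonzero.any():
        # Interior quantile edges of this cell's nonzero values (assumed scheme).
        quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
        edges = np.quantile(counts[nonzero], quantiles)
        binned[nonzero] = np.digitize(counts[nonzero], edges, right=True) + 1
    return binned


def patch_peaks(peaks, patch_size=PEAKS_PER_PATCH):
    """Group contiguous binary peaks into patches (1 if any peak is open)."""
    peaks = np.asarray(peaks, dtype=int)
    n_patches = -(-peaks.size // patch_size)  # ceiling division
    padded = np.zeros(n_patches * patch_size, dtype=int)
    padded[:peaks.size] = peaks
    return padded.reshape(n_patches, patch_size).max(axis=1)


def fix_length(tokens, length=SEQ_LEN, pad_id=PAD_ID):
    """Pad or truncate a 1-D token array to a fixed length."""
    out = np.full(length, pad_id, dtype=int)
    n = min(length, tokens.size)
    out[:n] = tokens[:n]
    return out


def tokenize_paired(rna_counts, atac_peaks):
    """Concatenate fixed-length RNA and ATAC sequences into one context."""
    rna = fix_length(bin_expression(rna_counts))
    atac = fix_length(patch_peaks(atac_peaks))
    return np.concatenate([rna, atac])  # 2,000 + 2,000 positions
```

Because the bins are computed from ranks within each cell, two cells sequenced at different depths land on the same 1 to 50 scale, which is the normalization role the text attributes to binning.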

B. Multi-way Transformer Backbone

The model utilizes a Shared Multi-Head Self-Attention (MHSA) mechanism coupled with Modality-Specific Feed-Forward Networks (FFN):

Shared Attention: All modalities (RNA, ATAC, or Paired) share the same attention layers, enabling deep fusion and information exchange.
Specialized Experts: The FFN layers are split into three experts: R-FFN (for RNA-only), A-FFN (for ATAC-only), and RA-FFN (for paired inputs). This allows the model to learn modality-specific representations while maintaining a unified latent space.
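
To make the shared-attention and expert-FFN split concrete, here is a simplified single block in NumPy. It is a sketch under stated assumptions rather than the CLM-X implementation: attention is single-head instead of multi-head, layer normalization is omitted, weights are random, and the class name `MultiwayBlock` and the modality keys `"rna"`, `"atac"`, and `"paired"` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MultiwayBlock:
    """Shared self-attention followed by a modality-specific FFN expert."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        s_in, s_hid = 1.0 / np.sqrt(dim), 1.0 / np.sqrt(hidden)
        # Shared attention projections, used for every input type.
        self.wq = rng.normal(0, s_in, (dim, dim))
        self.wk = rng.normal(0, s_in, (dim, dim))
        self.wv = rng.normal(0, s_in, (dim, dim))
        # One FFN expert per input type (R-FFN, A-FFN, RA-FFN in the text).
        self.experts = {
            name: (rng.normal(0, s_in, (dim, hidden)),
                   rng.normal(0, s_hid, (hidden, dim)))
            for name in ("rna", "atac", "paired")
        }

    def attention(self, x):
        """Single-head scaled dot-product self-attention over all tokens."""
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        weights = softmax(q @ k.T / np.sqrt(x.shape[-1]))
        return weights @ v

    def __call__(self, x, modality):
        x = x + self.attention(x)        # same weights for every modality
        w1, w2 = self.experts[modality]  # expert chosen by input type
        return x + np.maximum(x @ w1, 0.0) @ w2  # ReLU FFN with residual
```

Calling `MultiwayBlock(dim=8, hidden=16)(x, "rna")` and then the same block with `"atac"` reuses identical attention weights but different FFN weights, which is the mechanism that lets unimodal pretraining update parameters the paired stage later inherits.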

C. Stage-wise Pretraining Strategy

To maximize the utility of abundant unimodal data while learning cross-modal relationships from scarce paired data, CLM-X uses a three-stage pretraining pipeline with parameter inheritance:

Stage 1 (RNA-only): Masked reconstruction of RNA expression values to initialize shared attention parameters.
Stage 2 (ATAC-only): Masked reconstruction of ATAC peak values, inheriting weights from Stage 1.
Stage 3 (Paired RNA-ATAC): A two-phase conditional reconstruction on paired data:
- Phase 1: Mask ATAC tokens and reconstruct them conditioned on visible RNA.
- Phase 2: Mask RNA tokens and reconstruct them conditioned on visible ATAC.
- Goal: This bidirectional conditional learning forces the model to capture alignment and complementary information without relying on contrastive loss.
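
The two Stage 3 phases differ only in which half of the paired sequence is hidden. The sketch below builds the mask for each phase and scores a reconstruction on the hidden positions only. The source does not state the mask ratio or the loss function, so `mask_ratio` (defaulting to hiding the whole target segment) and the mean-squared-error loss are assumptions made for illustration.

```python
import numpy as np

RNA_LEN = 2000   # RNA tokens per cell (from the text)
ATAC_LEN = 2000  # ATAC tokens per cell (from the text)


def conditional_mask(target, mask_ratio=1.0, rng=None):
    """Boolean mask over a paired sequence [RNA | ATAC]; True means hidden.

    target="atac" hides ATAC tokens (Phase 1); target="rna" hides RNA
    tokens (Phase 2). mask_ratio is a hypothetical parameter.
    """
    if target not in ("rna", "atac"):
        raise ValueError("target must be 'rna' or 'atac'")
    if rng is None:
        rng = np.random.default_rng()
    mask = np.zeros(RNA_LEN + ATAC_LEN, dtype=bool)
    start, length = (RNA_LEN, ATAC_LEN) if target == "atac" else (0, RNA_LEN)
    n_hidden = int(round(mask_ratio * length))
    hidden = rng.choice(length, size=n_hidden, replace=False)
    mask[start + hidden] = True
    return mask


def masked_mse(pred, truth, mask):
    """Mean squared error computed on masked positions only (assumed loss)."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    diff = (pred - truth)[mask]
    return float(np.mean(diff ** 2))
```

Restricting the loss to hidden positions means the visible modality contributes only as context, so the model is rewarded for what it can infer across modalities rather than for copying its input.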

D. Downstream Adaptation

CLM-X is fine-tuned for five specific tasks using lightweight task-specific heads (decoders or classifiers) while keeping the core encoder frozen or lightly fine-tuned:

Batch Correction
Multimodal Integration
Cross-modal Translation (RNA $\leftrightarrow$ ATAC)
Cell Type Annotation
Perturbation Response Prediction
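
As one illustration of a lightweight head on a frozen encoder, the sketch below fits a softmax classifier on fixed cell embeddings, as might be done for cell type annotation. The embeddings stand in for encoder outputs that are never updated. The function names, learning rate, and step count are hypothetical, and the source does not describe the actual head architectures.

```python
import numpy as np


def train_linear_head(embeddings, labels, n_classes, lr=0.1, steps=300):
    """Fit a softmax classifier on fixed embeddings by gradient descent."""
    n, dim = embeddings.shape
    w = np.zeros((dim, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = embeddings @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # d(mean cross-entropy)/d(logits)
        w -= lr * embeddings.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b


def predict(embeddings, w, b):
    """Return the most probable class index for each embedding."""
    return np.argmax(embeddings @ w + b, axis=1)
```

Only `w` and `b` are trained here, so the cost of adapting to a new task is small compared with pretraining; lightly fine-tuning the encoder, the other option the text mentions, would additionally update the encoder weights.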

3. Key Contributions

Unified Framework: CLM-X is the first foundation model to unify scRNA-seq and scATAC-seq analysis within a single Transformer architecture that natively supports RNA-only, ATAC-only, and paired inputs.
Flexible Tokenization: The harmonized token-value embedding scheme allows seamless integration of disparate data types without heavy manual filtering or dimensionality reduction that loses biological signal.
Efficient Pretraining: The stage-wise masked reconstruction strategy effectively leverages millions of unpaired cells (36M RNA, 2.8M ATAC) alongside a smaller set of paired cells (370k), overcoming the data scarcity bottleneck.
Superior Cross-Modal Translation: Unlike contrastive models, CLM-X learns generative mappings, enabling high-fidelity prediction of one modality from the other (e.g., predicting gene expression from chromatin accessibility).

4. Results

CLM-X was evaluated on 10 datasets (including PBMC and BMMC cohorts) across five downstream tasks, outperforming state-of-the-art methods (MultiVI, Multigrate, MIRA, scGPT, BABEL, etc.):

Batch Correction: Achieved the best balance between preserving biological structure (NMI) and removing batch effects (bASW), outperforming baselines by 5.9%–35.0% in overall scores.
Multimodal Integration: Produced fused embeddings with superior clustering agreement (ARI/NMI) and local neighborhood consistency (cLISI) compared to unimodal and other multimodal baselines.
Cross-Modal Translation:
- ATAC $\to$ RNA: Achieved the highest Pearson Correlation (PCC) and lowest RMSE across all datasets, accurately reconstructing quantitative gene expression.
- RNA $\to$ ATAC: Outperformed baselines in predicting chromatin accessibility, capturing both binary states and quantitative magnitudes.
Cell Type Annotation: Achieved the highest accuracy and Macro F1 scores (e.g., 90.38% Accuracy, 85.22% F1 on PBMC), particularly excelling at distinguishing rare or closely related cell types (e.g., IL1B+ Monocytes) where other models failed.
Perturbation Prediction: Demonstrated superior generalization to unseen genetic perturbations, achieving higher correlation with ground-truth differential expression profiles than GEARS and scGPT.
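
For readers unfamiliar with the metrics named above, the following sketch gives standard definitions of Pearson correlation, RMSE, and macro F1. How the paper aggregates them (per gene, per cell, or per dataset) is not stated in this summary, so the functions simply operate on flat arrays.

```python
import numpy as np


def pearson_r(pred, truth):
    """Pearson correlation coefficient between two 1-D arrays."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.corrcoef(pred, truth)[0, 1])


def rmse(pred, truth):
    """Root mean squared error between two arrays."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))


def macro_f1(pred, truth):
    """Unweighted mean of per-class F1 over the classes present in truth."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    scores = []
    for c in np.unique(truth):
        tp = np.sum((pred == c) & (truth == c))
        fp = np.sum((pred == c) & (truth != c))
        fn = np.sum((pred != c) & (truth == c))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))
```

Macro F1 weights every cell type equally regardless of how many cells it contains, which is why it can sit below accuracy when rare types are harder to classify, as in the PBMC figures quoted above.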

5. Significance

Paradigm Shift: CLM-X moves single-cell analysis from task-specific, modality-locked tools to a generalizable foundation model paradigm.
Biological Insight: By learning bidirectional transcriptional-epigenetic couplings, CLM-X provides a robust tool for inferring regulatory mechanisms and predicting cellular responses to perturbations.
Scalability: The architecture is designed to scale to the "million-cell" era of single-cell atlases, offering a unified solution for integrating heterogeneous data from diverse sources and technologies.
Future Potential: The framework sets a precedent for extending foundation models to additional modalities (proteomics, methylation) and dynamic cellular processes, potentially accelerating drug discovery and mechanistic biology.

In summary, CLM-X represents a significant advancement in computational biology, offering a flexible, scalable, and high-performance foundation model that unifies the analysis of two central single-cell modalities: transcriptomics and epigenomics.