CellBench-LS: Benchmark Evaluation of Single-cell Foundation Models for Low-supervision Scenarios

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a massive mystery: How do billions of tiny cells in our bodies work together to keep us alive?

In the past, detectives (scientists) used simple, reliable tools like magnifying glasses and notepads (traditional methods like PCA or UMAP) to sort these cells into groups. But recently, a new generation of "Super Detectives" has arrived. These are Single-Cell Foundation Models (SCFMs). Think of them as AI detectives that have read the entire library of human biology books before they even started their case. They are incredibly smart and have seen millions of cells.

However, there's a catch. These Super Detectives are used to having a huge team of assistants (labeled data) to help them. But in the real world, scientists often have to work alone with very few clues (low supervision). Do these Super Detectives actually work better when they have to guess without a manual? Or do the old-school methods still win?

This paper, CellBench-LS, is the ultimate Talent Show designed to find out.

The Talent Show Setup

The authors set up a competition with 10 contestants:

The Veterans: 3 classic, reliable methods (PCA, UMAP, and scVI). Think of them as the "Old Reliables" who have been doing this for years.
The Super Detectives: 7 new AI Foundation Models (like scGPT, Geneformer, CellPLM). These are the flashy, high-tech newcomers.

They put them through 5 different challenges (tasks) to see who performs best when they can't rely on a teacher standing next to them.

The 5 Challenges

The Sorting Hat (Cell Clustering):
- The Task: You dump a bag of mixed-up Lego bricks (cells) on the table. Can you sort them into piles of similar shapes without looking at the instruction manual?
- The Result: The Super Detectives generally won. Because they've "read" so many biology books, they have a better intuition for which cells belong together. The Old Reliables struggled a bit with the messy, complex piles.
The Noise Canceller (Batch Correction):
- The Task: Imagine taking photos of the same scene in a sunny park and a dark basement. The lighting (batch effects) makes the photos look totally different. Can you fix the photos so they look like they were taken in the same place?
- The Result: Again, the Super Detectives shined. They were better at ignoring the "bad lighting" (technical errors) and focusing on the actual subject (the biology).
The Name Tag (Cell Type Annotation):
- The Task: You have a few cells with name tags (e.g., "T-Cell"). Can you look at the other cells and guess their names based on just a few examples?
- The Result: Super Detectives crushed this. With just a tiny hint (few-shot learning), they could identify cell types much better than the Veterans. They understood the "vibe" of a T-Cell instantly.
The Photocopier (Gene Expression Reconstruction):
- The Task: You have a blurry, low-resolution photo of a cell's activity. Can you redraw it in high definition?
- The Result: Surprise! The Old Reliables won here. The AI models were so busy trying to be "smart" and find complex patterns that they sometimes overcomplicated things. The simple, direct math of the Veterans was actually better at just copying the data accurately. It's like how a simple sketch artist might capture a face better than a complex AI that tries to add too much artistic flair.
The Crystal Ball (Perturbation Prediction):
- The Task: If you poke a cell by switching one of its genes on or off (a perturbation), how will it react? Can you predict the future?
- The Result: The Super Detectives were the clear winners. They could predict how a cell's gene activity would change after the poke much better than the old methods.

The Big Takeaway

The paper concludes that there is no single "Best Detective."

If you need to sort, identify, or predict the future: Hire the Super Detectives (Foundation Models). They are powerful, though for naming cells and predicting perturbations they still need a little bit of training (fine-tuning on a few labeled examples) to get the job done right.
If you just need to copy data or keep things simple: Stick with the Old Reliables (Traditional Methods). They are faster, cheaper, and sometimes just more accurate for specific, straightforward jobs.

Why This Matters

Before this paper, scientists had little guidance on when these complex AI tools were actually worth using and when a simpler method would do the job just as well.

CellBench-LS is like a User Manual for the Future. It tells scientists: "Hey, if you are doing X, use AI. If you are doing Y, use the old math." This helps researchers stop wasting time and money, ensuring that the right tool is used for the right job to help us understand diseases and develop new cures.

In short: The AI revolution is here, but it doesn't mean we throw away our old tools. It means we finally know exactly when to use the robot and when to use the hammer.

1. Problem Statement

Single-cell foundation models (SCFMs), which combine transformer architectures with large-scale pretraining, have emerged as powerful tools for analyzing high-dimensional single-cell RNA sequencing (scRNA-seq) data. However, their practical utility in low-supervision (label-scarce) scenarios has not been systematically evaluated.

Key challenges identified include:

Generalization Gap: It is unclear whether SCFMs can generalize effectively to new datasets or tasks without extensive fine-tuning (zero-shot) or with minimal labels (few-shot).
Lack of Comprehensive Benchmarks: Existing benchmarks often focus on specific tasks (e.g., perturbation prediction) or multimodal integration, failing to systematically compare SCFMs against classical methods (e.g., PCA, UMAP, scVI) across a broad spectrum of downstream tasks under low-resource conditions.
Model Selection Dilemma: Practitioners lack guidance on when to choose a foundation model versus a traditional pipeline given specific data conditions and research goals.

2. Methodology: CellBench-LS Framework

The authors introduce CellBench-LS, a unified benchmarking framework designed to rigorously evaluate SCFMs and classical baselines under consistent conditions.

A. Models Evaluated

The framework compares 7 representative SCFMs against 3 classical baselines:

SCFMs: scGPT, Geneformer, LangCell, CellPLM, scMulan, scFoundation, and Nicheformer.
Classical Baselines: PCA, UMAP, and scVI (a deep generative model).

B. Datasets

Evaluation utilizes 13 diverse scRNA-seq datasets varying in scale (from ~12k to ~330k cells), tissue types (immune, pancreas, brain, liver, lung, etc.), and experimental conditions (general, batch correction, and perturbation).

C. Evaluation Protocols & Tasks

The benchmark assesses performance across five core tasks using two supervision settings:

Zero-Shot Tasks (No fine-tuning, frozen embeddings):
- Cell Clustering: Uses Louvain clustering on embeddings. Metrics: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Average Silhouette Width (ASW).
- Batch Correction: Uses Harmony for integration. Metrics: iLISI (batch mixing), 1-cLISI (biological purity), cASW, and 1-bASW.
Few-Shot Tasks (Lightweight fine-tuning of task-specific heads):
- Cell Type Annotation: MLP classifier with $k=1, 3, 5, 7, 9$ labeled cells per type. Metrics: Accuracy, Macro-F1, Precision, Recall.
- Gene Expression Reconstruction: Regression task predicting top 400 genes. Metrics: Mean Squared Error (MSE), Pearson Correlation.
- Perturbation Prediction: Predicting post-perturbation expression profiles. Metrics: Differential Expression Score (DES), Mean Absolute Error (MAE).
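The few-shot annotation protocol above depends on drawing $k$ labeled cells per cell type as a support set and treating the remaining cells as the query set. A minimal sketch of that sampling step follows, using only the Python standard library. The function name `sample_k_per_type`, the toy labels, and the rule for types with fewer than $k$ cells are assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import defaultdict


def sample_k_per_type(cell_types, k, seed=0):
    """Return sorted indices of k randomly chosen cells for each cell type.

    Assumption of this sketch: a type with fewer than k cells contributes
    all of its cells. The benchmark may treat rare types differently.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx, cell_type in enumerate(cell_types):
        by_type[cell_type].append(idx)

    support = []
    # Iterate types in sorted order so the draw depends only on the seed.
    for cell_type in sorted(by_type):
        indices = by_type[cell_type]
        support.extend(rng.sample(indices, min(k, len(indices))))
    return sorted(support)


# Toy labels (hypothetical): 6 T cells, 4 B cells, 2 NK cells.
cell_types = ["T cell"] * 6 + ["B cell"] * 4 + ["NK cell"] * 2

for k in (1, 3, 5, 7, 9):
    support = sample_k_per_type(cell_types, k, seed=0)
    support_set = set(support)
    query = [i for i in range(len(cell_types)) if i not in support_set]
    print(f"k={k}: {len(support)} support cells, {len(query)} query cells")
```

Fixing the seed makes each split repeatable, so every model can be trained and scored on the same support and query cells.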

D. Implementation Details

All models use standardized preprocessing (HVG selection, log-normalization).
Few-shot tasks use a unified MLP head architecture (ReLU, BatchNorm, Dropout) trained with the Adam optimizer.
Results are averaged over multiple random seeds so that reported scores do not depend on a single random initialization.
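As a rough illustration of the preprocessing named above, the sketch below applies log-normalization to a small count matrix and then keeps the most variable genes. It is a simplified stand-in: the target of 10,000 counts per cell is a common convention assumed here, ranking genes by plain variance is a simplification of the dispersion-based HVG procedures used in practice, and the function names and toy matrix are hypothetical.

```python
import math
from statistics import pvariance


def log_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        # Cells with no counts are left as all zeros.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in row])
    return normalized


def top_variance_genes(matrix, n_top):
    """Return sorted column indices of the n_top genes with the highest variance.

    Simplified stand-in for HVG selection (assumption of this sketch).
    """
    n_genes = len(matrix[0])
    variances = [pvariance([row[g] for row in matrix]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:n_top])


# Toy count matrix (hypothetical): 3 cells x 4 genes.
counts = [
    [10, 0, 5, 1],
    [0, 8, 5, 1],
    [12, 1, 6, 1],
]
normalized = log_normalize(counts)
selected = top_variance_genes(normalized, n_top=2)
print("selected gene indices:", selected)
```

Applying the same normalization and gene selection to every model's input is what lets differences in downstream scores be attributed to the models rather than to preprocessing.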

3. Key Contributions

First Comprehensive Low-Supervision Benchmark: CellBench-LS is the first framework to simultaneously evaluate SCFMs and classical methods across zero-shot and few-shot settings for five distinct single-cell tasks.
Stratified Performance Landscape: The study reveals that SCFMs do not universally outperform traditional methods; performance is highly task-dependent and dataset-dependent.
Practical Guidelines: The paper provides actionable recommendations for researchers on model selection based on task type (unsupervised vs. supervised) and data availability.
Identification of Bottlenecks: It highlights that current SCFMs struggle with domain generalization and specific tasks like precise gene expression reconstruction without adaptation.

4. Key Results

A. Zero-Shot Performance (Clustering & Batch Correction)

SCFMs Dominate: Foundation models (particularly CellPLM and Nicheformer) consistently outperform classical methods (PCA, UMAP) and scVI in clustering and batch correction. They achieve better biological coherence and batch mixing.
Classical Methods Lag: PCA and UMAP show limited expressiveness in heterogeneous datasets, while scVI, though better than linear methods, still trails behind SCFMs in robustness.
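To make the clustering comparison concrete, the sketch below computes the Adjusted Rand Index (ARI), one of the zero-shot clustering metrics, from two labelings of the same cells. It covers only the scoring step: the predicted cluster labels are assumed to come from a clustering of the frozen embeddings computed elsewhere, and the function name is illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two cells: treat the labelings as trivially identical.
        return 1.0

    # Count cells per (true type, predicted cluster) pair and per marginal.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every cell in one group.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Cluster IDs are arbitrary, so a relabeled but identical partition scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

Because ARI is corrected for chance, a random partition scores near 0 and a partition that systematically splits true cell types can score below 0, which makes values comparable across datasets with different numbers of cell types.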

B. Few-Shot Performance (Annotation, Reconstruction & Perturbation)

SCFMs Excel: In annotation and perturbation prediction, SCFMs significantly outperform traditional baselines even with very few labeled samples ($k=1$). Models like CellPLM and Nicheformer show superior sensitivity and generalization.
Reconstruction Exception: In Gene Expression Reconstruction, PCA surprisingly outperforms most SCFMs and even scVI. This suggests that while foundation models capture high-level biological structure, they may lose fine-grained quantitative details required for precise reconstruction without specific task-aligned pretraining objectives.
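The reconstruction comparison rests on two metrics that measure different things: Mean Squared Error penalizes errors in absolute expression level, while Pearson correlation rewards getting the relative pattern right. A minimal sketch of both on plain Python lists follows. The function names and toy vectors are illustrative assumptions, and how the benchmark aggregates these metrics over genes and cells is not specified here.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted expression."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_correlation(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Toy example (hypothetical values): the prediction follows the right
# pattern but at twice the scale, so correlation is high while MSE is not zero.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
print(mean_squared_error(observed, predicted))
print(pearson_correlation(observed, predicted))
```

Reporting both metrics matters for a result like the one above: a model could capture which genes are relatively high or low yet still miss their magnitudes, and only MSE would expose that.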

C. Model-Specific Insights

No "One-Size-Fits-All": No single SCFM achieves state-of-the-art performance across all tasks. For instance, CellPLM excels in clustering and annotation but is weaker at reconstruction.
Dataset Sensitivity: SCFMs exhibit high variance across datasets. A model performing well on PBMC data (e.g., scFoundation) may perform poorly on Pancreas data, indicating a lack of robust domain generalization.

5. Significance and Future Directions

Guidance for Practitioners: The study advises using classical methods (PCA/scVI) for unsupervised tasks like clustering or reconstruction when labels are unavailable, while recommending SCFMs for supervised tasks like cell type annotation or perturbation prediction where even minimal labels can unlock superior performance.
Call for Task-Aligned Pretraining: The authors argue that current SCFMs are limited by a mismatch between pretraining objectives (e.g., masked gene modeling) and downstream task requirements (e.g., cluster separability). Future models should incorporate task-aligned inductive biases (e.g., contrastive clustering losses) during pretraining.
Improving Domain Generalization: To make SCFMs reliable across diverse biological contexts, future research must focus on domain adaptation strategies and cross-dataset pretraining to mitigate sensitivity to batch effects and tissue-specific variations.

In conclusion, CellBench-LS establishes a critical standard for evaluating single-cell foundation models, demonstrating that while they offer transformative potential for low-supervision tasks, they are not yet a universal replacement for classical pipelines and require further architectural refinement to achieve robust, task-agnostic generalization.