Sparse Autoencoders Reveal Interpretable Features in Single-Cell Foundation Models

This paper demonstrates that training sparse autoencoders on hidden representations of single-cell foundation models reveals interpretable biological and technical features, enabling targeted interventions to reduce technical artifacts while preserving core biological signals.

Original authors: Pedrocchi, F., Barkmann, F., Joudaki, A., Boeva, V.

Published 2026-03-02

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a super-smart robot chef (the "Single-Cell Foundation Model") that has read millions of cookbooks about how cells work. This robot can tell you what kind of cell it's looking at, predict how a cell will react to a medicine, or even mix data from different kitchens (labs) together.

The problem? We don't know how the robot thinks. It's a "black box." You give it an ingredient (a cell), and it gives you a result, but if you ask, "Why did you decide that?" it just says, "I just know."

This paper is like hiring a detective to peek inside the robot's brain and map out its thoughts. Here's how they did it, using simple analogies:

1. The Detective Tool: The "Sparse Autoencoder" (SAE)

The researchers used a special tool called a Sparse Autoencoder. Think of the robot's brain as a giant, messy library where books are thrown everywhere.

  • The Problem: The books are mixed up. One shelf might have a book about "apples," but it's also mixed with books about "red things," "round things," and "things from New York." It's hard to tell what the robot actually cares about.

  • The Solution: The SAE is like a super-organized librarian. It takes that messy pile of books and sorts them into tiny, specific drawers.

    • Drawer A: Only contains books about "Apples."
    • Drawer B: Only contains books about "Redness."
    • Drawer C: Only contains books about "New York."

    By sorting the robot's thoughts this way, the researchers could see exactly which "drawer" (feature) the robot was using to make a decision.
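For readers who want to peek under the hood, the "librarian" above can be sketched in a few lines of NumPy. This is a toy with random, made-up weights (the real paper learns them by minimizing exactly this kind of loss on the foundation model's activations); the dimensions and coefficients here are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # model hidden size; the SAE has many more "drawers"

# Hypothetical weights; in practice these are trained, not sampled at random.
W_enc = rng.normal(0, 0.1, (d_sae, d_model))
b_enc = rng.normal(0, 0.1, d_sae)
W_dec = rng.normal(0, 0.1, (d_model, d_sae))
b_dec = np.zeros(d_model)

def encode(h):
    """Sort a messy activation into sparse feature activations (the drawers)."""
    return np.maximum(0.0, W_enc @ h + b_enc)  # ReLU keeps most drawers shut

def decode(f):
    """Reassemble the original activation from the open drawers."""
    return W_dec @ f + b_dec

def sae_loss(h, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that rewards keeping drawers shut."""
    f = encode(h)
    return np.sum((h - decode(f)) ** 2) + l1_coeff * np.sum(np.abs(f))

h = rng.normal(size=d_model)  # one cell's hidden representation
f = encode(h)
print("open drawers:", int(np.count_nonzero(f)), "of", d_sae)
```

Training drives the L1 term down, so each input ends up opening only a handful of drawers, and each drawer tends to specialize in one concept.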

2. What Did They Find Inside?

When they opened the drawers, they found two main types of thoughts:

  • The "Ingredient" Thoughts (Gene-Specific): These are like the robot thinking, "This is a tomato," or "This is a lot of sugar." The robot learned to recognize specific genes and how much of each is present, regardless of what kind of cell it is.
  • The "Recipe" Thoughts (Cell-Specific): These are the big-picture ideas. The robot learned to recognize "This is a T-Cell" or "This is a Cancer Cell." Interestingly, it didn't just use one "T-Cell" drawer. It used a combination of many drawers: some for "T-Cell markers," some for "immune system activity," and even some for "things that are not T-Cells" (like a negative space).

The Surprise: Even though the robot was trained on healthy cells, it developed a "disease detector" drawer. When they tested it on sick patients (COVID-19), this drawer lit up, showing the robot had learned to spot inflammation patterns it never explicitly saw during training!

3. The "Dirty Dishes" Problem (Technical Noise)

Here's the tricky part. The robot also learned to recognize how the data was collected, not just the biology.

  • Imagine two chefs: Chef A uses a red knife, and Chef B uses a blue knife.
  • The robot learned that "Red Knife" = "Chef A's Kitchen" and "Blue Knife" = "Chef B's Kitchen."
  • If you asked the robot to compare a cell from Chef A to a cell from Chef B, it might get confused and think they are different types of cells just because of the knife color (the "batch effect").

The researchers found specific drawers in the robot's brain that were dedicated entirely to "Red Knife" or "Blue Knife" thoughts.
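How would you spot a "knife colour" drawer? One simple approach (a sketch with fabricated data, not the paper's exact method) is to check whether a feature's activation tracks the batch label instead of the biology. Here feature 5 is deliberately rigged to fire only in one batch, and a crude effect-size score flags it:

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, d_sae = 200, 8

# Fabricated SAE activations for 200 cells from two "kitchens" (batches);
# drawer 5 is rigged to open only for batch 1, mimicking a batch-effect feature.
batch = rng.integers(0, 2, n_cells)
acts = np.abs(rng.normal(size=(n_cells, d_sae))) * 0.1
acts[:, 5] += batch * 1.0

def batch_features(acts, batch, threshold=1.0):
    """Flag drawers whose mean activation differs sharply between batches."""
    mean0 = acts[batch == 0].mean(axis=0)
    mean1 = acts[batch == 1].mean(axis=0)
    spread = acts.std(axis=0) + 1e-8
    score = np.abs(mean1 - mean0) / spread  # crude effect-size score
    return np.where(score > threshold)[0]

print("suspected batch drawers:", batch_features(acts, batch))
```

A drawer that lights up for "Chef A's kitchen" but carries no cell-type information is a candidate for the taping-shut trick in the next section.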

4. The Magic Trick: "Steering" the Robot

Once they found the "Red Knife" drawers, they did something cool called Steering.

  • Imagine the robot is about to make a decision. The researchers reached into the robot's brain, found the "Red Knife" drawer, and taped it shut.
  • The Result: The robot stopped caring about the knife color. It could now compare cells from Chef A and Chef B fairly, focusing only on the food (the biology).
  • This proved that the robot wasn't just guessing; it was actively relying on those specific "knife" thoughts when it made mistakes. Turning them off fixed the robot's behavior without retraining it from scratch.
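The taping-shut trick can be sketched concretely: encode the activation into drawers, work out how much the batch drawers contribute to it, and subtract that contribution before the model carries on. Again, the weights and the drawer IDs below are hypothetical placeholders, and this is one simple way to implement steering, not necessarily the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_sae = 16, 64

# Hypothetical learned SAE weights (see the librarian sketch earlier).
W_enc = rng.normal(0, 0.1, (d_sae, d_model))
W_dec = rng.normal(0, 0.1, (d_model, d_sae))

def steer(h, batch_feature_ids):
    """Tape the batch drawers shut: remove their decoded contribution from h."""
    f = np.maximum(0.0, W_enc @ h)                         # open the drawers
    batch_part = W_dec[:, batch_feature_ids] @ f[batch_feature_ids]
    return h - batch_part                                  # h minus the "knife" signal

h = rng.normal(size=d_model)                   # one cell's activation
h_clean = steer(h, batch_feature_ids=[3, 17, 42])  # hypothetical batch drawers
```

The steered activation `h_clean` is then fed back into the model in place of `h`, so the downstream layers never see the batch signal.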

Why Does This Matter?

  • Trust: We can finally see why these AI models make decisions. They aren't magic; they are learning specific patterns.
  • Control: We can fix their mistakes (like ignoring the "knife color") by manually turning off the wrong "drawers."
  • Better Models: This helps scientists build better robots for the future that understand biology deeply, not just statistics.

In a nutshell: The researchers took a mysterious AI brain, organized its chaotic thoughts into neat little categories, found out it was getting distracted by "kitchen tools" (technical noise), and showed us how to tape those distractions shut so the AI can focus on the real science.
