HEIMDALL: Disentangling tokenizer design for robust transfer in single-cell foundation models

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to teach a brilliant but very literal robot how to understand human biology. You have millions of readouts from individual cells (the "data"), and you want the robot to learn what makes a liver cell different from a brain cell, or how a cell reacts when you give it a medicine.

The problem is, cells don't speak "Robot." They speak "Genes." A cell is just a messy list of thousands of genes, some turned on loud, some turned off, some barely whispering.

This paper introduces a new tool called HEIMDALL (named after the all-seeing guardian of the Bifrost in Norse mythology). HEIMDALL isn't a new robot; it's a universal translator and a kitchen for building better robots.

Here is the story of what they found, explained simply:

1. The Problem: The "Translation" Mess

In the past, scientists built these biology robots (called Foundation Models) by trying different ways to translate cell data into a language the robot understands. They called this translation process "Tokenization."

Think of it like this:

The Cell: A bag of ingredients (flour, sugar, eggs, salt).
The Tokenizer: The recipe card that tells the robot how to list those ingredients.

Some scientists wrote the recipe card as: "1 cup flour, 1 cup sugar, 2 eggs." (Ordered by importance).
Others wrote: "Eggs, Salt, Flour, Sugar." (Ordered by where they sit in the kitchen).
Others just wrote: "Flour, Sugar, Eggs" (Random order).

For a long time, scientists thought the robot's brain (the architecture) was the most important part. They assumed if they made the brain bigger and smarter, it would figure out the recipe on its own.

HEIMDALL's Big Discovery: The robot's brain matters, but once the robot faces a new situation, the recipe card (the tokenizer) is what makes or breaks it.

2. The Kitchen: How HEIMDALL Works

HEIMDALL is a modular kitchen. Instead of building a whole new robot from scratch to test a new idea, HEIMDALL lets scientists swap out just the Recipe Card while keeping the robot's brain exactly the same.

They broke the "Recipe Card" down into three simple parts:

The Name Tag (Gene Identity): How does the robot know what "Flour" is? Does it know it's a grain? Does it know it's related to "Wheat"? (This is how they encode gene names).
The Quantity (Expression Encoding): How does the robot know how much "Flour" there is? Is it a cup? A pinch? A mountain? (This is how they encode gene activity).
The Order (Cell Construction): Does the robot read the list from left to right? Does it sort the ingredients by how much of them there are? Or does it just throw them in a pile?

3. The Experiments: When the Robot Gets Lost

The researchers tested these different recipe cards in four scenarios, from familiar to tough:

Scenario A: The Same Neighborhood (Training and Testing in the same place)
- The Result: It didn't matter much which recipe card you used. The robot did fine no matter what.
- Analogy: If you are a chef who only cooks in your home kitchen, you can use any recipe book and still make a good meal.
Scenario B: Moving to a New City (Cross-Tissue & Cross-Species)
- The Result: Suddenly, the recipe card became everything. If you used the wrong card, the robot failed miserably when trying to identify a brain cell after only learning about gut cells.
- Analogy: If you move to a new city where the ingredients are named differently or arranged differently, your old recipe book might say "Add 1 cup of 'Zorg'" when you actually need "1 cup of 'Flour'." The robot gets confused.
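- The Winner: For new tissues, the best recipe cards were the ones that sorted ingredients by how much of each there was (Expression Sorting) rather than listing them randomly. For new species, the card that described each ingredient by what it is made of (its protein sequence) was the only one that worked without a dictionary matching ingredient names between the two species.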
Scenario C: The Missing Ingredients (Gene Panel Shift)
- The Result: Sometimes the robot is tested on data where only some of the ingredients were measured, or a different set than it trained on (as in spatial transcriptomics).
- The Winner: The robot that used a recipe card that understood the relationships between ingredients (knowing that "Flour" and "Wheat" are cousins) did the best, even if it had never seen the specific test ingredients before.
Scenario D: The Reverse Puzzle (Reverse Perturbation)
- The Result: Instead of asking "What happens if I add salt?", they asked "I have a salty dish; what ingredient did I add?"
- The Winner: The robot needed a recipe card that clearly told it how much of each ingredient was present. If the recipe card ignored the quantities, the robot couldn't solve the puzzle.

4. The Big Takeaway

The paper concludes that there is no single "perfect" recipe card for every situation. However, the design of the recipe card is the most critical factor in whether a biology AI will be robust or fragile.

Old Way: "Let's build a bigger brain and hope it figures out the data."
New Way (HEIMDALL): "Let's build a smart brain, but first, let's make sure the recipe card (tokenizer) is designed to handle the specific challenges of the real world (new tissues, new species, missing data)."

Why This Matters

Imagine you are building a self-driving car. You could build a super-intelligent AI, but if you feed it the wrong map or the wrong traffic signs, it will crash.

This paper tells us that for AI in biology, the map (the tokenizer) is at least as important as the driver (the AI model), and more so once the road changes. By using HEIMDALL, scientists can now mix and match the best parts of different recipe cards to build robots that are truly ready for the real world, not just for the lab bench.

In short: Don't just make the brain bigger. Fix the way you feed it the data. That's the secret to making biology AI that actually works.

1. Problem Statement

Single-cell foundation models (scFMs) are emerging as powerful tools for analyzing single-cell RNA-sequencing (scRNA-seq) data, yet their performance is inconsistent, particularly when transferring to new biological contexts (e.g., different tissues, species, or gene panels).

The Core Issue: Unlike text or images, single-cell data lacks a canonical tokenization scheme. Cells are fundamentally unordered sets of genes with continuous expression values. Current scFMs adopt diverse, heuristic tokenization strategies that entangle biological assumptions with model architecture.
The Gap: Existing benchmarks compare fully pretrained models, making it impossible to attribute performance differences to specific causes (architecture vs. training data vs. tokenization). Consequently, there is no principled guidance on how to design tokenizers for robust generalization under distribution shifts.

2. Methodology: The HEIMDALL Framework

The authors introduce HEIMDALL, a modular framework designed to systematically dissect, evaluate, and redesign tokenizers in scFMs.

A. Modular Decomposition

HEIMDALL decomposes the tokenization process into three functional modules, abstracting the input pipeline into a sequence-based representation:

$F_G$ (Gene Identity Encoding): Encodes the identity of a gene.
- Options: Random initialization, ESM2 (protein sequence), Gene2vec (co-expression), GenePT (text description), HyenaDNA (DNA sequence).
$F_E$ (Expression Encoding): Encodes the continuous expression value of a gene.
- Options: No-op (zero vector), Continuous (MLP), Quantile binning, Integer binning, Autobinning.
$F_C$ (Cell Construction/Aggregation): Integrates $F_G$ and $F_E$ to assemble the final cell sequence. This is further subdivided into:
- ORDER: Defines the intrinsic ordering of gene tokens (e.g., expression sorting, chromosome sorting, random).
- SEQUENCE: Selects which genes to include and constructs the token sequence (e.g., truncation, weighted sampling).
- REDUCE: Combines identity and expression encodings (e.g., Sum, Identity).
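
The three modules above can be pictured as small composable functions. The following is a minimal sketch under stated assumptions, not the HEIMDALL package's API: every function name, the one-hot integer binning, and the toy gene names are hypothetical, and the gene encoder stands in only for the "random initialization" option.

```python
import random
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Cell = Dict[str, float]  # gene name -> expression value


def make_random_gene_encoder(genes: List[str], dim: int, seed: int = 0) -> Callable[[str], Vector]:
    """F_G stand-in: one fixed random vector per gene ("random initialization")."""
    rng = random.Random(seed)
    table = {g: [rng.gauss(0.0, 1.0) for _ in range(dim)] for g in genes}
    return lambda gene: table[gene]


def make_integer_bin_encoder(dim: int) -> Callable[[float], Vector]:
    """F_E stand-in: round the value to an integer bin in [0, dim - 1], then one-hot it."""
    def encode(value: float) -> Vector:
        bin_index = max(0, min(dim - 1, int(round(value))))
        vec = [0.0] * dim
        vec[bin_index] = 1.0
        return vec
    return encode


def build_cell_sequence(
    cell: Cell,
    f_g: Callable[[str], Vector],
    f_e: Callable[[float], Vector],
    max_len: int,
    order: str = "expression",
) -> List[Tuple[str, Vector]]:
    """F_C stand-in: ORDER the genes, build the SEQUENCE by truncation, REDUCE by summing."""
    genes = list(cell)
    if order == "expression":
        # ORDER: highest expression first; ties broken by gene name for determinism.
        genes.sort(key=lambda g: (-cell[g], g))
    elif order == "name":
        # Hypothetical placeholder for a fixed ordering such as chromosome position.
        genes.sort()
    else:
        raise ValueError(f"unknown order: {order}")
    genes = genes[:max_len]  # SEQUENCE: keep only the first max_len tokens
    # REDUCE: element-wise sum of the identity and expression encodings.
    return [(g, [a + b for a, b in zip(f_g(g), f_e(cell[g]))]) for g in genes]


cell = {"GENE_A": 3.0, "GENE_B": 0.0, "GENE_C": 5.0}
f_g = make_random_gene_encoder(list(cell), dim=8)
f_e = make_integer_bin_encoder(dim=8)
tokens = build_cell_sequence(cell, f_g, f_e, max_len=2)
print([gene for gene, _ in tokens])  # ['GENE_C', 'GENE_A']
```

Because each module is just a callable, swapping one (say, a different `f_g`) leaves the other two untouched, which is the property the decomposition is meant to provide.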

B. Experimental Design

Reimplementation: The authors reimplemented tokenizers from five leading scFMs (scGPT, Geneformer, scFoundation, scBERT, UCE) within the HEIMDALL framework, denoted with a -tok suffix.
Controlled Training: To isolate the effect of tokenization, all models were trained from scratch using a minimal transformer backbone with fixed hyperparameters, with no pretraining in the initial experiments. This eliminates confounding variables like model scale, pretraining objectives, and dataset size.
Ablation Studies: The framework allows for "mix-and-match" ablation, swapping specific modules (e.g., replacing a random $F_G$ with ESM2) to measure their individual contribution to performance.
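
One way to picture a mix-and-match ablation is as a set of configurations that each differ from a baseline in exactly one module. The sketch below is illustrative only: the `TokenizerConfig` class, the string labels, and the choice of baseline are assumptions, with the option names taken from the module lists above.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class TokenizerConfig:
    """Hypothetical description of one tokenizer: one choice per module."""
    f_g: str
    f_e: str
    order: str


# Option labels mirror the module lists in the text; the spellings are assumptions.
OPTIONS: Dict[str, List[str]] = {
    "f_g": ["random", "esm2", "gene2vec", "genept", "hyenadna"],
    "f_e": ["none", "continuous", "quantile_bin", "integer_bin", "autobin"],
    "order": ["expression", "chromosome", "random"],
}


def single_module_ablations(base: TokenizerConfig) -> List[TokenizerConfig]:
    """Return every config that differs from `base` in exactly one module."""
    variants = []
    for field_name, choices in OPTIONS.items():
        for choice in choices:
            if getattr(base, field_name) != choice:
                variants.append(replace(base, **{field_name: choice}))
    return variants


base = TokenizerConfig(f_g="random", f_e="none", order="random")
variants = single_module_ablations(base)
print(len(variants))  # 10: four F_G swaps, four F_E swaps, two ORDER swaps
```

Training the same backbone once per variant then attributes any change in a metric to the single module that was swapped.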

C. Evaluation Benchmarks

The framework was tested on four challenging downstream tasks representing distribution shifts:

Cross-Tissue Generalization: Train on colon/small intestine, test on brain.
Cross-Species Generalization: Train on human, test on mouse (without fine-tuning).
Gene-Panel Shift (Spatial Transcriptomics): Train on one gene panel, test on a disjoint or partially overlapping panel.
Reverse Perturbation Prediction: Infer the perturbation (gene knockout) given a target cell state (paired-cell task).
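
Two of these shifts come down to how the data are partitioned. As a rough sketch, assuming cells are plain dictionaries with a `tissue` field (a hypothetical layout, not the paper's data format), a cross-tissue split and a description of gene-panel overlap might look like this:

```python
from typing import Dict, List, Set, Tuple


def split_by_tissue(
    cells: List[dict], train_tissues: Set[str], test_tissues: Set[str]
) -> Tuple[List[dict], List[dict]]:
    """Hold out whole tissues so that no test tissue is seen during training."""
    if train_tissues & test_tissues:
        raise ValueError("train and test tissues must not overlap")
    train = [c for c in cells if c["tissue"] in train_tissues]
    test = [c for c in cells if c["tissue"] in test_tissues]
    return train, test


def describe_panel_shift(train_panel: Set[str], test_panel: Set[str]) -> Dict[str, object]:
    """Summarize how much a test gene panel overlaps the training panel."""
    shared = train_panel & test_panel
    union = train_panel | test_panel
    return {
        "shared": sorted(shared),
        "test_only": sorted(test_panel - train_panel),  # genes never seen in training
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy cells; the identifiers and panels are made up for illustration.
cells = [
    {"id": "c1", "tissue": "colon"},
    {"id": "c2", "tissue": "small_intestine"},
    {"id": "c3", "tissue": "brain"},
]
train, test = split_by_tissue(cells, {"colon", "small_intestine"}, {"brain"})
shift = describe_panel_shift({"GENE_A", "GENE_B", "GENE_C"}, {"GENE_C", "GENE_D"})
print([c["id"] for c in train], [c["id"] for c in test])
print(shift)
```

A Jaccard value of 0 corresponds to fully disjoint panels, and values between 0 and 1 to partially overlapping ones.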

3. Key Results

A. Tokenization vs. Pretraining

In-Distribution: When training and test data match, tokenizer choice has minimal impact; performance is comparable across models and often matches a simple linear baseline.
Out-of-Distribution (Distribution Shift): Tokenizer design becomes the decisive factor for generalization. Pretraining (via Masked Language Modeling) provided only marginal benefits compared to the impact of the tokenizer design.

B. Task-Specific Findings

Cross-Tissue Generalization:
- Key Driver: The ORDER module.
- Finding: Geneformer-tok performed best primarily because it uses expression-based sorting (ordering genes by expression level). This implicitly injects expression information into the token sequence, compensating for the lack of an explicit expression encoder ($F_E$).
Cross-Species Generalization:
- Key Driver: The $F_G$ (Gene Identity) module.
- Finding: UCE-tok (using ESM2 protein sequence embeddings) was the only model to perform well without orthology mapping because its gene identities are species-agnostic. When orthology mapping was applied to all models, those with stronger $F_E$ and $F_C$ components (like scBERT-tok) surpassed UCE-tok.
- Conclusion: A sequence-based $F_G$ is robust for species with poor annotation, but orthology mapping + strong encoders is superior when mappings exist.
Gene-Panel Shift (Spatial Transcriptomics):
- Key Driver: $F_G$ and $F_E$.
- Finding: scBERT-tok (using Gene2vec embeddings) significantly outperformed others. Gene2vec embeddings, derived from co-expression patterns, stabilized representations for genes appearing only in the test set. Continuous expression encoding ($F_E$) also provided consistent gains.
Reverse Perturbation Prediction:
- Key Driver: $F_E$ and ORDER.
- Finding: Models lacking explicit expression encodings (like UCE-tok) performed poorly. Adding any form of expression encoding ($F_E$) drastically improved performance. The best hybrid configuration combined scBERT-tok's expression encoding with Geneformer-tok's expression sorting.
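
The cross-species finding above contrasts tokenizers evaluated with and without orthology mapping. As a minimal sketch of what such a mapping does before tokenization, the function below renames mouse genes to human orthologs and drops genes with no match. The tiny lookup table is a toy, `MouseOnlyGene1` is a made-up name, and summing many-to-one matches is an assumption; the summary does not describe the procedure the authors used.

```python
from typing import Dict, List, Tuple


def map_to_reference_species(
    cell: Dict[str, float], orthologs: Dict[str, str]
) -> Tuple[Dict[str, float], List[str]]:
    """Rename genes via an ortholog table; return the mapped cell and the dropped genes."""
    mapped: Dict[str, float] = {}
    dropped: List[str] = []
    for gene, value in cell.items():
        target = orthologs.get(gene)
        if target is None:
            dropped.append(gene)  # no known ortholog: a name-based vocabulary cannot represent it
        else:
            # Assumption: if several source genes map to one target, sum their values.
            mapped[target] = mapped.get(target, 0.0) + value
    return mapped, dropped


# Toy mouse-to-human table with three well-known pairs.
mouse_to_human = {"Actb": "ACTB", "Gapdh": "GAPDH", "Trp53": "TP53"}
mouse_cell = {"Actb": 10.0, "Trp53": 2.0, "MouseOnlyGene1": 4.0}

human_like_cell, unmapped = map_to_reference_species(mouse_cell, mouse_to_human)
print(human_like_cell)  # {'ACTB': 10.0, 'TP53': 2.0}
print(unmapped)         # ['MouseOnlyGene1']
```

A tokenizer whose gene identities come from protein sequence can skip this step entirely, which is consistent with UCE-tok being the only model reported to work without it.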

C. Hybrid Tokenizers

The study demonstrated that hybrid tokenizers (combining the best modules from different models) often outperform any single existing strategy. For example, combining ESM2 ($F_G$), continuous encoding ($F_E$), and expression sorting (ORDER) yielded state-of-the-art results across multiple tasks.

4. Key Contributions

HEIMDALL Framework: A unified, open-source Python package that decouples tokenizer design from model architecture, enabling systematic dissection of scFMs.
Disentangling Design Axes: The paper identifies that Gene Identity ($F_G$), Expression Encoding ($F_E$), and Ordering (ORDER) are the three critical design axes determining transferability, rather than model scale or architecture alone.
Performance Insights:
- Revealed that scBERT-tok (often ranked lower in prior benchmarks) actually possesses superior tokenization components for transfer learning.
- Showed that pretraining offers only marginal gains compared to principled tokenizer design.
Design Principles: Established that robust transfer requires exposing specific biological priors (e.g., co-expression via Gene2vec, protein sequence via ESM2, or expression magnitude via sorting) to the model.

5. Significance

Paradigm Shift: The paper argues that the "universal" transferability of scFMs is constrained by the "non-universal" tokenizer interface. To build truly generalizable models, the community must move away from heuristic tokenization toward principled, modular design.
Practical Guidance: Provides concrete recommendations for users and developers:
- Use sequence-based embeddings (ESM2) for cross-species tasks with unknown orthologs.
- Use co-expression embeddings (Gene2vec) for gene-panel shifts.
- Use expression sorting to implicitly encode magnitude without complex encoders.
Future-Proofing: The modular framework facilitates the integration of new modalities (e.g., epigenomics, proteomics) into the "virtual cell" concept, ensuring that future foundation models can incorporate diverse biological signals coherently.

In summary, HEIMDALL establishes that tokenizer design is the critical bottleneck for single-cell foundation models. By treating tokenization as a modular, testable component rather than a fixed preprocessing step, the authors provide a roadmap for building more robust, generalizable, and biologically grounded AI models for single-cell biology.