HEIMDALL: Disentangling tokenizer design for robust transfer in single-cell foundation models

The paper introduces HEIMDALL, a unified framework that disentangles single-cell foundation model tokenization strategies to demonstrate that robust generalization across diverse biological distribution shifts depends on specific design choices regarding gene identity, expression encoding, and ordering rather than a single universal tokenizer.

Haber, E., Alam, S., Ho, N., Liu, R., Trop, E., Liang, S., Yang, M., Krieger, S., Ma, J.

Published 2026-04-12

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to teach a brilliant but very literal robot how to understand human biology. You have measurements from millions of cells (the "data"), and you want the robot to learn what makes a liver cell different from a brain cell, or how a cell reacts when you give it a medicine.

The problem is, cells don't speak "Robot." They speak "Genes." A cell is just a messy list of thousands of genes, some turned on loud, some turned off, some barely whispering.

This paper introduces a new tool called HEIMDALL (named after the all-seeing guardian of the Bifrost in Norse mythology). HEIMDALL isn't a new robot; it's a universal translator and a kitchen for building better robots.

Here is the story of what they found, explained simply:

1. The Problem: The "Translation" Mess

In the past, scientists built these biology robots (called Foundation Models) by trying different ways to translate cell data into a language the robot understands. They called this translation process "Tokenization."

Think of it like this:

  • The Cell: A bag of ingredients (flour, sugar, eggs, salt).
  • The Tokenizer: The recipe card that tells the robot how to list those ingredients.

Some scientists wrote the recipe card as: "1 cup flour, 1 cup sugar, 2 eggs." (Ordered by importance).
Others wrote: "Eggs, Salt, Flour, Sugar." (Ordered by where they sit in the kitchen).
Others just wrote: "Flour, Sugar, Eggs" (Random order).

For a long time, scientists thought the robot's brain (the architecture) was the most important part. They assumed if they made the brain bigger and smarter, it would figure out the recipe on its own.

HEIMDALL's Big Discovery: The robot's brain matters, but the recipe card (the tokenizer) matters at least as much, and most of all when the robot faces a new situation.

2. The Kitchen: How HEIMDALL Works

HEIMDALL is a modular kitchen. Instead of building a whole new robot from scratch to test a new idea, HEIMDALL lets scientists swap out just the Recipe Card while keeping the robot's brain exactly the same.

They broke the "Recipe Card" down into three simple parts:

  1. The Name Tag (Gene Identity): How does the robot know what "Flour" is? Does it know it's a grain? Does it know it's related to "Wheat"? (This is how they encode gene names).
  2. The Quantity (Expression Encoding): How does the robot know how much "Flour" there is? Is it a cup? A pinch? A mountain? (This is how they encode gene activity).
  3. The Order (Cell Construction): Does the robot read the list from left to right? Does it sort the ingredients by how much of them there are? Or does it just throw them in a pile?
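
To make the three parts concrete, here is a minimal Python sketch of a "swappable recipe card." It is not HEIMDALL's actual code or API: every function name, the log-binning rule, and the toy gene counts are assumptions chosen for illustration. The point is only that each part can be replaced on its own while the rest stays fixed.

```python
import math
from typing import Callable, Dict, List, Tuple

Cell = Dict[str, float]  # gene name -> raw expression count (toy data)

def identity_by_vocab(vocab: Dict[str, int]) -> Callable[[str], int]:
    """Name tag: look each gene up in a fixed vocabulary (0 means unknown)."""
    return lambda gene: vocab.get(gene, 0)

def encode_by_log_bin(n_bins: int = 5) -> Callable[[float], int]:
    """Quantity: log-transform the count, then clip it into a few bins."""
    return lambda count: min(n_bins - 1, int(math.log1p(count)))

def order_by_expression(cell: Cell) -> List[str]:
    """Order: most highly expressed genes first (ties broken by name)."""
    return sorted(cell, key=lambda g: (-cell[g], g))

def order_by_name(cell: Cell) -> List[str]:
    """Order: a fixed alphabetical order that ignores expression."""
    return sorted(cell)

def tokenize(
    cell: Cell,
    identity: Callable[[str], int],
    encode: Callable[[float], int],
    order: Callable[[Cell], List[str]],
) -> List[Tuple[int, int]]:
    """Combine the three parts into a list of (gene id, expression bin) tokens."""
    return [(identity(g), encode(cell[g])) for g in order(cell)]

# Illustrative vocabulary and cell; the counts are made up.
vocab = {"ALB": 1, "GFAP": 2, "ACTB": 3}
cell = {"ALB": 120.0, "GFAP": 0.0, "ACTB": 30.0}

sorted_tokens = tokenize(cell, identity_by_vocab(vocab), encode_by_log_bin(), order_by_expression)
fixed_tokens = tokenize(cell, identity_by_vocab(vocab), encode_by_log_bin(), order_by_name)
print(sorted_tokens)  # [(1, 4), (3, 3), (2, 0)]
print(fixed_tokens)   # [(3, 3), (1, 4), (2, 0)]
```

Only the `order` argument changed between the two calls, so any difference in what the robot learns downstream could be traced to ordering alone. That is the kind of controlled comparison the modular kitchen is meant to allow.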

3. The Experiments: When the Robot Gets Lost

The researchers tested these different recipe cards in four tough scenarios:

  • Scenario A: The Same Neighborhood (Training and Testing in the same place)

    • The Result: It didn't matter much which recipe card you used. The robot did fine no matter what.
    • Analogy: If you are a chef who only cooks in your home kitchen, you can use any recipe book and still make a good meal.
  • Scenario B: Moving to a New City (Cross-Tissue & Cross-Species)

    • The Result: Suddenly, the recipe card became everything. If you used the wrong card, the robot failed miserably when trying to identify a brain cell after only learning about gut cells.
    • Analogy: If you move to a new city where the ingredients are named differently or arranged differently, your old recipe book might say "Add 1 cup of 'Zorg'" when you actually need "1 cup of 'Flour'." The robot gets confused.
    • The Winner: The best recipe cards were the ones that sorted ingredients by how much of them there were (Expression Sorting), rather than just listing them randomly.
  • Scenario C: The Missing Ingredients (Gene Panel Shift)

    • The Result: Sometimes, the robot is tested on data where only a subset of the ingredients was measured (as in spatial transcriptomics, which often profiles a limited panel of genes).
    • The Winner: The robot that used a recipe card that understood the relationships between ingredients (knowing that "Flour" and "Wheat" are cousins) did the best, even if it had never seen the specific test ingredients before.
  • Scenario D: The Reverse Puzzle (Reverse Perturbation)

    • The Result: Instead of asking "What happens if I add salt?", they asked "I have a salty dish; what ingredient did I add?"
    • The Winner: The robot needed a recipe card that clearly told it how much of each ingredient was present. If the recipe card ignored the quantities, the robot couldn't solve the puzzle.
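
The reverse-puzzle point can be shown with a toy Python example. It is a sketch under stated assumptions, not the paper's benchmark: both tokenizers, the level thresholds, and the genes `GENE_A` and `GENE_B` are hypothetical. If a recipe card drops the quantities, a cell before and after a knockdown can turn into exactly the same tokens, so no model reading those tokens could tell what was changed.

```python
from typing import Dict, List, Tuple

Cell = Dict[str, float]  # gene name -> expression count (toy data)

def tokens_presence_only(cell: Cell) -> List[str]:
    """Hypothetical tokenizer: keeps which genes are on, drops how much."""
    return sorted(g for g, x in cell.items() if x > 0)

def tokens_with_quantity(cell: Cell) -> List[Tuple[str, int]]:
    """Hypothetical tokenizer: keeps a coarse expression level per gene."""
    def level(x: float) -> int:
        # Assumed thresholds: low (<10), medium (<100), high (otherwise).
        return 0 if x < 10 else 1 if x < 100 else 2
    return sorted((g, level(x)) for g, x in cell.items() if x > 0)

control = {"GENE_A": 50.0, "GENE_B": 50.0}
perturbed = {"GENE_A": 50.0, "GENE_B": 5.0}  # toy knockdown of GENE_B

# Without quantities the two cells are indistinguishable.
same_without_quantity = tokens_presence_only(control) == tokens_presence_only(perturbed)
# With quantities the drop in GENE_B is visible in the tokens.
same_with_quantity = tokens_with_quantity(control) == tokens_with_quantity(perturbed)

print(same_without_quantity)  # True
print(same_with_quantity)     # False
```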

4. The Big Takeaway

The paper concludes that there is no single "perfect" recipe card for every situation. However, the design of the recipe card is one of the most critical factors in whether a biology AI will be robust or fragile once it leaves familiar data.

  • Old Way: "Let's build a bigger brain and hope it figures out the data."
  • New Way (HEIMDALL): "Let's build a smart brain, but first, let's make sure the recipe card (tokenizer) is designed to handle the specific challenges of the real world (new tissues, new species, missing data)."

Why This Matters

Imagine you are building a self-driving car. You could build a super-intelligent AI, but if you feed it the wrong map or the wrong traffic signs, it will crash.

This paper tells us that for AI in biology, the map (the tokenizer) is just as important as the driver (the AI model). By using HEIMDALL, scientists can now mix and match the best parts of different recipe cards to build robots that are truly ready for the real world, not just for the lab bench.

In short: Don't just make the brain bigger. Fix the way you feed it the data. That's the secret to making biology AI that actually works.
