What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses

Through an autonomous, AI-driven screening of 141 hypotheses, this study shows that biological foundation models such as scGPT and Geneformer learn genuine, shared geometric and topological structure in their internal representations. That structure is biologically meaningful, but it is more localized to specific tissues, such as immune cells, than previously assumed.

Ihor Kendiukhov


Imagine you have two different cartographers (AI models) who have never met, never shared a map, and were trained in different cities. They are both trying to draw a map of a mysterious, invisible city called "The Cell." In this city, the buildings are genes, and the roads between them represent how genes talk to each other to keep a living organism alive.

The big question this paper asks is: Do these cartographers actually understand the city's layout, or are they just drawing random scribbles that happen to look like a map?

To find out, the author didn't just ask one question. They built a robot "detective" that ran 141 different tests (hypotheses) to see what kind of shape the AI's internal map actually has.

Here is what they found, explained simply:

1. The Maps Agree on the "Shape," But Not the "Street Addresses"

The most surprising discovery is that the two AI models (scGPT and Geneformer), despite being trained separately, drew maps that looked geometrically identical.

  • The Analogy: Imagine two people independently drawing a map of New York City. They both agree that Central Park is in the middle, the Hudson River is on the west, and the bridges connect specific boroughs. They agree on the shape of the city.
  • The Catch: However, if you ask them, "What is the exact street address of the Empire State Building?" they give you different coordinates. They know the relationships (genes are close to each other), but they don't agree on the precise location of individual genes.
  • Takeaway: The AI has learned the "skeleton" of biology, but not a pixel-perfect translation of every single gene. A sketch of how this shape-versus-coordinates comparison can be run follows this list.
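
How would you actually check that two maps agree on shape but not on street addresses? The paper's exact metrics aren't reproduced in this summary, but a standard way to ask the question is representational similarity analysis: instead of comparing raw coordinates, compare the pairwise distances each model assigns between genes. Here is a minimal Python sketch of that idea; the embeddings are random stand-ins, not real scGPT or Geneformer vectors.

```python
# A minimal sketch of "same shape, different coordinates".
# The embeddings here are synthetic stand-ins, not actual
# scGPT/Geneformer vectors; the comparison logic is the point.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_genes = 200

# Pretend embeddings from two models: same underlying geometry,
# but rotated and rescaled so raw coordinates disagree.
base = rng.normal(size=(n_genes, 32))
rotation = np.linalg.qr(rng.normal(size=(32, 32)))[0]
emb_a = base
emb_b = 3.0 * base @ rotation + rng.normal(scale=0.1, size=(n_genes, 32))

# Coordinate-level agreement: correlate matched coordinates directly.
coord_rho, _ = spearmanr(emb_a.ravel(), emb_b.ravel())

# Shape-level agreement: correlate the two pairwise-distance profiles.
shape_rho, _ = spearmanr(pdist(emb_a), pdist(emb_b))

print(f"coordinate agreement: {coord_rho:.2f}")  # near zero after rotation
print(f"shape agreement:      {shape_rho:.2f}")  # close to 1
```

Rotating and rescaling a map scrambles every coordinate but leaves the distance structure intact, which is exactly the "same shape, different addresses" situation the paper describes.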

2. The City Has "Loops" (Topology)

The researchers looked for "loops" in the map—like a circular road where you can drive from A to B to C and back to A. In biology, these loops often represent feedback systems (Gene A turns on Gene B, which turns on Gene C, which turns off Gene A).

  • The Finding: The AI maps definitely have these loops. They aren't just flat lines; they have complex, circular structures that match real biological feedback.
  • The Warning: These loops are fragile. If you shuffle the neighborhood slightly (like moving a few houses around), the loops disappear. The AI learned the structure, but it is very sensitive to the specific details of the data it saw. A sketch of how such loops are actually detected follows this list.
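
For the curious: "loops" in a point cloud are what topologists call H1 features, and the standard tool for finding them is persistent homology. The sketch below uses the ripser library on a synthetic noisy circle; the paper's actual TDA pipeline isn't named in this summary, so treat the specific tooling as an assumption.

```python
# A minimal sketch of detecting "loops" (H1 features) with persistent
# homology, using ripser as a stand-in for whatever TDA tooling the
# paper actually used. The data is a synthetic noisy circle.
import numpy as np
from ripser import ripser  # pip install ripser

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=300)
points = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.1, size=(300, 2))

# Persistence diagrams up to dimension 1 (connected components + loops).
dgms = ripser(points, maxdim=1)["dgms"]

# A long-lived H1 feature (death far above birth) is evidence of a real
# loop; short-lived ones are noise.
h1 = dgms[1]
lifetimes = h1[:, 1] - h1[:, 0]
print(f"most persistent loop lifetime: {lifetimes.max():.2f}")
```

The fragility the paper reports would show up here as the longest lifetime collapsing toward the noise level once the points are perturbed or shuffled.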

3. "Straight Lines" Are Wrong; "Curved Roads" Are Right

In a normal map, the shortest distance between two points is a straight line. But in the AI's biological map, the shortest path between two interacting genes is often a curved road that winds through the "neighborhood."

  • The Analogy: Imagine trying to walk from your house to a friend's house. A straight line might take you through a wall or a swamp. The AI realized you have to walk along the sidewalk, following the curves of the neighborhood to get there efficiently.
  • The Result: When the researchers measured distances along these "curved roads" (manifold distances), those distances predicted which genes interact much better than straight-line distance did. The sketch after this list shows the difference on a toy example.
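
The "curved road" is what manifold-learning methods call a geodesic distance: connect each point to its nearest neighbors, then measure shortest paths along that graph instead of cutting straight through space. A minimal sketch on a synthetic Swiss roll, standing in for the real gene embeddings:

```python
# A minimal sketch of "curved road" (geodesic) distances versus
# straight-line (Euclidean) distances, on a synthetic Swiss roll.
# The real study would use gene embeddings; the distance logic is the same.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

X, _ = make_swiss_roll(n_samples=400, noise=0.1, random_state=0)

# Straight-line distance: free to "tunnel" through the roll.
euclid = squareform(pdist(X))

# Geodesic distance: shortest path along a k-nearest-neighbor graph,
# so it has to follow the surface of the manifold.
knn = kneighbors_graph(X, n_neighbors=8, mode="distance")
geodesic = shortest_path(knn, method="D", directed=False)

i, j = 0, 200  # two arbitrary points on the roll
print(f"Euclidean: {euclid[i, j]:.1f}, geodesic: {geodesic[i, j]:.1f}")
```

The geodesic number is typically much larger, because the walk has to wind along the sheet rather than punch through it, just as the sidewalk route beats walking through a wall.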

4. The "Good" vs. The "Bad" Neighborhoods

The researchers tested if the AI could tell the difference between genes that work together (activation) and genes that fight each other (repression).

  • The Winner: The AI is surprisingly good at this. It groups genes into "communities" (neighborhoods). If Gene A activates Gene B, they tend to live in the same neighborhood. If Gene A suppresses Gene B, the AI places them in slightly different, distinguishable spots within that neighborhood.
  • The Twist: This worked best in immune system data. When they tested it on lung data, the signal got weak and fuzzy.
  • Why? The immune system is like a highly organized military with distinct, modular units (T-cells, B-cells), making it easy to map. The lung is more like a messy, open-plan office where everything blends together. We also simply have better "instruction manuals" (data) for the immune system than for the lung. A community-detection sketch follows this list.
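
Finding "neighborhoods" in an embedding space is a community-detection problem. The summary doesn't say which clustering method the paper used, so the sketch below shows the generic recipe on synthetic data: build a nearest-neighbor graph over gene embeddings, then extract modularity-based communities with networkx.

```python
# A minimal sketch of finding gene "neighborhoods" (communities) in an
# embedding space. The embeddings are synthetic clusters, not real
# model outputs, and the clustering method is a generic stand-in.
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
# Three synthetic "regulatory modules", 50 genes each, at different centers.
emb = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 16))
                 for c in (0.0, 3.0, 6.0)])

# Connect each gene to its nearest neighbors in embedding space.
adj = kneighbors_graph(emb, n_neighbors=10, mode="connectivity")
graph = nx.from_scipy_sparse_array(adj)

# Modularity-based community detection recovers the three modules.
communities = nx.community.greedy_modularity_communities(graph)
print(f"found {len(communities)} communities, "
      f"sizes: {[len(c) for c in communities]}")
```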

5. The "Robot Detective" Found a Lot of Nothing (And That's Good!)

This is the most important part of the paper. The robot tested 141 ideas.

  • 40 looked promising at first.
  • 27 survived a basic check.
  • Only about 10 survived the "Strict Max-Null Audit" (the hardest, most skeptical test possible).

The Lesson: Many things looked like biological magic until the researchers applied a stricter filter. For example, some patterns looked great until it turned out the AI was just picking up on how often genes are turned on together (co-expression), not on deep regulatory rules. The sketch below shows the max-null idea in miniature.
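
The summary doesn't spell out the "Strict Max-Null Audit," but the classic version of the idea is a max-statistic permutation test: with 141 hypotheses in play, each observed score must beat the distribution of the best score across all hypotheses under shuffled data, so lucky outliers don't survive. A minimal synthetic sketch of that logic (my construction, not the paper's code):

```python
# A minimal sketch of a "max-null" style audit. With 141 tries,
# something will always look good by chance; comparing each score
# against the distribution of the MAXIMUM score under shuffling
# controls for that. Synthetic data throughout.
import numpy as np

rng = np.random.default_rng(0)
n_hyp, n_obs, n_perm = 141, 100, 1000

# Each hypothesis scores the association between a label and a feature.
features = rng.normal(size=(n_hyp, n_obs))
labels = rng.normal(size=n_obs)
features[0] += 0.5 * labels  # plant exactly one real effect

observed = np.abs(features @ labels) / n_obs

# Max-null: for each permutation, shuffle the labels and record the
# single largest score across ALL hypotheses at once.
max_null = np.empty(n_perm)
for p in range(n_perm):
    shuffled = rng.permutation(labels)
    max_null[p] = np.abs(features @ shuffled).max() / n_obs

threshold = np.quantile(max_null, 0.95)
survivors = np.flatnonzero(observed > threshold)
print(f"hypotheses surviving the max-null audit: {survivors}")
```

On this toy run only the planted hypothesis clears the bar, while the 140 pure-noise hypotheses are filtered out, which mirrors the paper's attrition from 141 ideas down to a handful of survivors.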

The Bottom Line

Biological AI models do learn real, meaningful geometry. They understand the "shape" of how genes relate to one another, including loops, curved paths, and communities.

However, this understanding is not universal. It is:

  1. Fragile: It depends heavily on the specific tissue (it works great for immune cells, less so for lungs).
  2. Specific: It captures the relationships between genes, but not necessarily the exact identity of every single gene.
  3. Hard to Prove: You have to be extremely skeptical and use strict controls to separate real biological signals from statistical noise.

In short: The AI has a good sketch of the biological city, but it's not a perfect blueprint. And thanks to this study, we now know exactly where the sketch is accurate and where it's just a guess.