What Topological and Geometric Structure Do Biological Foundation Models Learn? Evidence from 141 Hypotheses

Through an autonomous, AI-driven screening of 141 hypotheses, this study shows that biological foundation models such as scGPT and Geneformer learn genuine, shared geometric and topological structure in their internal representations. That structure is biologically meaningful, but it is more localized to specific tissues, such as immune cells, than previously assumed.

Ihor Kendiukhov


Imagine you have two different cartographers (AI models) who have never met, never shared a map, and were trained in different cities. They are both trying to draw a map of a mysterious, invisible city called "The Cell." In this city, the buildings are genes, and the roads between them represent how genes talk to each other to keep a living organism alive.

The big question this paper asks is: Do these cartographers actually understand the city's layout, or are they just drawing random scribbles that happen to look like a map?

To find out, the author didn't just ask one question. They built a robot "detective" that ran 141 different tests (hypotheses) to see what kind of shape the AI's internal map actually has.

Here is what they found, explained simply:

1. The Maps Agree on the "Shape," But Not the "Street Addresses"

The most surprising discovery is that the two AI models (scGPT and Geneformer), despite being trained separately, drew maps that looked geometrically identical.

  • The Analogy: Imagine two people independently drawing a map of New York City. They both agree that Central Park is in the middle, the Hudson River is on the west, and the bridges connect specific boroughs. They agree on the shape of the city.
  • The Catch: However, if you ask them, "What is the exact street address of the Empire State Building?" they give you different coordinates. They know the relationships (genes are close to each other), but they don't agree on the precise location of individual genes.
  • Takeaway: The AI has learned the "skeleton" of biology, but not a pixel-perfect translation of every single gene. A sketch of how this shape-versus-coordinates comparison can be run follows this list.
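
How would you actually check that two maps agree on shape but not on street addresses? The paper's exact metrics aren't reproduced in this summary, but a standard way to ask the question is representational similarity analysis: instead of comparing raw coordinates, compare the pairwise distances each model assigns between genes. Here is a minimal Python sketch of that idea; the embeddings are random stand-ins, not real scGPT or Geneformer vectors.

```python
# A minimal sketch of "same shape, different coordinates".
# The embeddings here are synthetic stand-ins, not actual
# scGPT/Geneformer vectors; the comparison logic is the point.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_genes = 200

# Pretend embeddings from two models: same underlying geometry,
# but rotated and rescaled so raw coordinates disagree.
base = rng.normal(size=(n_genes, 32))
rotation = np.linalg.qr(rng.normal(size=(32, 32)))[0]
emb_a = base
emb_b = 3.0 * base @ rotation + rng.normal(scale=0.1, size=(n_genes, 32))

# Coordinate-level agreement: correlate matched coordinates directly.
coord_rho, _ = spearmanr(emb_a.ravel(), emb_b.ravel())

# Shape-level agreement: correlate the two pairwise-distance profiles.
shape_rho, _ = spearmanr(pdist(emb_a), pdist(emb_b))

print(f"coordinate agreement: {coord_rho:.2f}")  # near zero after rotation
print(f"shape agreement:      {shape_rho:.2f}")  # close to 1
```

Rotating and rescaling a map scrambles every coordinate but leaves the distance structure intact, which is exactly the "same shape, different addresses" situation the paper describes.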

2. The City Has "Loops" (Topology)

The researchers looked for "loops" in the map—like a circular road where you can drive from A to B to C and back to A. In biology, these loops often represent feedback systems (Gene A turns on Gene B, which turns on Gene C, which turns off Gene A).

  • The Finding: The AI maps definitely have these loops. They aren't just flat lines; they have complex, circular structures that match real biological feedback.
  • The Warning: These loops are fragile. If you shuffle the neighborhood slightly (like moving a few houses around), the loops disappear. The AI learned the structure, but it is very sensitive to the specific details of the data it saw. A sketch of how such loops are actually detected follows this list.
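
For the curious: "loops" in a point cloud are what topologists call H1 features, and the standard tool for finding them is persistent homology. The sketch below uses the ripser library on a synthetic noisy circle; the paper's actual TDA pipeline isn't named in this summary, so treat the specific tooling as an assumption.

```python
# A minimal sketch of detecting "loops" (H1 features) with persistent
# homology, using ripser as a stand-in for whatever TDA tooling the
# paper actually used. The data is a synthetic noisy circle.
import numpy as np
from ripser import ripser  # pip install ripser

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=300)
points = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.1, size=(300, 2))

# Persistence diagrams up to dimension 1 (connected components + loops).
dgms = ripser(points, maxdim=1)["dgms"]

# A long-lived H1 feature (death far above birth) is evidence of a real
# loop; short-lived ones are noise.
h1 = dgms[1]
lifetimes = h1[:, 1] - h1[:, 0]
print(f"most persistent loop lifetime: {lifetimes.max():.2f}")
```

The fragility the paper reports would show up here as the longest lifetime collapsing toward the noise level once the points are perturbed or shuffled.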

3. "Straight Lines" Are Wrong; "Curved Roads" Are Right

In a normal map, the shortest distance between two points is a straight line. But in the AI's biological map, the shortest path between two interacting genes is often a curved road that winds through the "neighborhood."

  • The Analogy: Imagine trying to walk from your house to a friend's house. A straight line might take you through a wall or a swamp. The AI realized you have to walk along the sidewalk, following the curves of the neighborhood to get there efficiently.
  • The Result: When the researchers measured distances along these "curved roads" (manifold distances), those distances predicted which genes interact much better than straight-line distance did. The sketch after this list shows the difference on a toy example.
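
The "curved road" is what manifold-learning methods call a geodesic distance: connect each point to its nearest neighbors, then measure shortest paths along that graph instead of cutting straight through space. A minimal sketch on a synthetic Swiss roll, standing in for the real gene embeddings:

```python
# A minimal sketch of "curved road" (geodesic) distances versus
# straight-line (Euclidean) distances, on a synthetic Swiss roll.
# The real study would use gene embeddings; the distance logic is the same.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

X, _ = make_swiss_roll(n_samples=400, noise=0.1, random_state=0)

# Straight-line distance: free to "tunnel" through the roll.
euclid = squareform(pdist(X))

# Geodesic distance: shortest path along a k-nearest-neighbor graph,
# so it has to follow the surface of the manifold.
knn = kneighbors_graph(X, n_neighbors=8, mode="distance")
geodesic = shortest_path(knn, method="D", directed=False)

i, j = 0, 200  # two arbitrary points on the roll
print(f"Euclidean: {euclid[i, j]:.1f}, geodesic: {geodesic[i, j]:.1f}")
```

The geodesic number is typically much larger, because the walk has to wind along the sheet rather than punch through it, just as the sidewalk route beats walking through a wall.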

4. The "Good" vs. The "Bad" Neighborhoods

The researchers tested if the AI could tell the difference between genes that work together (activation) and genes that fight each other (repression).

  • The Winner: The AI is surprisingly good at this. It groups genes into "communities" (neighborhoods). If Gene A activates Gene B, they tend to live in the same neighborhood. If Gene A suppresses Gene B, the AI places them in slightly different, distinguishable spots within that neighborhood.
  • The Twist: This worked best in immune system data. When they tested it on lung data, the signal got weak and fuzzy.
  • Why? The immune system is like a highly organized military with distinct, modular units (T-cells, B-cells), making it easy to map. The lung is more like a messy, open-plan office where everything blends together. We also simply have better "instruction manuals" (data) for the immune system than for the lung. A community-detection sketch follows this list.
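
Finding "neighborhoods" in an embedding space is a community-detection problem. The summary doesn't say which clustering method the paper used, so the sketch below shows the generic recipe on synthetic data: build a nearest-neighbor graph over gene embeddings, then extract modularity-based communities with networkx.

```python
# A minimal sketch of finding gene "neighborhoods" (communities) in an
# embedding space. The embeddings are synthetic clusters, not real
# model outputs, and the clustering method is a generic stand-in.
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
# Three synthetic "regulatory modules", 50 genes each, at different centers.
emb = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 16))
                 for c in (0.0, 3.0, 6.0)])

# Connect each gene to its nearest neighbors in embedding space.
adj = kneighbors_graph(emb, n_neighbors=10, mode="connectivity")
graph = nx.from_scipy_sparse_array(adj)

# Modularity-based community detection recovers the three modules.
communities = nx.community.greedy_modularity_communities(graph)
print(f"found {len(communities)} communities, "
      f"sizes: {[len(c) for c in communities]}")
```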

5. The "Robot Detective" Found a Lot of Nothing (And That's Good!)

This is the most important part of the paper. The robot tested 141 ideas.

  • 40 looked promising at first.
  • 27 survived a basic check.
  • Only about 10 survived the "Strict Max-Null Audit" (the hardest, most skeptical test possible).

The Lesson: Many things looked like biological magic until the researchers applied a stricter filter. For example, some patterns looked great until it turned out the AI was just picking up on how often genes are turned on together (co-expression), not on deep regulatory rules. The sketch below shows the max-null idea in miniature.
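
The summary doesn't spell out the "Strict Max-Null Audit," but the classic version of the idea is a max-statistic permutation test: with 141 hypotheses in play, each observed score must beat the distribution of the best score across all hypotheses under shuffled data, so lucky outliers don't survive. A minimal synthetic sketch of that logic (my construction, not the paper's code):

```python
# A minimal sketch of a "max-null" style audit. With 141 tries,
# something will always look good by chance; comparing each score
# against the distribution of the MAXIMUM score under shuffling
# controls for that. Synthetic data throughout.
import numpy as np

rng = np.random.default_rng(0)
n_hyp, n_obs, n_perm = 141, 100, 1000

# Each hypothesis scores the association between a label and a feature.
features = rng.normal(size=(n_hyp, n_obs))
labels = rng.normal(size=n_obs)
features[0] += 0.5 * labels  # plant exactly one real effect

observed = np.abs(features @ labels) / n_obs

# Max-null: for each permutation, shuffle the labels and record the
# single largest score across ALL hypotheses at once.
max_null = np.empty(n_perm)
for p in range(n_perm):
    shuffled = rng.permutation(labels)
    max_null[p] = np.abs(features @ shuffled).max() / n_obs

threshold = np.quantile(max_null, 0.95)
survivors = np.flatnonzero(observed > threshold)
print(f"hypotheses surviving the max-null audit: {survivors}")
```

On this toy run only the planted hypothesis clears the bar, while the 140 pure-noise hypotheses are filtered out, which mirrors the paper's attrition from 141 ideas down to a handful of survivors.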

The Bottom Line

Biological AI models do learn real, meaningful geometry. They understand the "shape" of how genes relate to one another, including loops, curved paths, and communities.

However, this understanding is not universal. It is:

  1. Fragile: It depends heavily on the specific tissue (it works great for immune cells, less so for lungs).
  2. Specific: It captures the relationships between genes, but not necessarily the exact identity of every single gene.
  3. Hard to Prove: You have to be extremely skeptical and use strict controls to separate real biological signals from statistical noise.

In short: The AI has a good sketch of the biological city, but it's not a perfect blueprint. And thanks to this study, we now know exactly where the sketch is accurate and where it's just a guess.