Benchmarking DNA Foundation Models: Biological Blind Spots in Evo2 Variant-Effect Prediction

This paper introduces a controlled benchmarking framework to evaluate DNA foundation models like Evo2, revealing systematic blind spots in their ability to capture essential biological signals and challenging their current readiness for clinical variant-effect prediction.

Mathur, V., Sachidanandam, R.

Published 2026-03-11

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a super-smart robot that has read almost every book in a massive library. This robot, called Evo2, is designed to understand the "language" of life (DNA). Its creators claim it can look at a tiny typo in a person's genetic code and instantly tell you if that typo will make them sick (pathogenic) or if it's harmless (benign). They say it does this without ever being explicitly taught which typos are bad; it just "knows" because it has read so much.

This paper is like a group of skeptical mechanics putting that robot through a series of stress tests to see if it actually understands the rules of the game, or if it's just guessing based on patterns it memorized.

Here is what they found, explained with some everyday analogies:

1. The Robot Doesn't Know the "Grammar" of Life

The Test: In human language, we have synonyms (words that mean the same thing but are spelled differently). In DNA, there are "synonymous codons"—different three-letter codes that all mean the same amino acid. Nature has a preference for certain spellings over others, kind of like how a chef prefers a specific brand of salt. This is called Codon Usage Bias.
The Result: The robot failed this test. When asked to predict which "spelling" nature would use, it guessed almost randomly.
The Analogy: Imagine a robot that has read millions of cookbooks but, when asked to bake a cake, randomly picks ingredients. It doesn't realize that some ingredients are preferred by chefs. It knows the words, but it doesn't understand the flavor of the language.
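For readers curious what "measuring codon preference" looks like in practice, here is a minimal sketch (not the paper's actual code) of Relative Synonymous Codon Usage (RSCU), a standard way to quantify codon usage bias. The toy gene sequence and the two amino acids shown are illustrative assumptions.

```python
# Illustrative sketch (not the paper's method): quantifying codon usage
# bias via Relative Synonymous Codon Usage (RSCU).
from collections import Counter

# Synonymous codons for two example amino acids (standard genetic code)
SYNONYMS = {
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Lys": ["AAA", "AAG"],
}

def rscu(sequence):
    """RSCU = observed codon count / count expected if all synonyms
    were used equally. RSCU > 1 means the codon is preferred."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    counts = Counter(codons)
    scores = {}
    for aa, syns in SYNONYMS.items():
        total = sum(counts[c] for c in syns)
        if total == 0:
            continue
        expected = total / len(syns)
        for c in syns:
            scores[c] = counts[c] / expected
    return scores

# Toy coding sequence that heavily favors CTG for leucine and AAG for
# lysine -- the kind of "spelling preference" the paper tests Evo2 on.
toy_gene = "CTG" * 8 + "TTA" * 1 + "AAG" * 6 + "AAA" * 2
print(rscu(toy_gene))
```

A model that has internalized codon bias should assign higher likelihood to the preferred spellings; the paper reports that Evo2's preferences were close to random.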

2. The Robot is Easily Confused by "Where" Things Are

The Test: The researchers took a specific part of the DNA (a tRNA, which is like a tiny delivery truck for building proteins) and moved it to a completely different neighborhood in the genome. The truck itself was identical, but its surroundings changed.
The Result: The robot's opinion of the truck changed drastically! When the truck was in its original spot, the robot thought a specific part was broken. When moved to a new spot, the robot suddenly thought it was fine.
The Analogy: Imagine a security guard who decides if a person is a threat based entirely on which street they are standing on, rather than looking at the person's face. If you move the same person to a different street, the guard changes their mind. The robot is looking at the wrong clues.
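To make the "neighborhood effect" concrete, here is a toy stand-in for a context-sensitive model (this is not Evo2, and all sequences are made up): it scores an element by how familiar its short subsequences are in the surrounding DNA, so the identical element gets different scores in different contexts.

```python
# Illustrative sketch: a toy "model" whose score for a DNA element
# depends on the flanking context, so moving the same element to a new
# neighborhood changes the verdict -- the behavior the paper observed.

def context_score(element, context, k=3):
    """Score an element by the fraction of its k-mers (length-k
    substrings) that also occur in the surrounding context."""
    context_kmers = {context[i:i + k] for i in range(len(context) - k + 1)}
    kmers = [element[i:i + k] for i in range(len(element) - k + 1)]
    familiar = sum(1 for kmer in kmers if kmer in context_kmers)
    return familiar / len(kmers)

trna = "GCATTGGTGGTTCA"             # made-up stand-in for a tRNA
home = "GCATTGGTGG" * 3             # neighborhood sharing its k-mers
elsewhere = "TTTTCCCCAAAAGGGG" * 2  # unrelated neighborhood

print(context_score(trna, home))       # higher: element looks "normal"
print(context_score(trna, elsewhere))  # lower: same element looks odd
```

The element never changes between the two calls; only its surroundings do, yet the score moves. That is exactly the failure mode described above: judging the street instead of the person.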

3. The Robot Can't Tell "Real" DNA from "Fake" DNA

The Test: Sometimes, pieces of mitochondrial DNA (the power plants of our cells) get accidentally copied into the main nucleus of the cell. These are called NUMTs (nuclear copies of mitochondrial DNA). They are "ghost" copies—they look like real DNA but are broken and useless.
The Result: When the robot saw these ghost copies, it treated them as if they were real, working DNA. It couldn't tell the difference between the "real" power plant and the "fake" blueprint.
The Analogy: Imagine a robot trained to recognize real money. If you show it a perfect photocopy of a $20 bill, it thinks it's real money. It doesn't understand that a photocopy has no value, even if it looks identical.

4. The Robot Gets the Severity Backwards

The Test: The researchers asked the robot to predict how bad different mutations were.
The Result: The robot was surprisingly good at spotting mild, annoying typos, but it struggled the most with the most dangerous mutations—the ones that cause severe, life-threatening diseases.
The Analogy: Imagine a weather forecaster who is great at predicting a light drizzle but completely misses the hurricane. In medicine, missing the hurricane is the biggest problem.

5. The Robot is Good at Math, But Bad at Biology

The robot did get some things right. It understood that some types of DNA typos happen more often than others (like how "A" turning into "G" is more common than "A" turning into "C"). It's good at spotting statistical patterns.
However, it failed to understand the biological reasons behind those patterns. It's like a student who memorized the answers to a math test but doesn't understand why the formula works.
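The statistical pattern mentioned above has a name: single-base changes within the purines (A, G) or within the pyrimidines (C, T) are called transitions, and they occur more often than the cross-class transversions. A short sketch of the classification (illustrative, not from the paper):

```python
# Illustrative sketch: classifying single-base mutations as transitions
# (A<->G, C<->T) or transversions. Transitions are empirically more
# common -- the kind of statistical pattern the paper says Evo2 learns.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def mutation_class(ref, alt):
    """Return 'transition' if both bases are purines or both are
    pyrimidines, otherwise 'transversion'."""
    if ref in PURINES and alt in PURINES:
        return "transition"
    if ref in PYRIMIDINES and alt in PYRIMIDINES:
        return "transition"
    return "transversion"

# A -> G stays within the purines: a transition (the common case).
print(mutation_class("A", "G"))  # transition
# A -> C crosses from purine to pyrimidine: a transversion (rarer).
print(mutation_class("A", "C"))  # transversion
```

Counting which class a model favors is easy; the paper's point is that matching these counts does not imply understanding why transitions are chemically easier.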

The Bottom Line

The paper concludes that while Evo2 is an impressive piece of technology that can generate plausible DNA sequences and spot general patterns, it is not yet ready to be a doctor.

If you use this robot to diagnose a patient, it might miss the most dangerous conditions or get confused by harmless variations because it's looking at the wrong things (like the neighborhood instead of the person).

The Takeaway: We can't just feed a robot more data and hope it becomes a genius. To make these tools safe for hospitals, we need to teach them the actual rules of biology, not just let them guess based on patterns. They need a "biology teacher," not just a "library."
