EVEE: Interpretable variant effect prediction from genomic foundation model embeddings


This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine your DNA is a massive, ancient library containing the instruction manual for building and running a human being. Sometimes, a single letter in a book gets changed, a word is deleted, or a sentence is added. These changes are called genetic variants.

Most of the time, we don't know if these changes are harmless typos, helpful edits, or dangerous errors that cause disease. In the medical world, these unknowns are called "Variants of Uncertain Significance" (VUS), and they are a huge headache for doctors trying to diagnose patients.

This paper introduces a new tool called EVEE (Evo Variant Effect Explorer) that acts like a super-smart, bilingual librarian who can not only spot the errors but also explain why they are dangerous in plain English.

Here is how it works, broken down into simple concepts:

1. The "Super-Librarian" (Evo 2)

First, the researchers used a massive AI model called Evo 2. Think of Evo 2 as a librarian who has read books from every wing of the library of life, from bacteria to humans. Because it has seen so much DNA, it has learned the "grammar" and "style" of the genome. It can tell what a healthy sentence looks like and what a broken one looks like, even without being explicitly taught the rules.

2. The "Fingerprint Scanner" (The Covariance Probe)

A simple way to summarize what the model sees around a change is to average its internal signals across the neighborhood. EVEE instead uses a clever trick called a Covariance Probe.

Imagine you are looking at a crowd of people. A normal scanner might just count how many people are wearing red hats. But the Covariance Probe looks at the relationships between people. It notices: "Hey, whenever someone wears a red hat, they are also standing next to someone with a blue scarf, and they are both holding a specific type of umbrella."

In DNA terms, the model doesn't just look at the changed letter; it looks at how that change ripples through the surrounding neighborhood of letters. It captures the "vibe" or the "pattern" of the change. This allowed the researchers to build a detector that is highly accurate at spotting bad variants, whether they are single-letter swaps (SNVs) or chunks of missing or added text (indels).

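The Result: The detector scored 0.997 out of 1 on a standard ranking metric (AUROC) when separating harmful from harmless single-letter variants in the ClinVar database. This is not the same as being right 99.7% of the time: it means that a randomly chosen harmful variant outranks a randomly chosen harmless one about 99.7% of the time. On indels, it edged out established tools such as CADD.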

3. The "Zero-Shot" Magic

One of the coolest things about this tool is that it learned to spot single-letter errors, but then it automatically got really good at spotting missing or extra chunks of text (indels) without ever being trained on those specific types of errors.

It's like teaching a child to recognize a "dog" by showing them pictures of Golden Retrievers. Then, you show them a picture of a Chihuahua they've never seen before, and they say, "That's a dog too!" The model learned the concept of a broken instruction well enough to apply it to new types of breaks.

4. The "Translator" (Making it Interpretable)

Here is a major problem with most AI in medicine: it gives you a score (like "85% chance this is bad"), but it doesn't tell you why. Doctors can't rely on a black-box score alone for life-or-death decisions; they need evidence.

EVEE solves this with a two-step translation process:

The Detective Work: The system checks the variant against 251 different biological "checklists." Does this change alter a protein's shape? Does it damage a splice site (a marker that tells the cell where to cut and join gene segments)? Does it remove a critical switch? The result is a "disruption profile": a list of what the change appears to break.
The Storyteller: The researchers feed the top items from this list into a powerful AI language model (like a very smart journalist). This AI takes the technical data and writes a human-readable explanation.

Example: Instead of just saying "Pathogenic," the tool might say:

"This variant is likely harmful because it completely destroys the 'splice acceptor' site at the end of a gene segment. Imagine a train track where the switch is broken; the train (the cell's machinery) can't know where to stop, causing it to derail and produce a broken protein. This matches known patterns of disease in this gene."

5. The "Public Library" (EVEE Website)

The researchers didn't keep this tool to themselves. They built a free, interactive website called EVEE.

You can search for any of the 4.2 million genetic variants in the ClinVar database.
You can see the "disruption profile" (the list of broken parts).
You can read the AI-generated explanation in plain English.

Why This Matters

For years, scientists had to choose between accuracy (a very smart but confusing AI) and interpretability (a simple explanation that might be wrong).

This paper argues that you don't have to choose. By using the deep "understanding" of a genomic foundation model, the researchers created a system that is both a strong detective and a clear, articulate teacher. It turns an opaque score into a readable account of what likely went wrong, which could help doctors make sense of "Variants of Uncertain Significance" for their patients.

1. Problem Statement

The clinical interpretation of genetic variants remains a critical bottleneck in genomic medicine. Despite the exponential growth in sequencing data, the majority of observed variants are classified as Variants of Uncertain Significance (VUS). Existing computational tools face several limitations:

Scope Limitations: Protein-based models (e.g., AlphaMissense) are restricted to missense variants, while others focus only on regulatory non-coding effects.
Opacity: Meta-predictors like CADD integrate over 100 features with complex transformations, obscuring individual contributions and failing to provide human-readable explanations required by ACMG/AMP guidelines.
Lack of Unified Framework: No existing method offers a single framework that accurately predicts pathogenicity across all variant types (SNVs, indels, coding, non-coding) while simultaneously providing mechanistic, interpretable explanations.

2. Methodology

The authors propose EVEE (Evo Variant Effect Explorer), a framework leveraging Evo 2, a 7-billion-parameter genomic foundation model pretrained on DNA sequences across all domains of life. The methodology consists of three core components:

A. Covariance Probe for Pathogenicity Prediction

Instead of using standard mean-pooling of embeddings, the authors introduce a covariance probe:

Input Processing: Evo 2 processes reference and alternate DNA sequences (in both sense and antisense directions) to generate per-position token embeddings.
Representation: Rather than averaging embeddings (mean-pooling), the method computes the Gram matrix ($X^\top X$), where $X$ holds the per-position differences between variant and reference embeddings. This captures second-order structure (correlations between embedding dimensions and co-occurrence of features), which is lost in mean-pooling.
Compression: To handle the high dimensionality of the Gram matrix, a linear down-projection is applied to create a compressed covariance representation.
Training: A linear classifier is trained on this compressed covariance matrix to predict pathogenicity (Pathogenic/Likely Pathogenic vs. Benign/Likely Benign).
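
The probe pipeline described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy dimensions, the random projection, and the sigmoid output are all hypothetical. The sketch also assumes the down-projection is applied to the difference embeddings before the Gram matrix is formed, which is algebraically the same as compressing the full Gram matrix from both sides, since $(XP)^\top (XP) = P^\top X^\top X P$.

```python
import math
import random


def gram_matrix(rows):
    """Return X^T X for X given as a list of row vectors (positions x dims)."""
    dims = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(dims)]
            for i in range(dims)]


def covariance_features(ref_emb, alt_emb, projection):
    """Compressed second-order features for one variant (hypothetical helper).

    ref_emb, alt_emb: per-position embeddings, shape positions x d.
    projection: down-projection matrix, shape d x k (learned in practice,
    random in this toy example).
    """
    # X: per-position difference between variant and reference embeddings.
    diff = [[a - r for a, r in zip(alt_row, ref_row)]
            for alt_row, ref_row in zip(alt_emb, ref_emb)]
    k = len(projection[0])
    # Down-project each position from d to k dimensions: X P.
    low = [[sum(x * projection[i][j] for i, x in enumerate(row))
            for j in range(k)] for row in diff]
    gram = gram_matrix(low)  # k x k, equals P^T (X^T X) P
    # The Gram matrix is symmetric, so the upper triangle holds everything.
    return [gram[i][j] for i in range(k) for j in range(i, k)]


def pathogenicity_score(features, weights, bias):
    """Linear classifier on the flattened features, squashed to (0, 1)."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Toy usage: 8 positions, d = 6 embedding dims, k = 3 projected dims.
rng = random.Random(0)
ref = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(8)]
alt = [[v + rng.gauss(0, 0.1) for v in row] for row in ref]
proj = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(6)]
feats = covariance_features(ref, alt, proj)
weights = [0.5] * len(feats)  # untrained placeholder weights
print(len(feats), pathogenicity_score(feats, weights, bias=-1.0))
```

Mean-pooling would reduce the same difference matrix to a single vector of length d, discarding how embedding dimensions move together across positions; the Gram matrix keeps those pairwise products, which is the second-order structure the text refers to.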

B. Supervised Annotation Disruption Profiling

To achieve interpretability, the authors train supervised annotation probes on Evo 2's reference embeddings to predict a panel of 251 biological annotations, including:

Protein structural features (secondary structure, disorder).
Regulatory marks (histone modifications, chromatin breadth).
Protein domains and post-translational modifications.
Genomic region identity and splice site probabilities.

For each variant, a disruption profile is generated by calculating the delta ($\Delta$) between the predicted annotations of the variant sequence and the reference sequence. This captures both local effects (at the mutation site) and effects at neighboring positions (up to 5 flanking positions).
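
The delta computation can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name, the dictionary layout of per-position probe scores, the annotation names, and the choice to summarize the flanking window by its largest absolute change. It also assumes the reference and variant score tracks are already aligned position by position, which holds for SNVs but would need extra handling for indels.

```python
def disruption_profile(ref_pred, alt_pred, variant_pos, flank=5):
    """Delta between variant and reference annotation scores (toy sketch).

    ref_pred, alt_pred: dict mapping annotation name -> list of per-position
    probe scores for the reference and variant sequences.
    Returns, per annotation, the delta at the variant site ("local") and the
    largest-magnitude delta within +/- flank positions ("window").
    """
    profile = {}
    for name, ref_scores in ref_pred.items():
        alt_scores = alt_pred[name]
        lo = max(0, variant_pos - flank)
        hi = min(len(ref_scores), variant_pos + flank + 1)
        deltas = [alt_scores[i] - ref_scores[i] for i in range(lo, hi)]
        profile[name] = {
            "local": alt_scores[variant_pos] - ref_scores[variant_pos],
            # Keep the sign so a loss and a gain remain distinguishable.
            "window": max(deltas, key=abs),
        }
    return profile


# Toy usage with made-up scores for two hypothetical annotations.
ref_pred = {"splice_acceptor": [0.1, 0.1, 0.9, 0.1], "helix": [0.5, 0.5, 0.5, 0.5]}
alt_pred = {"splice_acceptor": [0.1, 0.1, 0.2, 0.1], "helix": [0.5, 0.5, 0.5, 0.6]}
profile = disruption_profile(ref_pred, alt_pred, variant_pos=2, flank=1)
for name, entry in profile.items():
    print(name, round(entry["local"], 2), round(entry["window"], 2))
```

In this toy input the splice acceptor score drops sharply at the variant site itself, while the helix score changes only at a neighboring position, so it shows up in the window value but not in the local one.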

C. LLM-Based Synthesis

The top 10 disruptions (ranked by magnitude) and variant metadata are fed into a frontier reasoning Large Language Model (LLM), specifically Claude Opus 4.6. The LLM synthesizes these structured data points into a natural language explanation that contextualizes the molecular mechanism of the variant, mimicking the evidence categorization required for clinical classification.
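
The ranking and hand-off step might look roughly like the sketch below. The prompt wording, the metadata fields, the placeholder gene and variant names, and the helper names are all hypothetical; the source does not describe the actual prompt, and no LLM call is made here.

```python
def top_disruptions(profile, n=10):
    """Rank annotations by absolute delta, largest first, and keep n."""
    ranked = sorted(profile.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


def build_prompt(metadata, profile, n=10):
    """Assemble a plain-text prompt; the wording here is illustrative only."""
    lines = [
        "Explain the likely molecular mechanism of this variant.",
        f"Gene: {metadata['gene']}",
        f"Variant: {metadata['hgvs']}",
        f"Predicted consequence: {metadata['consequence']}",
        "Top predicted disruptions (variant minus reference):",
    ]
    for name, delta in top_disruptions(profile, n):
        direction = "gain" if delta > 0 else "loss"
        lines.append(f"- {name}: {delta:+.2f} ({direction})")
    return "\n".join(lines)


# Toy usage: 12 made-up annotation deltas, of which the 10 largest are kept.
profile = {f"annotation_{i}": ((-1) ** i) * i / 20 for i in range(1, 13)}
metadata = {  # placeholder values, not a real variant
    "gene": "GENE1",
    "hgvs": "c.100A>G",
    "consequence": "splice_acceptor_variant",
}
prompt = build_prompt(metadata, profile)
print(prompt)
```

Passing only a short, ranked list keeps the language model's input structured and bounded, so its explanation is tied to specific predicted disruptions and not to the raw embeddings.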

3. Key Contributions

Unified Variant Effect Prediction: A single covariance probe achieves state-of-the-art performance across all variant consequence types (missense, synonymous, nonsense, splice, UTR, intronic, and indels).
Zero-Shot Generalization: The model, trained exclusively on Single Nucleotide Variants (SNVs), generalizes zero-shot to indels with high accuracy, demonstrating that Evo 2 representations capture general principles of sequence disruption.
Interpretability as a Product: The framework reframes interpretability from a trade-off to a complementary output. It moves beyond opaque scores to provide mechanistic, human-readable explanations derived directly from learned biological structures.
EVEE Web Resource: The authors released an interactive web tool providing pre-computed predictions and on-demand explanations for 4.2 million ClinVar variants.

4. Results

Performance Metrics

SNV Prediction: Achieved an overall AUROC of 0.997 on 833,970 ClinVar SNVs. Performance remained high across specific types:
- Missense: 0.971
- Synonymous: 0.961
- Nonsense: 0.900
- Splice: 0.924
Indel Prediction (Zero-Shot): Achieved an overall AUROC of 0.986 on 73,961 ClinVar indels without specific indel training.
- Outperformed CADD v1.7 (0.980) and NTv3 (0.828).
- Performance was robust across insertion/deletion sizes (1bp to >20bp).
Conservation Robustness: Unlike CADD and GPN-MSA, which degrade at conservation extremes, the Evo 2 probe maintained high performance across fast-evolving to highly conserved sites, suggesting it encodes functional constraints complementary to phylogenetic conservation.
Deep Mutational Scanning (DMS) Transfer: The model successfully transferred to experimental DMS datasets for BRCA1, BRCA2, TP53, and LDLR. The covariance probe showed strong correlation with functional scores (e.g., $|\rho| \approx 0.70$ for TP53), outperforming loss-based scoring and matching or exceeding AlphaMissense and CADD.
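
The AUROC values reported above measure ranking quality, not accuracy: AUROC is the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one, with ties counted as half. The sketch below computes it directly from that definition. The function name and the toy labels and scores are illustrative, and the pairwise loop is written for clarity, not for datasets the size of ClinVar.

```python
def auroc(labels, scores):
    """AUROC from its pairwise definition (labels: 1 = pathogenic, 0 = benign)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    # Count pairs where the pathogenic variant outranks the benign one.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy usage with made-up scores.
print(auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # perfect separation
print(auroc([1, 0, 1, 0], [0.9, 0.8, 0.2, 0.1]))  # one pathogenic variant ranked low
```

Because the metric depends only on rank order, a score of 0.997 does not by itself say how well calibrated the predicted probabilities are or which threshold would be appropriate for classification.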

Interpretability Evaluation

Using an "LLM-as-a-judge" approach against expert-reviewed ClinVar evidence, the system achieved a composite score of 3.89/5.
The inclusion of Evo 2 probe predictions provided the largest performance gain (+1.02) compared to adding gene names or HGVS notation alone.
The generated explanations successfully identified specific molecular mechanisms (e.g., splice acceptor loss, domain disruption) consistent with expert submissions.

5. Significance

This work establishes that genomic foundation model embeddings can serve as a unified substrate for both accurate variant effect prediction and mechanistic interpretation.

Clinical Impact: By providing categorized, human-readable evidence rather than opaque scores, EVEE addresses a key requirement for clinical variant classification under ACMG/AMP guidelines, potentially reducing the number of VUS.
Scientific Insight: The success of the covariance probe suggests that second-order embedding structures capture biological information (functional constraints) that is distinct from and complementary to simple evolutionary conservation or next-token likelihood.
Future Direction: The framework demonstrates a pathway for integrating AI-driven predictions with clinical workflows, moving toward a future where variant interpretation is automated, scalable, and transparent.

Limitations: The model is optimized for strong deleterious effects (Mendelian variants) and may be less calibrated for subtle polygenic effects. Additionally, the interpretability relies on known annotations; truly novel molecular mechanisms may require unsupervised discovery methods (e.g., sparse autoencoders) in future iterations.