Beyond additivity: zero-shot methods cannot predict impact of epistasis on protein properties and function

This study reveals that while current zero-shot models effectively predict the impact of single mutations and non-epistatic combinations, they fail to accurately forecast the effects of strongly epistatic mutation combinations, highlighting a critical need for methods capable of capturing complex mutational interactions.

Original authors: Kolchina, A., Dubanevics, I., Kondrashov, F. A., Kalinina, O. V.

Published 2026-02-18

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: The "Recipe" Problem

Imagine you are a master chef trying to predict how a dish will taste if you change the ingredients.

  • Single Mutation: If you take a standard chocolate chip cookie recipe and swap the butter for margarine, you can probably guess the result. It might be a bit less rich, but it's still a cookie.
  • Epistasis (The Twist): Now, imagine you swap the butter and the sugar and the baking soda all at once. You might expect the result to be just "less rich + sweeter + flatter." But in reality, these ingredients might react with each other in a wild, unpredictable way. Maybe the new mix creates a completely different texture, or the cookie turns into a brick.

In biology, this "unpredictable reaction" is called epistasis. It happens when the effect of one genetic mutation depends entirely on what other mutations are already there.
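This "unpredictable reaction" can be written down as a simple deviation from additivity. Here is a minimal sketch in Python, with made-up fitness numbers purely for illustration (the mutation names and values are hypothetical, not from the paper):

```python
# Hypothetical fitness effects of two single mutations (log scale).
# These numbers are invented for illustration only.
single_effects = {"A24G": -0.8, "L99V": -0.5}

# Additive (non-epistatic) expectation: the effects simply sum.
expected_double = sum(single_effects.values())  # -1.3

# Suppose the measured effect of the double mutant is much worse.
observed_double = -3.1

# Epistasis is the gap between what was measured and what
# additivity predicted. A large gap means strong interaction.
epistasis = observed_double - expected_double  # -1.8
```

When `epistasis` is near zero, the mutations act independently and simple models do fine; when it is large, as in this toy example, the combination behaves like the "brick" cookie.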

The Study: Can AI Predict the "Brick"?

The authors of this paper asked a very important question: Can modern Artificial Intelligence (AI) predict these complex, multi-mutation outcomes?

They tested 95 different AI models (specifically "zero-shot" models, which are like AI chefs that have read millions of recipes but have never actually cooked a specific dish before). They used a massive database called ProteinGym, which contains real-world experimental data on how proteins behave when their "recipes" are changed.

The Results: The AI is Good at Simple Things, Bad at Complex Ones

The findings were a bit disappointing for the AI community, but very clear:

  1. The AI is great at single changes: If you change just one "ingredient" (amino acid) in a protein, the AI can usually predict if the protein will still work or break. It's like the AI knowing that swapping salt for sugar makes a dish too sweet.
  2. The AI fails at combinations: When they asked the AI to predict what happens when multiple ingredients are changed at once (especially when those changes interact strongly), the AI got lost.
    • The Metaphor: Imagine the AI is trying to navigate a mountain range (the "fitness landscape"). It knows how to walk up a single hill. But when the terrain gets rugged with deep valleys and hidden peaks (caused by epistasis), the AI falls into the valleys. It cannot see the path from one peak to another because it doesn't understand how the mutations interact to reshape the terrain.

The "Zero-Shot" Limitation

The models tested are called "Zero-Shot." Think of them as students who have memorized the entire encyclopedia of biology but have never practiced on the specific exam in front of them: they predict a protein's behavior without ever being trained on experimental data for that protein.

  • They are very smart about what "looks natural" because they've seen millions of natural proteins.
  • However, they don't understand the physics or the chemistry of how two specific mutations might crash into each other and cause a disaster. They are guessing based on patterns, not on understanding the underlying rules of the game.

The Surprising Discovery: Simple is Better

The most interesting part of the paper is what worked better than the fancy AI.

The researchers built two very simple baselines:

  1. Linear Regression: A basic math formula that just adds up the effects of single mutations.
  2. A Simple Neural Network: A slightly more complex but still basic calculator.
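To make the first baseline concrete, here is a rough sketch of an additive linear model: one learned coefficient per mutation, with a combination predicted by summing coefficients. The mutation names, fitness scores, and training loop are all invented for illustration; this is not the authors' actual code.

```python
# Toy additive baseline: one coefficient per mutation,
# predictions are the sum of coefficients for mutations present.
# All data below are made up for illustration.
mutations = ["A24G", "L99V", "K10R"]

# Training data: (set of mutations, measured fitness score).
train = [
    ({"A24G"}, -0.8),
    ({"L99V"}, -0.5),
    ({"K10R"}, 0.2),
    ({"A24G", "K10R"}, -0.6),  # behaves additively: -0.8 + 0.2
]

# Fit coefficients by simple stochastic gradient descent
# on squared error (equivalent to linear regression here).
coef = {m: 0.0 for m in mutations}
for _ in range(2000):
    for muts, y in train:
        err = sum(coef[m] for m in muts) - y
        for m in muts:
            coef[m] -= 0.05 * err

# Predict an unseen double mutant by adding up learned effects.
pred_double = sum(coef[m] for m in {"A24G", "L99V"})  # about -1.3
```

The model can only ever add effects up, so it is exactly right when mutations are independent and exactly wrong, by the size of the epistasis, when they interact.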

The Shock: In many cases, these simple, "dumb" models performed just as well as, or even better than, the most advanced, complex AI models when predicting epistasis.

The Lesson: It turns out that clever data handling (understanding the specific features of the protein, like its 3D shape or evolutionary history) is more important than having a massive, complex AI architecture. The best-performing models weren't necessarily the deepest neural networks; they were the ones that paid attention to the right clues (like the protein's 3D structure).

Why Does This Matter?

This is a wake-up call for scientists who want to:

  • Design new medicines: If we can't predict how multiple mutations interact, we can't reliably design new proteins to fight viruses or cancer.
  • Understand evolution: Evolution often relies on these complex interactions to move from one functional protein to another. If our AI tools can't see these paths, we are blind to how proteins evolve.

The Bottom Line

Current AI tools are like excellent spell-checkers for single words, but they are terrible at editing a whole paragraph where the meaning of one word changes the meaning of the next.

To move forward, scientists need to stop just making bigger AI models and start focusing on better data and understanding the complex interactions between mutations. We need to teach the AI not just what a protein looks like, but how it feels when you change its parts.
