Understanding protein function with a multimodal retrieval-augmented foundation model

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine proteins as the master chefs of life. They are long chains of ingredients (amino acids) that fold into complex 3D shapes to cook up everything from your immune system's defenses to the enzymes that digest your lunch.

For a long time, scientists have tried to predict what happens if you swap one ingredient for another, or if you accidentally drop a whole handful of ingredients in or out of the recipe. This is crucial for curing diseases or designing new medicines, but it's incredibly hard.

Enter PoET-2, a new AI model from OpenProtein.AI that acts like a super-smart, multilingual cook who has read millions of cookbooks and learned the "rules of the kitchen" from them.

Here is how PoET-2 works, explained through simple analogies:

1. The Problem: The "One-Size-Fits-All" Cookbook

Previous AI models (like the old generation of recipe books) were great at reading a single recipe and guessing what would happen if you changed one ingredient. But they struggled with:

Adding or removing ingredients: If you add a whole new paragraph to a recipe or delete a sentence, the old models got confused.
The "Butterfly Effect": In cooking, changing one spice might change how another spice tastes. Old models couldn't see these hidden connections between multiple changes.
Data Hunger: To learn a new dish, they usually needed thousands of examples. In the real world, scientists often only have a few test results.

2. The Solution: PoET-2's "Three Superpowers"

PoET-2 solves these problems with three clever tricks:

A. The "Family Reunion" (Retrieval-Augmentation)

Imagine you want to bake a new type of sourdough bread. Instead of just guessing, you walk into a room filled with 100 different bakers who have all made sourdough before. You ask them, "Hey, if I add more salt, what happens?"

PoET-2 does this digitally. Instead of memorizing every single recipe in its brain (which takes up too much space), it looks up similar recipes (protein families) from its database while it's thinking. It learns the specific "rules" of that family of proteins on the fly. This allows it to be incredibly smart without needing a massive brain (it's actually quite small and efficient).

B. The "3D Blueprint" (Multimodal Learning)

Most AI models only read the text of the recipe (the sequence of letters). PoET-2 reads the text AND looks at the 3D blueprint of the finished dish.

Analogy: It's like knowing that if you put a heavy stone on a cake, the cake will collapse. PoET-2 understands that the shape of the protein matters just as much as the order of the ingredients. It can even be given a partial 3D shape and asked, "What ingredients fit here?"

C. The "Two-Way Translator" (Dual Decoders)

PoET-2 has two different "minds" working together:

The Storyteller (Generative): This part reads the recipe from start to finish. It's great at creating new proteins or predicting what happens if you change the length of the chain (insertions/deletions).
The Critic (Bidirectional): This part reads the whole recipe at once, looking at the beginning, middle, and end simultaneously. It's great at understanding the deep meaning and function of the protein.

By combining the two, PoET-2 can both create new designs and analyze existing ones.

3. What Can PoET-2 Actually Do?

Predicting "Disasters" (Zero-Shot Prediction): If a scientist finds a mutation in a human gene that causes a disease, PoET-2 can look at it and say, "This is bad," without ever having seen that specific mutation before. It's like a mechanic hearing a car engine sputter and knowing exactly which part is broken, even if they've never seen that specific car model.
Handling "Messy" Changes: It predicts what happens when you add or delete chunks of a protein (like adding a whole paragraph to a story) more accurately than earlier models, many of which could not score such changes at all.
Learning from Few Examples: If you only have about 100 test results for a new enzyme, PoET-2 can learn from them and make predictions about as accurate as an older model (ESM-2) trained on roughly 2,600. It's like a student who can learn a new language after hearing just a few sentences, while others need a whole textbook.

4. Why Does This Matter?

Think of PoET-2 as a universal translator for the language of life.

For Doctors: It helps identify which genetic mutations are dangerous, speeding up diagnoses.
For Engineers: It helps design new proteins that can eat plastic, create new medicines, or make biofuels, all by "hallucinating" (generating) new, stable recipes that nature hasn't tried yet.

In short: PoET-2 is a lightweight, highly efficient AI that doesn't just memorize recipes; it understands the chemistry of cooking. By looking up similar dishes and understanding the 3D shape of the food, it can predict how to fix broken proteins or invent new ones, even with very little data.

1. Problem Statement

Protein Language Models (PLMs) have shown promise in predicting the effects of mutations on protein function (fitness). However, existing approaches face four critical limitations:

Inability to handle complex mutations: Most PLMs rely on Masked Language Modeling (MLM), which is limited to predicting single substitution mutations. They struggle with insertions/deletions (indels) and higher-order mutations where epistatic (non-additive) effects are significant.
Data inefficiency in supervised settings: While zero-shot prediction is useful, protein engineering often requires learning from limited experimental data (few-shot). Current models often require massive datasets to generalize well to unseen sequence positions.
Scaling limitations: Simply increasing model parameters (scaling) has improved structure prediction but has shown neutral or negative impacts on fitness modeling and function prediction, leading to concerns about overfitting and high computational costs.
Modality gaps: Existing models typically integrate either sequence, structure, or evolutionary context (retrieval), but rarely combine all three effectively in a single architecture.

2. Methodology: PoET-2 Architecture

The authors propose PoET-2, a multimodal, retrieval-augmented protein foundation model designed to learn generative distributions over protein sequences conditioned on homologous sequences and structures.

Core Architectural Components

Multimodal Inputs: PoET-2 processes:
- Sequence: Amino acid tokens.
- Structure: Backbone atomic coordinates (N, Cα, C) encoded as pairwise Cα distances and local backbone distances.
- Confidence: Predicted Local Distance Difference Test (pLDDT) scores from AlphaFold.
Retrieval-Augmented (Context-Aware) Encoder:
- Instead of training a massive model to memorize all constraints, PoET-2 uses a hierarchical transformer encoder that processes a "prompt" consisting of a set of homologous proteins (the context) and an optional "query" (a partially specified protein).
- Equivariance: The encoder is fully equivariant to the order of proteins in the prompt, meaning the model learns family-specific evolutionary constraints regardless of the order in which homologs are presented.
- Structure-Based Attention Bias: Within individual sequences, attention scores are biased by 3D structural proximity (discretized Cα distances) rather than just linear sequence position.
Dual Decoder Architecture:
1. Autoregressive Decoder (CLM): Trained with Causal Language Modeling. It generates sequences token-by-token and calculates exact log-likelihoods. This is crucial for zero-shot prediction of indels and higher-order mutations, as it models the full joint probability $P(\text{sequence} \mid \text{prompt})$.
2. Bidirectional Decoder (MLM): Trained with Masked Language Modeling. It produces rich, context-aware embeddings for supervised learning, capturing global dependencies essential for function prediction.
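
The structure-based attention bias mentioned in the encoder description can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the bin edges (`BIN_EDGES`), the per-bin bias values (`BIN_BIAS`, which would be learned in a real model), and the function names are hypothetical and are not taken from the preprint. The sketch only shows the general idea of discretizing Cα distances and adding a per-bin bias to attention logits before the softmax.

```python
import math

# Hypothetical bin edges in angstroms; the preprint's actual discretization is not given here.
BIN_EDGES = [4.0, 8.0, 12.0, 16.0]
# One bias per bin (len(BIN_EDGES) + 1). Learned in a real model, hand-picked for this sketch.
BIN_BIAS = [2.0, 1.0, 0.0, -1.0, -2.0]


def distance_bin(d):
    """Return the index of the first bin whose upper edge exceeds d."""
    for i, edge in enumerate(BIN_EDGES):
        if d < edge:
            return i
    return len(BIN_EDGES)


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(logits, ca_coords):
    """Add a distance-bin bias to raw attention logits, then normalize each row."""
    n = len(ca_coords)
    weights = []
    for i in range(n):
        row = []
        for j in range(n):
            d = math.dist(ca_coords[i], ca_coords[j])
            row.append(logits[i][j] + BIN_BIAS[distance_bin(d)])
        weights.append(softmax(row))
    return weights


# Toy chain: residue 3 is closer in space to residue 0 than residue 2 is.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
flat_logits = [[0.0] * 4 for _ in range(4)]  # no content-based preference
attn = biased_attention(flat_logits, coords)
```

With flat logits, the only signal is spatial: residue 0 attends more to the nearby residue 3 than to the distant residue 2, even though residue 3 lies further along the chain.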

Training Strategy

Data: Trained on 62 million sets of homologous sequences (UniRef50) with optional predicted structures from AlphaFoldDB.
Objectives: The model minimizes a combined loss: $L = L_{\text{MLM encoder}} + L_{\text{CLM decoder}} + L_{\text{MLM decoder}}$.
Prompt Engineering: The model is conditioned on a "context" (a subsample of homologs) and a "query" (specific constraints like length, active sites, or structure). This allows for in-context learning of family-specific constraints without retraining.
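
For readers who want the two objective types in the combined loss spelled out, their standard textbook forms are sketched below. The notation is an assumption made here for illustration ($x$ for a sequence of length $n$, $c$ for the prompt, $M$ for the set of masked positions); the preprint's exact masking scheme and any weighting between terms may differ.

$$
\begin{aligned}
L_{\text{CLM}} &= -\sum_{i=1}^{n} \log P\left(x_i \mid x_{<i},\, c\right) \\
L_{\text{MLM}} &= -\sum_{i \in M} \log P\left(x_i \mid x_{\setminus M},\, c\right)
\end{aligned}
$$

The causal term sums over every position and so yields a full sequence log-likelihood, while the masked term only scores the masked positions given the rest of the sequence.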

3. Key Contributions

First Unified Multimodal PLM: PoET-2 is the first model to effectively combine retrieval augmentation (using unaligned homologs), sequence generation, and structure conditioning in a single, parameter-efficient framework (182M parameters).
Indel and Higher-Order Mutation Prediction: By using an autoregressive decoder conditioned on homologs, PoET-2 can naturally score sequences of variable lengths (indels) and model complex epistatic interactions, a capability that fixed-length, MLM-based models lack.
State-of-the-Art Data Efficiency: In supervised few-shot settings, PoET-2 embeddings significantly outperform larger models (e.g., ESM-2, 650M params) using substantially less training data.
Efficiency: Despite its capabilities, PoET-2 is compact (182M parameters), requiring minimal GPU resources compared to billion-parameter models.
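
To make the indel-scoring idea concrete, here is a toy sketch of scoring variable-length variants by a log-likelihood ratio against the wild type. The model is deliberately not PoET-2: a smoothed bigram model fitted on a handful of made-up homologs stands in for the prompt-conditioned autoregressive decoder. The sequences and the names `fit_bigram`, `log_likelihood`, and `variant_score` are all hypothetical. What carries over is the chain rule, in which a sequence's log-likelihood is the sum of per-token conditional log-probabilities, including a stop token.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
START, STOP = "^", "$"


def fit_bigram(homologs):
    """Count token-to-token transitions in the homolog set (toy stand-in for a prompt)."""
    counts = {}
    for seq in homologs:
        tokens = [START] + list(seq) + [STOP]
        for prev, nxt in zip(tokens, tokens[1:]):
            row = counts.setdefault(prev, {})
            row[nxt] = row.get(nxt, 0) + 1
    return counts


def log_likelihood(seq, counts):
    """Chain rule: sum of log P(token | previous token), including the stop token."""
    vocab_size = len(ALPHABET) + 1  # 20 residues plus STOP
    tokens = [START] + list(seq) + [STOP]
    total = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        row = counts.get(prev, {})
        # Add-one smoothing so unseen transitions keep a nonzero probability.
        p = (row.get(nxt, 0) + 1) / (sum(row.values()) + vocab_size)
        total += math.log(p)
    return total


def variant_score(variant, wild_type, counts):
    """Log-likelihood ratio of a variant relative to the wild type."""
    return log_likelihood(variant, counts) - log_likelihood(wild_type, counts)


# Made-up homologs and variants, for illustration only.
homologs = ["MKTAYIA", "MKTAYLA", "MKSAYIA"]
counts = fit_bigram(homologs)
wild_type = "MKTAYIA"
insertion = "MKTWAYIA"  # one residue inserted
deletion = "MKAYIA"     # one residue deleted

for name, seq in [("insertion", insertion), ("deletion", deletion)]:
    print(name, round(variant_score(seq, wild_type, counts), 3))
```

The toy numbers carry no biological meaning. The point is only that sequences of different lengths each receive a well-defined log-likelihood without requiring an alignment, which is what makes an autoregressive decoder a natural fit for indels.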

4. Experimental Results

The model was evaluated on the ProteinGym benchmark, covering Deep Mutational Scanning (DMS) and clinical variant datasets.

Zero-Shot Variant Effect Prediction:
- Indels: PoET-2 achieves a Spearman correlation ($\rho$) of 0.566 on DMS indels, outperforming the previous best (PoET-1) by ~0.05 and non-PoET models by ~0.10 (>20% improvement).
- Higher-Order Mutations: It excels at predicting variants with 3+ mutations, outperforming ensemble methods like VenusREM on complex variants.
- Clinical Variants: Sets new state-of-the-art AUROC for both substitutions (0.928) and indels (0.952) in predicting pathogenicity.
- Ensemble: Combining PoET-2 with VenusREM yields the best performance reported across the metrics evaluated, suggesting the two models capture complementary signals.
Supervised Few-Shot Learning:
- When used as a feature extractor for a Gaussian Process (GP) regressor, PoET-2 outperforms Kermut (the previous SOTA) and ESM-2 across all cross-validation split schemes (Random, Modulo, Contiguous).
- Data Efficiency: A PoET-2 GP trained with only 100 data points matches the performance of an ESM-2 GP trained with ~2,600 points. This highlights its superior ability to generalize from limited data.
Ablation on Structure Conditioning:
- Structure conditioning significantly boosts zero-shot stability prediction.
- However, for supervised learning and clinical variant prediction, explicit structure conditioning offers little to no benefit, suggesting the embeddings already implicitly encode structural information.
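
Since most of the results above are reported as Spearman correlations, a small self-contained sketch of that metric may help. The scores and measurements below are invented for illustration, and the helper names are hypothetical. The last lines also check the arithmetic behind the ">20% improvement" figure, with the baseline inferred from the stated ~0.10 margin rather than taken from the preprint directly.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))


# Invented model scores and assay measurements for five variants.
predicted = [-3.2, -0.5, -1.1, -4.0, 0.2]
measured = [0.10, 0.80, 0.90, 0.05, 1.00]
rho = spearman(predicted, measured)

# Relative gain implied by 0.566 versus a baseline about 0.10 lower (inferred).
baseline = 0.566 - 0.10
relative_gain = 0.10 / baseline
print(round(rho, 3), round(relative_gain, 3))
```

Because the metric depends only on rank order, a model can score well without its raw outputs being calibrated to the assay's units, which is why it suits zero-shot log-likelihood scores.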

5. Significance and Conclusion

PoET-2 provides evidence that, for fitness and function prediction, retrieval augmentation and multimodal conditioning can be more effective than simply scaling model size.

Practical Impact: It enables accurate prediction of complex mutations (indels, epistasis) that were previously intractable for standard PLMs.
Accessibility: Its small footprint (182M params) makes high-accuracy protein engineering accessible on standard hardware, democratizing access to advanced design tools.
Future Directions: The work suggests that future protein models should focus on integrating diverse data sources (sequence, structure, homologs) via retrieval and in-context learning rather than solely increasing parameter counts.

In summary, PoET-2 achieves state-of-the-art performance in both zero-shot and supervised protein function prediction while maintaining high data efficiency and computational accessibility, addressing key challenges in mutation effect prediction for drug discovery and protein engineering.