PRIZM: Combining Low-N Data and Zero-shot Models to Design Enhanced Protein Variants

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a chef trying to invent a new, super-delicious soup. You have a massive library of recipes (the "foundation models"), but you don't know which one will work best for your specific ingredients. You also have a tiny taste-test group of only 20 people (your "low-N data").

Traditionally, chefs had two bad options:

The Hard Way (Supervised Learning): Hire a team of data scientists to build a custom recipe from scratch. This takes a lot of money, time, and requires you to taste-test hundreds of batches just to train the team. If you only have 20 tasters, the team gets confused and makes bad guesses.
The Guessing Game (Zero-Shot Modeling): Just pick a famous recipe book at random and hope it works. The problem is, there are thousands of recipe books. Some are great for soups, some for cakes, and some are terrible. Without a way to test them, you might pick a book that is perfect for cakes but makes your soup taste like mud.

Enter PRIZM: The "Smart Taste-Tester"

The paper introduces PRIZM (Protein Ranking using Informed Zero-shot Modelling). Think of PRIZM as a super-smart sous-chef who solves both problems.

How PRIZM Works (The Two-Phase Kitchen)

Phase 1: The "Taste-Test" (Model Selection)
You have your tiny group of 20 tasters (your existing experimental data). Instead of trying to build a new AI, PRIZM takes your 20 samples and runs them through all the different recipe books (the pre-trained AI models) at once.

It asks: "Which recipe book's predictions actually match what our 20 tasters liked?"
It quickly identifies the "Best Book" for your specific soup. Maybe Book A is great for spicy soups, but Book B is the winner for your sweet soup.
The Magic: You only need about 20 samples to figure this out. You don't need to train a new model; you just find the one that already knows the most about your specific problem.

Phase 2: The "Menu Creation" (Variant Selection)
Once PRIZM finds the "Best Book," it uses that specific book to scan a library of millions of potential new recipes (a digital library of protein mutations).

It ranks them from "Most Likely to be Delicious" to "Most Likely to be Disgusting."
You then go to the lab and cook only the top 5 or 10 recipes.
The Result: Because you picked the right "Book" in Phase 1, your chances of finding a hit are much better than if you had picked recipes at random, even though you only tested a handful of new samples.

Real-World Examples from the Paper

The authors tested this "Smart Sous-Chef" on two real biological problems:

The Heat-Resistant Enzyme (Sucrose Synthase):
- The Goal: Make an enzyme that doesn't break down when it gets hot (like a soup that stays hot without curdling).
- The Data: They had a small list of 68 previous experiments.
- The PRIZM Win: PRIZM picked the best "recipe books" and suggested two new mutations. One of them raised the enzyme's apparent melting temperature by about 3°C and let it keep more than 60% of its activity at 60°C, compared with about 23% for the original. It found a winner that human experts had missed!
The Sugar-Transfer Enzyme (Glycosyltransferase):
- The Goal: Make an enzyme that works better at adding sugar to a medicine (to make it dissolve better in water).
- The Data: They had a tiny list of only 8 previous experiments. This is a very small sample size!
- The PRIZM Win: Even with only 8 samples, PRIZM figured out which AI model to trust. It suggested new mutations, and 60% of them worked better than the original. Two mutations at the same position each made the enzyme about 20% more active.

Why This Matters

For Non-Experts: You don't need to be a machine learning wizard. You don't need to build complex models. You just need a small amount of data to let PRIZM do the heavy lifting of choosing the right tool.
For Experts: It saves time and money. Instead of running expensive, failed experiments, you use PRIZM to filter out the bad ideas before you ever touch a test tube.
The Big Picture: PRIZM bridges the gap between "guessing" (using AI blindly) and "over-engineering" (building custom AI from scratch). It lets us use the massive knowledge of giant AI models with just a tiny drop of real-world data.

In short: PRIZM is like having a magic compass. You give it a tiny map of where you've been (your small data), and it points you to the best path forward through the vast forest of possibilities, ensuring you find the treasure (the perfect protein) without getting lost.

1. Problem Statement

Protein engineering relies on navigating the vast sequence space to find variants with improved functions. While Machine Learning (ML) has accelerated this process, two primary barriers exist for non-experts and data-constrained scenarios:

Supervised Learning Limitations: Traditional ML approaches (e.g., ML-assisted directed evolution) require large, high-quality labeled datasets to avoid overfitting. In "low-N" settings (fewer than ~50 variants), there are too few data points for a robust train-test split, and retraining a model for each new target requires significant ML expertise.
Zero-Shot Modeling Limitations: Large pre-trained protein foundation models can predict variant effects without task-specific training (zero-shot). However, the abundance of available models makes it non-trivial to select the single best model for a specific protein property. Global benchmarks often fail to reflect performance on specific targets, and zero-shot models cannot adapt to specific engineering objectives without data.

The Gap: There is a lack of a principled, accessible workflow that leverages the general knowledge of foundation models while utilizing small, existing experimental datasets to guide model selection, without requiring complex model retraining.

2. Methodology: The PRIZM Workflow

The authors introduce PRIZM (Protein Ranking using Informed Zero-shot Modelling), a two-phase framework designed to identify the optimal pre-trained zero-shot model for a specific target and property using minimal experimental data.

Phase 1: Model Selection (Ranking)

Input: A small experimental dataset ("low-N," typically 20–50 variants) containing sequences and measured properties (e.g., stability, activity).
Process:
1. The workflow processes the wild-type (WT) sequence, structure (predicted via AlphaFold3 if unavailable), and Multiple Sequence Alignment (MSA) to generate inputs for a library of 25 pre-trained zero-shot models (covering sequence, MSA, structure, and hybrid inputs).
2. Each model generates zero-shot scores for the variants in the low-N dataset.
3. Performance Metric: For each model, the workflow compares the model scores with the experimental values using two metrics: the Absolute Spearman Correlation (ranking ability) and the Average Precision (how well the scores separate variants above a user-defined threshold, e.g., WT performance, from the rest).
4. These metrics are normalized and combined to rank the models. The system identifies the best and worst performers across different input modalities (sequence, MSA, structure).
Key Feature: No model fine-tuning occurs; the experimental data is used solely to select the best pre-trained predictor.
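
The Phase 1 steps above can be sketched in a few lines of Python. This is a minimal illustration, not the PRIZM code: the function names, the toy numbers, the min-max normalization, and the equal-weight average of the two metrics are all assumptions, since the summary only says that the metrics are "normalized and combined". It also assumes that a higher model score means a better variant when computing Average Precision.

```python
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def average_precision(labels: Sequence[bool], scores: Sequence[float]) -> float:
    """Mean of precision@k over the positions k where a positive is found."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def min_max(values: Dict[str, float]) -> Dict[str, float]:
    """Rescale values to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}


def rank_models(
    measured: Sequence[float],
    model_scores: Dict[str, Sequence[float]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Rank models by the mean of normalized |Spearman| and Average Precision."""
    labels = [m > threshold for m in measured]
    rho = {n: abs(spearman(s, measured)) for n, s in model_scores.items()}
    ap = {n: average_precision(labels, s) for n, s in model_scores.items()}
    rho_n, ap_n = min_max(rho), min_max(ap)
    combined = {n: (rho_n[n] + ap_n[n]) / 2 for n in model_scores}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy low-N dataset: six variants with a measured property; WT value is 1.0.
measured = [0.2, 0.5, 1.0, 1.4, 0.9, 1.8]
wt_value = 1.0
# Hypothetical zero-shot scores from three placeholder models.
model_scores = {
    "model_a": [-3.0, -2.0, -1.0, 0.5, -1.5, 1.0],
    "model_b": [0.1, 0.9, 0.3, 0.4, 0.8, 0.5],
    "model_c": [0.1, 0.3, 0.2, 0.9, 0.4, 0.6],
}
ranking = rank_models(measured, model_scores, wt_value)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Because the experimental data only enters through `rank_models`, no model parameters are updated, which mirrors the "no fine-tuning" point above.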

Phase 2: Variant Selection (Ranking)

Process: The best-performing model(s) identified in Phase 1 are applied to a large in silico library of variants (e.g., all single-point mutants).
Output: Variants are ranked based on the selected model's scores.
Strategy: Users can employ a "greedy Top K" selection or combine rankings from multiple high-performing models (e.g., one sequence-based, one structure-based) with expert domain knowledge to nominate candidates for experimental validation.
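
As a rough sketch of Phase 2, the snippet below enumerates all single-point mutants of a short sequence, takes a greedy Top K from one model, and combines two models by summing each variant's rank. The score tables are placeholder numbers standing in for real zero-shot model output, and rank summation is only one possible way to combine models; the summary does not specify how PRIZM merges rankings.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_point_mutants(wild_type: str) -> List[str]:
    """List every single substitution as e.g. 'K2A' (WT residue, 1-based position, new residue)."""
    return [
        f"{wt_res}{pos}{new_res}"
        for pos, wt_res in enumerate(wild_type, start=1)
        for new_res in AMINO_ACIDS
        if new_res != wt_res
    ]


def greedy_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k variants with the highest score from one model."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def rank_sum_top_k(score_tables: List[Dict[str, float]], k: int) -> List[str]:
    """Combine models by summing each variant's rank (1 = best) across them.

    Assumes every table scores the same set of variants.
    """
    total_rank: Dict[str, int] = {}
    for scores in score_tables:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, variant in enumerate(ordered, start=1):
            total_rank[variant] = total_rank.get(variant, 0) + rank
    return sorted(total_rank, key=total_rank.get)[:k]


# Toy three-residue "protein"; a real library would cover the full sequence.
library = single_point_mutants("MKV")

# Placeholder scores, NOT real model output: they only make the example runnable.
sequence_model = {
    m: AMINO_ACIDS.index(m[-1]) + 0.1 * int(m[1:-1]) for m in library
}
structure_model = {
    m: int(m[1:-1]) + 0.01 * AMINO_ACIDS.index(m[-1]) for m in library
}

print(len(library))
print(greedy_top_k(sequence_model, 3))
print(rank_sum_top_k([sequence_model, structure_model], 3))
```

In practice the shortlist would then be filtered with expert knowledge, as described above, before any variant goes to the bench.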

3. Key Contributions

Novel Framework: PRIZM bridges the gap between supervised and zero-shot learning by using small datasets to inform the selection of foundation models, eliminating the need for model retraining.
Data Efficiency: Demonstrates that reliable model selection is possible with as few as 20 labeled variants, with ~50 variants generally sufficient to reach performance plateaus.
Model Agnosticism: The workflow is not limited to a specific model architecture; it can incorporate any newly released zero-shot model, currently supporting 25 models from the ProteinGym suite.
Accessibility: Designed to be accessible to non-ML experts, requiring minimal computational setup and no specialized machine learning knowledge.
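
One way to get a feel for the data-efficiency claim is to subsample a larger labeled set and check how often a small subsample selects the same model as the full data. The sketch below does this on synthetic data with three made-up models; it uses only the absolute Spearman correlation, and it is not the authors' benchmark protocol.

```python
import random
from typing import Dict, List, Sequence


def spearman_no_ties(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation via the rank-difference formula (valid without ties)."""
    n = len(x)
    rank_x = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: x[i]))}
    rank_y = {i: r for r, i in enumerate(sorted(range(n), key=lambda i: y[i]))}
    d2 = sum((rank_x[i] - rank_y[i]) ** 2 for i in range(n))
    return 1 - 6 * d2 / (n * (n * n - 1))


def best_model(
    measured: Sequence[float],
    model_scores: Dict[str, Sequence[float]],
    idx: List[int],
) -> str:
    """Name of the model with the highest |Spearman| on the variants in idx."""
    sub_y = [measured[i] for i in idx]
    return max(
        model_scores,
        key=lambda name: abs(
            spearman_no_ties([model_scores[name][i] for i in idx], sub_y)
        ),
    )


def selection_agreement(
    measured: Sequence[float],
    model_scores: Dict[str, Sequence[float]],
    n: int,
    trials: int,
    seed: int = 0,
) -> float:
    """Fraction of random size-n subsamples that pick the full-data best model."""
    rng = random.Random(seed)
    everything = list(range(len(measured)))
    reference = best_model(measured, model_scores, everything)
    hits = sum(
        best_model(measured, model_scores, rng.sample(everything, n)) == reference
        for _ in range(trials)
    )
    return hits / trials


# Synthetic stand-in for a DMS dataset: 200 variants, three hypothetical models
# whose scores are the true value plus increasing amounts of noise.
data_rng = random.Random(1)
measured = [data_rng.gauss(0, 1) for _ in range(200)]
model_scores = {
    "good_model": [m + data_rng.gauss(0, 0.5) for m in measured],
    "weak_model": [m + data_rng.gauss(0, 3.0) for m in measured],
    "noise_model": [data_rng.gauss(0, 1) for _ in measured],
}

for n in (10, 20, 50):
    print(n, selection_agreement(measured, model_scores, n, trials=200))
```

How quickly the agreement rises with `n` depends on how far apart the models really are, so the numbers from this toy setup should not be read as a reproduction of the reported 20 and ~50 variant figures.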

4. Results

Benchmark Validation

Dataset: Validated across 10 diverse Deep Mutational Scanning (DMS) datasets covering properties like thermostability, enzyme activity, receptor binding, and fluorescence.
Performance: PRIZM consistently distinguished between high- and low-performing models with a large effect size (Cohen's $d > 0.5$) using only 20 variants.
Convergence: Using ~50 variants allowed PRIZM to identify models that performed nearly as well as the global best model on the full dataset.
Comparison: PRIZM outperformed the consensus approach by Hie et al. (which requires multiple models to agree) in 6 out of 10 benchmarks. Unlike consensus methods, PRIZM always provides a ranked library, avoiding the "no candidates" failure mode.
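
For readers unfamiliar with the effect-size measure mentioned above, Cohen's $d$ is the difference between two group means divided by their pooled standard deviation. The sketch below computes it for two made-up groups of model performance values; how the paper forms the "high" and "low" groups is not described in this summary, so the grouping and the numbers here are assumptions.

```python
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Hypothetical full-dataset Spearman values for models that a low-N selection
# step placed in the "high" group versus the "low" group.
high_group = [0.52, 0.48, 0.55, 0.50]
low_group = [0.30, 0.35, 0.28, 0.33]

d = cohens_d(high_group, low_group)
print(f"Cohen's d = {d:.2f}")
```

A positive $d$ means the first group has the higher mean; the larger the value, the less the two groups overlap.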

Case Study 1: Sucrose Synthase (GmSuSy) Thermostability

Context: Used an existing dataset of 68 variants from a previous rational engineering campaign.
Outcome: PRIZM identified Tranception No Retrieval, MIF-ST, and MSA Transformer as top models.
Validation: Selected two new variants (L731E and F468I).
- F468I showed a ~3.0°C increase in apparent melting temperature ($T_{m,app}$) and retained >60% activity at 60°C (vs. ~23% for WT).
- Achieved a 60% hit rate for variants with improved stability.

Case Study 2: Glycosyltransferase (TOGT1_1) Activity

Context: An extremely low-N setting using only 8 variants from a prior rational design campaign.
Outcome: VenusREM (integrating sequence, structure, and MSA) was selected as the top model.
Validation: Guided by VenusREM mutational landscapes and expert curation, 7 new variants were tested.
- Identified variants G401F and G401I with ~20% higher relative activity compared to WT.
- Achieved a 60% hit rate for improved activity.
- Notably, the best variants were located in coil-rich regions distant from the active site, which would likely be missed by traditional rational design.

5. Significance and Future Outlook

Paradigm Shift: PRIZM offers a "data-efficient" route to leverage foundation models, allowing researchers to repurpose existing low-throughput datasets for new design cycles without generating new training data.
Limitations & Future Work:
- Epistasis: Performance drops when predicting double mutants, as zero-shot models struggle with non-additive (epistatic) effects.
- Context-Specific Properties: The framework struggles with properties not encoded in evolutionary history (e.g., synthetic inhibitor resistance), as zero-shot models rely on natural sequence constraints.
- Integration: The authors suggest integrating PRIZM into supervised pipelines (e.g., using the selected model's embeddings as features for EVOLVEpro) or combining it with Bayesian optimization to handle uncertainty.
Impact: PRIZM democratizes protein engineering, enabling non-experts to utilize state-of-the-art foundation models effectively while providing experts with a systematic method for model selection.

Availability: All code, datasets, and documentation are publicly available via Zenodo and GitHub (PRIZM v1.1.1).