CombinGym: a benchmark platform for machine learning-assisted design of combinatorial protein variants

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a chef trying to create the perfect new recipe for a cake. You know the basic ingredients (flour, sugar, eggs), but you want to make it taste amazing.

If you only change one thing at a time—maybe a pinch more vanilla, or a little less sugar—you can easily figure out what works. This is like "single-mutant" protein engineering, which scientists have studied for years.

But what if you want to change five things at once? Maybe you swap the flour, change the sugar type, add a new spice, alter the baking temperature, and change the mixing speed? The number of possible combinations is astronomical. Worse, these changes interact in weird ways: adding more vanilla might be great unless you also change the sugar, in which case the cake tastes terrible. This complex web of interactions is called epistasis.

For a long time, scientists had a huge gap: they had great tools to predict how changing one ingredient works, but no good way to predict what happens when you change many at once.

Enter CombinGym.

What is CombinGym?

Think of CombinGym as a giant, high-tech training gym for Artificial Intelligence (AI) chefs.

Instead of a real gym with weights and treadmills, CombinGym is a digital playground filled with 14 different "workout routines" (datasets). These routines involve 9 different types of proteins (the "ingredients" of life), ranging from antibodies that fight viruses to enzymes that act like biological scissors, and glowing proteins that light up like fireflies.

The goal of this gym is to train AI models to become expert chefs who can predict the taste of a cake even if they've never baked that specific combination before.

How Does the Training Work?

The researchers didn't just throw random data at the AI. They set up a clever hierarchical training system, like a video game with increasing difficulty levels:

Level 0 (Zero-Shot): The AI has to guess the outcome of a complex recipe having never seen any data about this specific protein. It's like guessing how a cake tastes just by looking at the raw ingredients list.
Level 1 (1-vs-Rest): The AI is shown only recipes with one change (e.g., "What happens if we just add more vanilla?"). It then has to guess what happens if you change five things at once.
Level 2 & 3: The AI gets to see recipes with two or three changes before being tested on the super-complex ones.

The Big Discovery: The study found that if you train the AI on simple, single-change recipes first, it gets much better at predicting the complex, multi-change recipes. It's like learning to ride a bike with training wheels before trying to ride a unicycle on a tightrope. The simple lessons teach the AI how the ingredients "talk" to each other.

The "Noise" Problem

Real-world cooking is messy. Sometimes your scale is off, or the oven temperature fluctuates. In science, this is called measurement noise.

The researchers discovered that if the data the AI learns from is "noisy" (inaccurate), the AI gets confused and performs poorly. However, they found that cleaning up the data (normalizing it) and averaging out the errors made the AI chefs significantly smarter. It's the difference between trying to learn a recipe from a blurry, scribbled note versus a clear, high-definition photo.

The Results: From Simulation to Reality

The researchers didn't just stop at computer simulations. They put their best AI models to the test in the real world:

The Virtual Test: They used the AI to design a glowing protein (CreiLOV) that was brighter than anything nature had made. The AI successfully predicted which combinations of mutations would make it shine the brightest.
The Real-World Test: They used the AI to redesign an enzyme (RhlA) to produce a specific chemical more efficiently. The result? A massive increase in production yield, proving the AI wasn't just guessing; it was actually engineering better biology.

Why This Matters

Before CombinGym, trying to engineer complex proteins was like searching for a needle in a haystack while blindfolded. You'd have to test millions of combinations, which is expensive and slow.

CombinGym provides a standardized scoreboard (a leaderboard) where different AI models can compete. It tells scientists: "Hey, if you want to design a new drug, use Model A. If you want to make a better enzyme, use Model B."

It also acts as a community hub. Just like GitHub for code, CombinGym allows scientists worldwide to upload their own data, share their best models, and collectively build a smarter future for protein engineering.

The Bottom Line

CombinGym is the bridge between "guessing" and "knowing." It teaches AI how to understand the complex, chaotic dance of multiple mutations, turning the impossible task of designing life's building blocks into a solvable puzzle. By learning from simple changes, these AI models are now ready to help us engineer proteins that can cure diseases, clean our environment, and power our industries.

1. Problem Statement

Protein engineering relies on exploring the "sequence-function landscape" to create variants with enhanced properties. While machine learning (ML) has advanced significantly in predicting the effects of single mutations, a critical gap remains in combinatorial mutagenesis (simultaneous multiple mutations).

The Challenge: The interactions between amino acid residues (epistasis) create rugged, non-linear fitness landscapes. Predicting the function of higher-order mutants (e.g., double, triple, or higher) based on lower-order data is difficult.
Limitations of Existing Benchmarks: Current benchmarks (e.g., ProteinGym, FLIP) primarily focus on single-mutant libraries. They lack standardized datasets for combinatorial variants, often omit experimental validation, and fail to address how data noise and preprocessing affect model generalization to unseen, complex combinations.
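
To make the notion of epistasis concrete, here is a minimal sketch of the standard pairwise definition: the deviation of a double mutant from what its two single mutants would predict if their effects simply added up. The function name and the fitness values are hypothetical, and the sketch assumes fitness is measured on an additive (for example, log) scale; it is not taken from the CombinGym code.

```python
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the additive expectation.

    All inputs are fitness values on an additive (e.g. log) scale.
    A result of 0 means the two mutations act independently.
    """
    expected_ab = f_wt + (f_a - f_wt) + (f_b - f_wt)
    return f_ab - expected_ab


# Hypothetical fitness values, chosen only for illustration.
f_wt, f_a, f_b = 1.0, 1.5, 1.25

no_epistasis = pairwise_epistasis(f_wt, f_a, f_b, f_ab=1.75)
negative_epistasis = pairwise_epistasis(f_wt, f_a, f_b, f_ab=0.5)
print(no_epistasis, negative_epistasis)
```

A nonzero value is exactly the kind of interaction that makes higher-order mutants hard to predict from lower-order data.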

2. Methodology

The authors introduced CombinGym, a comprehensive benchmarking platform designed to evaluate ML models on combinatorial protein design.

A. Dataset Curation

Scope: 14 curated Deep Mutational Scanning (DMS) datasets spanning 9 proteins with diverse functions:
- Protein Binding: GB1 (IgG-binding domain of protein G); CR6261, CR9114 (antibodies).
- Fluorescence: CreiLOV (oxygen-independent), mTagBFP2/mKate2 (eqFP611 variants).
- Enzymatic Activity: SpCas9, SaCas9 (CRISPR nucleases), HIV-1 protease, RhlA (rhamnosyltransferase).
Scale: Over 400,000 characterized variants, including single, double, triple, and higher-order mutants.
Preprocessing:
- Normalization: Min-max normalization applied to ensure scale uniformity across different assay types (fluorescence, binding affinity, activity).
- Noise Handling: Biological replicates were averaged to mitigate measurement noise; specific strategies were tested for handling low-reproducibility data (e.g., Cas9 datasets).
- Structure Generation: AlphaFold3 was used to predict 3D structures for all targets to ensure consistent inputs for structure-based models, even when PDB files existed.
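
The two numeric preprocessing steps above (replicate averaging, then min-max normalization) can be sketched in a few lines. This is an illustrative sketch only: the function names, the dictionary layout, and the example readings are assumptions, and the actual CombinGym pipeline may differ in details such as how outliers or missing replicates are handled.

```python
from statistics import mean


def average_replicates(replicates):
    """Average replicate measurements per variant.

    `replicates` maps a variant name to a list of measured values.
    """
    return {variant: mean(values) for variant, values in replicates.items()}


def min_max_normalize(scores):
    """Rescale scores linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        raise ValueError("min-max normalization is undefined for constant scores")
    return {variant: (s - lo) / (hi - lo) for variant, s in scores.items()}


# Hypothetical raw readings for three variants, two replicates each.
raw = {"WT": [2.0, 4.0], "A10G": [5.0, 7.0], "A10G/L25F": [1.0, 1.0]}
normalized = min_max_normalize(average_replicates(raw))
print(normalized)
```

Averaging first and normalizing second means the 0 to 1 range is set by the denoised values rather than by a single noisy replicate.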

B. Benchmarking Framework

The platform evaluates 9 ML models across 5 methodological categories:

Alignment-based: EVmutation, DeepSequence (MSA-derived).
Protein Language Models (PLMs): ESM-1b, ESM-1v (Transformer-based).
Structure-based: GVP-Mut (Geometric Vector Perceptron).
Sequence-label: CNN, Ridge Regression, MAVE-NN.
Substitution-based: BLOSUM62.
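
For intuition about what a supervised model has to beat, consider a purely additive baseline: estimate each mutation's effect from its single-mutant score, then predict a combinatorial variant as the wild type plus the sum of those effects. This baseline is not one of the nine benchmarked models; it is a hypothetical reference point, and the function names, data layout, and scores below are assumptions made for illustration.

```python
def fit_single_effects(train, wt_score):
    """Estimate each mutation's effect as (single-mutant score - wild-type score).

    `train` is a list of (mutations, score) pairs; only single mutants are used.
    """
    return {muts[0]: score - wt_score for muts, score in train if len(muts) == 1}


def predict_additive(mutations, effects, wt_score):
    """Predict a variant's score as wild type plus the sum of its single effects.

    Mutations never observed as singles are treated as neutral (effect 0).
    """
    return wt_score + sum(effects.get(m, 0.0) for m in mutations)


# Hypothetical training data: wild type plus two single mutants.
wt_score = 1.0
train = [((), wt_score), (("A10G",), 1.5), (("L25F",), 0.75)]

effects = fit_single_effects(train, wt_score)
double_prediction = predict_additive(("A10G", "L25F"), effects, wt_score)
print(effects, double_prediction)
```

By construction this baseline ignores epistasis, so any gap between its predictions and measured values reflects interactions between mutations (or measurement noise).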

Evaluation Scenarios (Hierarchical Splits):
To test the ability to extrapolate from simple to complex data, the authors implemented four specific training/testing splits:

0-vs-rest (Zero-shot): No training data; models predict all mutants based on pre-training.
1-vs-rest: Trained on Wild-Type (WT) + Single mutants; tested on double and higher-order mutants.
2-vs-rest: Trained on WT + Single + Double mutants; tested on triple and higher-order mutants.
3-vs-rest: Trained on WT through triple mutants; tested on higher-order (>3) mutants.
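
The four splits above can be expressed as one function that partitions variants by mutation order. The following is a minimal sketch under assumed conventions: each variant is a (mutations, score) pair, the wild type is the empty tuple, and the function name is hypothetical rather than part of the CombinGym code.

```python
def n_vs_rest_split(variants, n):
    """Split variants by mutation order: train on order <= n, test on order > n.

    `variants` is a list of (mutations, score) pairs, where `mutations` is a
    tuple of mutation strings and the empty tuple denotes the wild type.
    For n == 0 (zero-shot) the training set is empty and all mutants are tested.
    """
    if n == 0:
        return [], [v for v in variants if len(v[0]) > 0]
    train = [v for v in variants if len(v[0]) <= n]
    test = [v for v in variants if len(v[0]) > n]
    return train, test


# Hypothetical variants with made-up scores.
variants = [
    ((), 1.0),
    (("A10G",), 1.2),
    (("L25F",), 0.9),
    (("A10G", "L25F"), 1.4),
    (("A10G", "L25F", "S40T"), 1.6),
]

train, test = n_vs_rest_split(variants, 1)
print(len(train), "training variants;", len(test), "test variants")
```

Because the test set never contains a mutation order seen in training, each split measures extrapolation to more complex variants rather than interpolation among similar ones.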

Metrics:

Spearman's $\rho$: Measures overall ranking correlation (predictive accuracy).
Normalized Discounted Cumulative Gain (NDCG): Measures the ability to identify the top-performing variants (critical for engineering applications).
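
Both metrics can be computed from a list of measured scores and a list of predicted scores. The sketch below uses only the standard library; it assumes linear gains with a log2 position discount for NDCG and non-negative measured scores (for example, after min-max normalization). The exact NDCG variant used by the benchmark is not specified here, so treat these choices as assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(y_true, y_pred):
    """Pearson correlation computed on the ranks of the two lists."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    mt, mp = sum(rt) / len(rt), sum(rp) / len(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / math.sqrt(var_t * var_p)


def ndcg(y_true, y_pred, k=None):
    """DCG of the predicted ordering divided by the DCG of the ideal ordering.

    Assumes non-negative measured scores that are not all zero.
    """
    k = len(y_true) if k is None else k

    def dcg(gains):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))

    by_pred = [t for _, t in sorted(zip(y_pred, y_true), key=lambda p: -p[0])]
    ideal = sorted(y_true, reverse=True)
    return dcg(by_pred) / dcg(ideal)


# Hypothetical measured and predicted scores for four variants.
y_true = [0.1, 0.4, 0.35, 0.8]
y_pred = [0.2, 0.3, 0.5, 0.9]
print(spearman_rho(y_true, y_pred), ndcg(y_true, y_pred))
```

Spearman's rho weighs every rank swap equally, whereas NDCG penalizes mistakes near the top of the list more heavily, which is why the two can disagree about which model is best for design.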

3. Key Contributions

First Combinatorial Benchmark: CombinGym is the first platform specifically designed to benchmark ML models on combinatorial mutagenesis rather than just single-point mutations.
Systematic Analysis of Confounders: The study rigorously quantifies how measurement noise and data normalization (log vs. min-max) impact model performance, providing guidelines for data preprocessing.
Hierarchical Validation Strategy: The "N-vs-rest" splitting strategy explicitly tests the "curse of dimensionality" in protein engineering, demonstrating how lower-order data informs higher-order predictions.
Integrated Validation: The platform bridges computation and experiment by validating predictions through:
- In silico simulation (CreiLOV).
- Wet-lab experimental validation (RhlA enzyme engineering).
Open Resource: An interactive website (combingym.org) hosts datasets, code, leaderboards, and offers integration with automated biofoundries for experimental validation.

4. Key Results

Model Performance

Supervised vs. Unsupervised: Supervised models generally outperformed unsupervised ones as higher-order mutants were added to the training data (moving from 1-vs-rest toward 3-vs-rest).
Top Performers:
- MAVE-NN and GVP-Mut achieved the highest overall Spearman's $\rho$ and NDCG scores.
- Ridge Regression and CNN also performed exceptionally well in design tasks.
Task Difficulty: Models performed best on Protein Binding, followed by Fluorescence, and worst on Enzymatic Activity (likely due to the complex, multi-step catalytic mechanisms of enzymes).
Zero-Shot Limitations: Unsupervised models (e.g., ESM-1b, EVmutation) showed variable performance in zero-shot scenarios, sometimes failing to predict specific phenotypes (e.g., negative correlations for certain antibody landscapes).

Impact of Data Factors

Noise: High measurement noise (e.g., in Cas9 datasets) significantly degraded model performance. Averaging biological replicates improved results.
Normalization: Min-max normalization was found essential. Combining log transformation with min-max normalization often yielded the best NDCG scores.
MSA Depth: For alignment-based models, MSA depth beyond a minimum threshold (10 sequences/L, i.e. ten aligned sequences per residue of the protein's length L) did not significantly improve performance, suggesting diminishing returns for deeper alignments in these specific contexts.

Validation Case Studies

CreiLOV (In Silico): Models trained on single/double mutants successfully predicted higher-order (4-15 mutations) variants. The top 384 predicted mutants were 98% brighter than wild-type, with some exceeding the best single/double mutants.
RhlA (Experimental): Using MAVE-NN trained on lower-order mutants, the team designed and synthesized higher-order variants. Experimental results confirmed a substantial increase in specific activity and substrate specificity, validating the platform's utility for real-world protein engineering.
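
The design step common to both case studies is a ranking problem: score candidate higher-order variants with a trained model, keep the top k, and then check how many of those beat the wild type. The sketch below illustrates that loop with hypothetical names and made-up numbers; `predicted` and `measured` stand in for a real model and a real assay, and nothing here reproduces the study's data.

```python
def select_top_k(candidates, predict, k):
    """Return the k candidates with the highest predicted score."""
    return sorted(candidates, key=predict, reverse=True)[:k]


def fraction_above(selected, measure, threshold):
    """Fraction of selected variants whose measured value exceeds `threshold`."""
    return sum(measure(v) > threshold for v in selected) / len(selected)


# Hypothetical model predictions and assay measurements for four candidates.
predicted = {"v1": 0.9, "v2": 0.2, "v3": 0.7, "v4": 0.5}
measured = {"v1": 1.4, "v2": 0.8, "v3": 0.9, "v4": 1.1}
wt_measured = 1.0

top2 = select_top_k(list(predicted), predicted.get, 2)
hit_rate = fraction_above(top2, measured.get, wt_measured)
print(top2, hit_rate)
```

In the in silico setting the "measurement" comes from held-out data; in the wet-lab setting it comes from synthesizing and assaying the selected variants.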

5. Significance

Accelerating Protein Engineering: CombinGym provides a standardized "testbed" to identify which ML models are robust enough for designing complex combinatorial libraries, reducing the trial-and-error cost in the lab.
Understanding Epistasis: By demonstrating that lower-order mutant data (single/double) can effectively train models to predict higher-order mutants, the study offers a practical strategy to navigate vast sequence spaces without needing to screen every possible combination.
Community Standard: The platform establishes a new standard for evaluating protein design algorithms, moving the field beyond single-mutation benchmarks toward the more challenging and practically relevant domain of combinatorial design.
Closed-Loop Workflow: The integration with automated biofoundries creates a feedback loop where computational predictions are rapidly validated experimentally, and new data is fed back into the benchmark, fostering continuous model improvement.