Effects of protein interface mutations on protein quality and affinity

This study introduces an experimental framework using deep mutational scanning and control binders to disentangle intrinsic protein-quality effects from specific protein-interaction affinities in antibody-antigen binding, revealing that current computational models largely fail to distinguish these factors and highlighting the need to account for protein quality in next-generation affinity prediction.

de Kanter, J. K., Smorodina, E., Minnegalieva, A., Arts, M., Blaabjerg, L. M., Frolenkova, M., Rawat, P., Wolfram, L., Britze, H., Wilke, Y., Weissenborn, L., Lindenburg, L., Engelhart, E., McGowan, K. L., Emerson, R., Lopez, R., van Bemmel, J. G., Demharter, S., Spreafico, R., Greiff, V.

Published 2026-03-26

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: The "Broken Toy" Problem

Imagine you have a custom Lego key (an antigen) that fits into a specific Lego lock (an antibody). You want to know exactly what happens to the fit when you swap out individual Lego pieces on the key.

However, there's a catch. When you swap out a Lego piece, two things can go wrong:

  1. The Shape is Wrong: The new piece doesn't fit the lock anymore because it's the wrong shape or size. (This is Protein-Interaction).
  2. The Toy Breaks: The new piece is so weird that the whole Lego key falls apart, crumbles, or refuses to be built in the first place. (This is Protein-Quality).

The Problem: In the past, scientists only measured how well the key fit the lock. But if the key fell apart (a Protein-Quality issue), it looked exactly like a key that didn't fit. Scientists couldn't tell whether the key was the wrong shape or simply broken, which made it very hard to teach computers how to design better keys.

What This Paper Did: The "Double-Check" System

The researchers created a clever experiment to separate these two problems. They used a system with two different locks (two different antibodies) that the same key (the antigen) fits into, with each lock grabbing onto a different part of the key.

  1. The Main Lock: This is the one they really care about.
  2. The Control Lock: This one grabs a totally different part of the key.

The Logic:

  • If you change a piece on the key and both locks stop working, it means the key itself is broken or falling apart. That's a Protein-Quality issue.
  • If you change a piece and only the Main Lock stops working, but the Control Lock still works fine, it means the key is still sturdy, but you specifically broke the connection for the Main Lock. That's a Protein-Interaction issue.

By using this "Control Lock," they could look at thousands of mutations and sort them into two piles: "Broken Keys" (Quality) and "Wrong Shapes" (Interaction).
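The sorting rule above can be sketched as a small decision function. This is a minimal illustration of the logic, not the paper's actual pipeline: the field names, signal scale, and threshold are all hypothetical.

```python
# Hypothetical sketch of the "control lock" logic. Binding values are
# fractions of wild-type signal (1.0 = unchanged); the 0.5 loss
# threshold is illustrative, not taken from the paper.

def classify_mutation(main_binding: float, control_binding: float,
                      loss_threshold: float = 0.5) -> str:
    """Classify a mutation from the binding retained by each antibody."""
    main_lost = main_binding < loss_threshold
    control_lost = control_binding < loss_threshold
    if main_lost and control_lost:
        # Both antibodies fail: the antigen itself is compromised.
        return "protein-quality"
    if main_lost:
        # Only the antibody of interest fails: its epitope changed.
        return "protein-interaction"
    if control_lost:
        # Only the control fails: a change at the control's epitope.
        return "control-interaction"
    return "neutral"

print(classify_mutation(0.1, 0.2))   # prints "protein-quality"
print(classify_mutation(0.1, 0.9))   # prints "protein-interaction"
print(classify_mutation(0.9, 0.95))  # prints "neutral"
```

In practice a real analysis would work with continuous effect sizes and measurement noise rather than a hard threshold, but the core idea is the same: the control binder turns one ambiguous signal into two separable ones.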

The Big Discovery: Most "Broken Keys" Are Just Badly Made

When they sorted the data, they found a surprising truth:

  • Most mutations (about 80-90%) didn't actually change the shape of the binding interface. Instead, they just made the protein unstable, causing it to fold incorrectly or not be produced by the cell at all.
  • Very few mutations actually changed the specific "handshake" between the antibody and the antigen.

The Analogy: Imagine you are trying to fix a car engine. You try 100 different parts. You find that 90 of them fail because the part was made of cheap plastic and melted (Quality). Only 10 failed because they were the wrong size for the bolt (Interaction).

The Computer Model Problem

Scientists have been training powerful AI models (like ESM-IF1 and ThermoMPNN) to predict how mutations affect binding. They fed these models huge datasets of "how well the key fits."

The Result: The AI models were actually quite smart, but they were learning the wrong lesson.

  • The models were excellent at predicting which mutations would make the protein break or melt (Protein-Quality).
  • They were terrible at predicting which mutations would change the specific handshake (Protein-Interaction).

It's like training a chef to cook by giving them a list of meals that were burned. The chef gets really good at knowing why food burns (too much heat, bad ingredients), but they never learn how to make the food taste better. The AI models are great at spotting "broken" proteins, but they can't yet design the perfect "handshake."

Why This Matters

The paper concludes that to build the next generation of "super-AI" that can design perfect medicines and antibodies, we need to stop feeding it mixed-up data.

We need to give the AI data that has already been cleaned up—data where we know exactly which mutations broke the protein and which ones just changed the fit. Only then can the AI learn the true secrets of molecular recognition and design truly effective drugs.

Summary in One Sentence

This paper shows that most mutations break proteins by making them unstable (like a crumbling toy), not by changing how they fit together, and because current AI models mostly learn to spot the "crumbling," we need new, cleaner data to teach them how to design the perfect "fit."
