Fitness translocation: improving variant effect prediction with biologically-grounded data augmentation

This paper introduces "fitness translocation," a data augmentation strategy that leverages variant fitness data from homologous proteins to generate synthetic training examples in embedding space, thereby significantly improving the accuracy of protein variant effect prediction models, particularly when training data is scarce.

Mialland, A., Fukunaga, S., Katsuki, R., Dong, Y., Yamaguchi, H., Saito, Y.

Published 2026-03-25

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to teach a robot how to bake the perfect cake. You want the robot to understand how changing a single ingredient (like swapping sugar for honey) affects the taste.

The problem? You only have a tiny notebook with recipes for one specific cake. You've tried a few variations, but you haven't tested thousands of possibilities because baking is expensive and time-consuming. If you ask the robot to guess how a new recipe will taste, it will likely fail because it hasn't seen enough examples.

This is the exact problem scientists face with proteins. Proteins are the "machines" inside our bodies and in nature. Scientists want to design new proteins (for medicine, biofuels, etc.), but they can't test every possible version in a lab. There are too many combinations.

This paper introduces a clever trick called "Fitness Translocation" to solve this data shortage. Here is how it works, using simple analogies:

1. The Problem: The "Empty Cookbook"

In protein engineering, "fitness" just means "how well the protein works."

  • The Goal: Predict how a protein will work if you change its code (its amino acid sequence).
  • The Hurdle: We only have experimental data for a tiny fraction of possible proteins. It's like trying to learn a language by reading only three sentences. Machine learning models need far more examples to generalize.

2. The Solution: Borrowing from Cousins

Nature is full of "cousin" proteins. These are proteins that evolved from the same ancestor. They might look different on the outside (different sequences), but they often do the same job and have similar internal structures.

  • The Analogy: Imagine you want to learn how to drive a Toyota (your target protein), but you have very few driving lessons. However, you have a friend who has driven thousands of miles in a Honda (a homologous protein).
  • The Insight: Even though the Toyota and Honda are different, the physics of turning a wheel, braking, or accelerating is similar. If your friend says, "Turning the wheel 10 degrees left makes the Honda drift slightly," you can use that logic to guess what happens in the Toyota.

3. How "Fitness Translocation" Works

The authors created a method to mathematically "transfer" the lessons learned from the Honda (the cousin protein) to the Toyota (the target protein).

Here is the step-by-step process:

  • Step A: The Digital Map (Embeddings)
    Instead of looking at the raw letters of the protein code, the scientists use an AI (called a Protein Language Model) to translate every protein into a coordinate on a map.

    • Analogy: Think of the Wild-Type (normal) protein as a house located at [0,0] on a map.
    • A mutation (a change in the protein) moves the house to a new spot, say [+5, -2].
  • Step B: Measuring the Shift
    They look at the "Honda" (the cousin protein). They see that a specific mutation moved the Honda's house from [0,0] to [+5, -2]. This movement is called an "offset."

  • Step C: The Translocation (The Magic Trick)
    They take that exact same movement (+5, -2) and apply it to the Toyota's house.

    • They don't actually change the Toyota's physical parts yet. They just create a synthetic (fake) data point that says: "If we make this change in the Toyota, it should behave like the Honda did."
    • They attach the "fitness score" (how well the Honda worked) to this new synthetic Toyota data point.
  • Step D: Training the Robot
    Now, instead of training the model on just 100 real examples from the Toyota, they train it on those 100 real examples plus 1,000 synthetic examples borrowed from the Honda. The model learns much faster and predicts much better.
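The four steps above can be sketched numerically. This is a minimal illustration, not the authors' implementation: real embeddings would come from a protein language model (e.g. ESM), while here every vector is a small random stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy embedding dimension; real PLM embeddings are much larger

# Step A: wild-type embeddings for the target ("Toyota") and homolog ("Honda")
target_wt = rng.normal(size=dim)
homolog_wt = rng.normal(size=dim)

# Measured homolog variants: (variant embedding, experimental fitness score).
# Both are fabricated here purely for illustration.
homolog_variants = [
    (homolog_wt + rng.normal(scale=0.1, size=dim), float(rng.uniform()))
    for _ in range(5)
]

synthetic = []
for emb, fitness in homolog_variants:
    # Step B: the "offset" — how far the mutation moved the homolog on the map
    offset = emb - homolog_wt
    # Step C: translocate that same shift onto the target's wild-type,
    # and attach the homolog's fitness score to the synthetic point
    synthetic.append((target_wt + offset, fitness))

# Step D: these synthetic (embedding, fitness) pairs would be pooled with
# any real target-protein measurements to train a supervised predictor
print(len(synthetic))  # prints 5
```

The key design choice is that nothing is done to the raw sequences: the transfer happens entirely in embedding space, where the offset induced by a mutation is assumed to be roughly portable between homologs.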

4. Why This is a Big Deal

The paper tested this on three different types of proteins (enzymes, fluorescent "glowing" proteins, and a viral spike protein).

  • The Result: It worked! The models predicted protein behavior much better, especially when they didn't have much real data to start with.
  • The "Cousin" Distance: It even worked when the "cousin" proteins were quite different (only 35% similar). This is surprising because usually, you'd think they need to be 90% identical to share lessons. But because the physics of how they work is conserved, the "driving lessons" still transferred.

5. The "Homolog Selection" Filter

You might ask: "What if I borrow from a cousin that is too different? Will that confuse the robot?"
Yes, it might. The authors also built a smart filter (an algorithm) that acts like a curator. It tests different cousins and only keeps the ones that actually help improve the prediction, discarding the ones that add noise.

Summary

Fitness Translocation is like taking a library of driving manuals from every car brand in the world, translating the physics of "how a turn affects a car" into a universal language, and using those lessons to teach a robot how to drive a specific car you haven't even built yet.

This allows scientists to design better proteins for medicine and industry without having to run millions of expensive, slow experiments in a lab. It turns "data scarcity" into "data abundance" by using the wisdom of evolution.
