Imagine you are trying to teach a robot to understand the "language of life." This language isn't made of words like "apple" or "run," but of proteins—the tiny, complex machines that build every living thing, from bacteria to blue whales.
Scientists have built "Protein Language Models" (pLMs) that are incredibly good at reading these protein sequences. They can predict how a protein will fold, what it does, and even help design new medicines.
But here's the problem: The library of protein knowledge is never finished. Every year, biologists discover millions of new proteins and fix errors in old ones. The database is like a living, breathing encyclopedia that changes every single day.
If you want your robot to stay smart, you usually have to stop it, wipe its brain, and re-teach it everything from scratch using the newest data. This is slow, expensive, and wasteful.
Enter CoPeP: The "Time-Traveling" Protein Teacher.
This paper introduces a new benchmark called CoPeP (Continual Pretraining of Protein Language Models). Think of it as a training ground to see if we can teach our robot to learn continuously, like a human student, rather than restarting school every year.
The Core Idea: Learning from History
The authors realized that the history of the protein database is actually a secret cheat code.
- The "Keepers": If a protein sequence has been in the database for 10 years, it's probably a real, important protein. It's like a classic song that has been on the radio for decades.
- The "Outsiders": If a sequence was added last year and removed this year, it was probably a mistake or a "fake" protein. It's like a typo in a book that the editor caught and fixed.
The CoPeP benchmark tests if AI models can use this "temporal metadata" (history) to learn better. Instead of just memorizing the latest list, the model learns to trust the "veteran" proteins and ignore the "transient" ones.
The Experiment: A 10-Year Time Jump
The researchers set up a simulation where they trained a model year-by-year, from 2015 to 2024.
- The Old Way (Naive): Just feed the model the new data for 2024 and hope it doesn't forget 2015. (This usually fails; the robot forgets old lessons).
- The New Ways (Continual Learning): They tried six different "study techniques" to see which one helped the robot learn best without forgetting. Three of the most interesting:
- Time-Travel Replay: The robot is allowed to peek at old notes, but it prioritizes the notes that have been around the longest (the "Keepers").
- The "Hare and Tortoise": The robot has two brains. One learns fast (the Hare), and one learns slowly (the Tortoise). The slow brain acts as a safety net to prevent the fast brain from going crazy.
- The "Eraser": If the robot learns something wrong (a protein that was later deleted), this method actively forces the robot to "unlearn" that mistake.
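The "Time-Travel Replay" idea above — letting the model revisit old data, but favoring sequences that have survived in the database the longest — can be sketched in a few lines. This is only an illustration: the `age_years` weighting scheme and the function name are assumptions for the example, not the benchmark's exact recipe.

```python
import random

def sample_replay_batch(sequences, ages_years, batch_size, rng=None):
    """Sample old sequences for replay, favoring long-lived "Keeper" entries.

    sequences:  protein sequences seen in earlier training years
    ages_years: how many years each sequence has survived in the database
    """
    rng = rng or random.Random()
    # Weight each sequence by its age: a 10-year veteran is ten times more
    # likely to be replayed than a sequence added last year.
    # (Illustrative weighting, not the paper's exact scheme.)
    return rng.choices(sequences, weights=ages_years, k=batch_size)

old_data = ["MKTAYIAK", "MVLSPADK", "GSHMTEYK"]  # toy sequences
ages = [10, 1, 6]                                # years in the database
batch = sample_replay_batch(old_data, ages, batch_size=2,
                            rng=random.Random(0))
```

The replay batch would then be mixed into each year's fresh training data, so the "veteran" proteins keep reinforcing what the model already knows.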
The Results: Why It Matters
The findings were surprising and exciting:
History is Gold: The model that used the "Time-Travel Replay" technique (focusing on proteins that stayed in the database) learned the "natural language" of proteins better than any other method. It was 7% more accurate than just training on all the data at once.
- Analogy: It's like learning a language by reading only the books that have survived for 50 years, rather than reading every pamphlet ever printed, including the ones with typos.
Different Tools for Different Jobs:
- If you want the robot to understand the general flow of protein language, "Replay" is the best teacher.
- If you want the robot to predict specific mutations (like "what happens if we change this one letter?"), methods like "Hare and Tortoise" or "Gradient Ascent" (the "Eraser" unlearning technique) work better.
- If you want the robot to solve general biology puzzles, "Shrink and Perturb" (a method that shakes the robot's brain to keep it flexible) works best.
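The "Shrink and Perturb" trick mentioned above has a simple form: between training phases, every weight is scaled down and a little random noise is added, which restores the network's flexibility for new data. A minimal NumPy sketch — the shrink factor and noise scale here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def shrink_and_perturb(weights, shrink=0.8, noise_std=0.01, rng=None):
    """Shrink each weight array toward zero and add Gaussian noise.

    Applied between training rounds, this "shakes the brain" so the
    network stays plastic enough to absorb a new year of data.
    Hyperparameters are illustrative, not tuned values.
    """
    rng = rng or np.random.default_rng()
    return [shrink * w + noise_std * rng.standard_normal(w.shape)
            for w in weights]

layers = [np.ones((4, 4)), np.ones(4)]  # toy weight arrays
new_layers = shrink_and_perturb(layers, rng=np.random.default_rng(0))
```

Because the noise is small relative to the shrink step, the model keeps most of what it learned while regaining the ability to learn something new.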
The Big Picture
This paper proves that we don't need to burn down the school and rebuild it every time new data arrives. By using continual learning, we can keep our AI models up to date and efficient while they keep getting smarter.
In simple terms: CoPeP shows us how to build a protein AI that grows up with us, learning from the past while adapting to the future, making the discovery of new drugs faster, cheaper, and more sustainable. It turns the chaotic, ever-changing world of biology into a structured classroom where the AI never stops learning.