What comes after de novo? Automated lead optimization of proteins with CRADLE-1

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master chef who has just discovered a new, delicious recipe for a soup. It tastes good, but it's not quite perfect yet. Maybe it's a bit too salty, it spoils too quickly in the fridge, or it doesn't hold up well when you try to ship it across the country.

Lead optimization is the process of tweaking that recipe until it's perfect for the real world. In the pharmaceutical and biotech world, this "soup" is a protein (like an antibody, an enzyme, or a vaccine), and the "ingredients" are its amino acids, the building blocks whose order determines its structure.

Traditionally, fixing this recipe has been like trying to find the perfect soup by guessing. A human chef would say, "Let's add a pinch more salt," test it, taste it, and then maybe try adding a little pepper next time. This is slow, expensive, and often requires hundreds of failed attempts before you get it right.

Enter CRADLE-1.

Think of CRADLE-1 as an AI-powered sous-chef that has read every cookbook in the universe and can taste a soup just by looking at the list of ingredients. It doesn't just guess; it learns from every single test you run.

Here is how CRADLE-1 works, broken down into simple steps:

1. The "Read-Only" Library (Pre-training)

First, the AI reads millions of existing protein "recipes" from nature. It learns the basic rules of how proteins are built, kind of like a child learning that "eggs usually go in pancakes" or "salt makes things taste better." This gives it a strong foundation.

2. The "Taste Test" Loop (The Workflow)

The real magic happens in a cycle called Design-Build-Test-Learn. Imagine a conveyor belt in a factory:

Design (The Brain): The AI looks at your "imperfect" soup (the template protein) and says, "If we swap this ingredient for that one, and tweak these three others, we might get a soup that is both less salty and lasts longer in the fridge." It generates hundreds of new, slightly different recipes instantly.
Build (The Kitchen): A robot arm (or a lab technician) quickly cooks up these new recipes in tiny test tubes (96-well plates).
Test (The Tasting): The lab runs tests to see how these new soups perform. Do they stick to the target virus? Do they survive heat? Do they taste good (bind well)?
Learn (The Feedback): This is the secret sauce. The AI takes the results from the lab and says, "Okay, the ones with 'Ingredient X' worked great, but 'Ingredient Y' made it bitter. I'll remember that." It updates its internal brain to be smarter for the next round.

3. Why It's a Game-Changer

The paper reports that CRADLE-1 reaches its goals 4 to 7 times faster than the old "human chef" method, measured by the number of lab rounds needed.

Old Way: It can take one to three years and millions of dollars to tweak a protein, working through many rounds of changes.
CRADLE-1 Way: It can do the same job in a fraction of the time, often needing only a few rounds of testing.

Real-World Examples from the Paper

The authors didn't just talk about soup; they cooked up real solutions for complex problems:

The Snake Venom Antidote: They created a "nanobody" (a tiny antibody) that binds three different snake venom neurotoxins at once, while also staying stable at high temperatures.
The Virus Fighter: They tweaked a protein to fight both the original SARS-CoV-2 virus and its "Omicron" cousin, making it stronger and more heat-resistant.
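The Industrial Enzyme: They took an enzyme and made it much more heat-tolerant (its melting temperature rose by about 20°C) while roughly doubling how much of it can be produced, all without losing its activity. In a separate enzyme project, they boosted catalytic activity about 40-fold in three rounds.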

The "Black Box" Superpower

One of the coolest things about CRADLE-1 is that it doesn't need to know why something works.

Old Way: Scientists had to understand the complex chemistry and 3D shape of the protein to know what to change.
CRADLE-1 Way: It treats the protein like a "black box." You tell it, "Here is the input (the sequence), and here is the output (the test result)." It figures out the pattern without needing to understand the deep physics behind it. It's like a driverless car that learns to drive by watching millions of miles of video, without needing to understand the internal combustion engine.

The Bottom Line

CRADLE-1 is an automated, self-improving system that turns the slow, expensive, and frustrating process of drug and protein development into a fast, efficient, and reliable assembly line. It allows scientists to stop guessing and start engineering, turning "good enough" proteins into life-saving medicines and industrial tools much faster than ever before.

In short: It's the difference between trying to fix a car by randomly swapping parts and using a super-intelligent mechanic who knows exactly which part to change to make the engine run perfectly.

1. Problem Statement

Lead optimization is identified as the most time-consuming and expensive phase of pre-clinical drug discovery, typically requiring 12–36 months and costing \$5M–\$15M per candidate. The process involves iteratively refining a "lead" molecule (e.g., an antibody, enzyme, or vaccine) to simultaneously optimize multiple properties (binding affinity, thermostability, expression, immunogenicity, etc.).

Current approaches rely heavily on human-in-the-loop rational design or traditional Design-Build-Test-Learn (DBTL) cycles. These methods are often heuristic, slow, and struggle to navigate the complex, multi-dimensional trade-offs required to satisfy a "Target Product Profile" (TPP). While de novo binder design has seen significant advances, the specific problem of optimizing existing leads across diverse protein modalities remains a bottleneck.

2. Methodology: The CRADLE-1 System

CRADLE-1 is an automated, machine learning-driven framework designed to accelerate lead optimization. It operates as a closed-loop system that consumes wet-lab data and returns optimized protein sequences.

Core Architecture

The system integrates Protein Language Models (PLMs) with a multi-stage workflow:

Pre-training: Utilizes foundational PLMs trained on large-scale protein databases (e.g., UniRef).
Fine-tuning (The "Evotuned" Model):
- Unsupervised (Evotuning): The foundation model is fine-tuned on the evolutionary neighborhood (Multiple Sequence Alignment) of the specific template sequence using masked language modeling. This captures evolutionary constraints.
- Supervised (Logiter & Predictor):
  - Logiter: Fine-tuned via Group Direct Preference Optimization (g-DPO) on preference pairs (e.g., "variant A is better than variant B"). This model learns to rank sequences based on desired properties without needing explicit regression targets.
  - Predictor: A regression head added to the model to directly predict quantitative properties (e.g., $T_m$, $K_D$) from sequence data.
Generation & Sampling:
- Beam Search: The system performs a beam search over mutations to generate candidates.
- Double Beam Strategy: Maintains two parallel beams: one for currently accepted sequences and a "backup" beam for sequences that may become optimal at higher "temperatures" (exploration vs. exploitation).
- Diversity-Aware Ranking: Ensures the generated library covers a diverse functional landscape rather than converging prematurely.
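
To make the generation step more concrete, here is a minimal Python sketch of a beam search over substitutions followed by a diversity-aware selection. It is an illustration under stated assumptions, not the CRADLE-1 implementation: the scoring function `toy_score`, the hidden `TARGET` sequence, the Hamming-distance threshold, and all function names are hypothetical, and the "double beam" strategy described above is omitted for brevity.

```python
from typing import Callable, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def single_mutants(seq: str) -> List[str]:
    """All sequences exactly one substitution away from `seq`."""
    return [
        seq[:i] + aa + seq[i + 1:]
        for i in range(len(seq))
        for aa in AMINO_ACIDS
        if aa != seq[i]
    ]


def beam_search(
    template: str,
    score: Callable[[str], float],
    beam_width: int = 8,
    max_mutations: int = 3,
) -> List[Tuple[float, str]]:
    """Add one substitution per step, keeping only the top-scoring sequences."""
    beam = [template]
    seen = {template}
    pool: List[Tuple[float, str]] = []
    for _ in range(max_mutations):
        candidates = {
            child
            for parent in beam
            for child in single_mutants(parent)
            if child not in seen
        }
        seen |= candidates
        scored = sorted(((score(c), c) for c in candidates), reverse=True)
        beam = [seq for _, seq in scored[:beam_width]]
        pool.extend(scored[:beam_width])
    return sorted(pool, reverse=True)


def select_diverse(
    ranked: List[Tuple[float, str]], library_size: int, min_distance: int = 2
) -> List[str]:
    """Greedily pick high scorers that differ enough from those already chosen."""
    chosen: List[str] = []
    for _, seq in ranked:
        if all(hamming(seq, c) >= min_distance for c in chosen):
            chosen.append(seq)
        if len(chosen) == library_size:
            break
    return chosen


# Hypothetical stand-in for a learned model: counts matches to a hidden target.
TARGET = "MKWAYLAR"


def toy_score(seq: str) -> float:
    return float(sum(a == b for a, b in zip(seq, TARGET)))


ranked = beam_search("MKTAYIAK", toy_score)
library = select_diverse(ranked, library_size=4)
print(ranked[0], library)
```

In a real campaign the score would come from the fine-tuned models rather than a fixed target, and the library size would match the plate format (for example 96 variants).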

Workflow

The system follows an iterative Design-Build-Test-Learn cycle:

Input: A template sequence and, optionally, wet-lab sequence-function data.
Design: CRADLE-1 generates a library of variants (typically 96 candidates per round).
Build/Test: Variants are synthesized and tested in the wet lab (e.g., binding assays, thermostability, expression).
Learn: The new data is fed back into the system to update the Logiter and Predictor models for the next round.
Automation: The entire pipeline is automated, capable of running with "black box" data consumption (no need to understand underlying biochemical mechanisms).
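
The cycle above can be sketched as a short Python loop. This is a toy under loud assumptions: `wet_lab_assay` simulates the lab with a made-up hidden optimum, `design` proposes random single substitutions instead of querying a trained model, and the "Learn" step only builds the (better, worse) preference pairs that a preference-tuned model would consume. None of these names or choices come from the paper.

```python
import random
from itertools import combinations
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def wet_lab_assay(seq: str, hidden_optimum: str = "MKWAYLAR") -> float:
    """Build/Test stand-in: fraction of positions matching a hidden optimum."""
    return sum(a == b for a, b in zip(seq, hidden_optimum)) / len(seq)


def design(parent: str, n_variants: int, rng: random.Random) -> List[str]:
    """Design stand-in: random single-substitution variants of the best sequence."""
    variants = set()
    while len(variants) < n_variants:
        i = rng.randrange(len(parent))
        aa = rng.choice(AMINO_ACIDS)
        if aa != parent[i]:
            variants.add(parent[:i] + aa + parent[i + 1:])
    return sorted(variants)


def preference_pairs(data: Dict[str, float]) -> List[Tuple[str, str]]:
    """Learn (input prep): (better, worse) pairs from all measured sequences."""
    return [
        (a, b) if data[a] > data[b] else (b, a)
        for a, b in combinations(sorted(data), 2)
        if data[a] != data[b]
    ]


def run_campaign(
    template: str, rounds: int = 3, plate_size: int = 96, seed: int = 0
) -> Tuple[Dict[str, float], List[Tuple[float, int]]]:
    rng = random.Random(seed)
    data = {template: wet_lab_assay(template)}  # Input: template plus its measurement
    history = []
    for _ in range(rounds):
        best = max(data, key=data.get)
        library = design(best, plate_size, rng)   # Design
        for seq in library:                       # Build/Test
            data[seq] = wet_lab_assay(seq)
        pairs = preference_pairs(data)            # Learn: data a model would train on
        history.append((max(data.values()), len(pairs)))
    return data, history


data, history = run_campaign("MKTAYIAK")
for round_number, (best_score, n_pairs) in enumerate(history, start=1):
    print(round_number, best_score, n_pairs)
```

The point of the sketch is the data flow: every round adds measurements to a single growing dataset, and the next design step starts from everything learned so far.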

3. Key Contributions

Speed: CRADLE-1 achieves 4–7× faster lead optimization compared to human-in-the-loop rational design, measured by the number of wet-lab rounds required.
Multi-Property Optimization: Successfully optimizes 1–6 properties simultaneously (up to 8 in private benchmarks) across diverse modalities, including VHHs, scFvs, IgGs, peptides, enzymes, CRISPR systems, and vaccines.
Data Efficiency: The system is effective even with zero-shot starts (no prior wet-lab data beyond the template) and robust against noisy wet-lab data with batch effects. It requires as little as a single 96-well plate (approx. 85–96 sequences) per round for successful learning.
Generalizability: Demonstrated success across a wide range of protein types and properties, including binding to transmembrane proteins, polyspecific binding, and enzymatic activity.
Black-Box Capability: The system can optimize proteins without requiring structural data or knowledge of the target's mechanism, relying solely on sequence-function pairs.

4. Key Results

The paper presents results from dozens of campaigns (commercial and internal), highlighting specific successes:

scFv (EGFR): Optimized framework regions of an anti-EGFR scFv. Achieved binding down to 339 pM (winning the Adaptyv Bio competition), significantly outperforming the template (6.64 nM) and the second-place entry (5.18 nM).
VHH (SARS-CoV-2): Simultaneously optimized binding to Wild Type and Omicron, thermostability, and expression. Achieved 186 pM binding to WT, 11.4 nM to Omicron, and a melting temperature ($T_m$) of 70.9°C (vs. 58.5°C template).
VHH (Snake Venom): Engineered a polyspecific VHH binding three distinct neurotoxins with <100 pM affinity, 76.7°C $T_m$, and improved expression.
Enzymes (Haloalkane Dehalogenase): Increased $T_m$ by 20.0°C (to 65.1°C) and expression by 2.06× while preserving activity.
Enzymes (P450): Improved catalytic activity by 40.6× in 3 rounds, compared to a 17.9× improvement over 8 rounds using rational design.
IgG: Successfully optimized an IgG for potency, aggregation, immunogenicity, and cell binding, delivering 10 successful candidates where previous human-led efforts failed.
CRISPR: Improved on-target editing activity from <25% to 68% and reduced off-target effects significantly in 3 rounds.
Peptides: Achieved a 50% success rate in meeting tight, simultaneous constraints (potency, specificity, expression, stability) for late-stage projects that were previously dead-ends.
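
Some of the figures above are easier to compare after a little arithmetic. The Python snippet below converts the scFv affinities into a fold improvement and expresses the P450 gains as an average gain per round. The per-round figure assumes that improvements compound multiplicatively from round to round, which is an assumption made here for illustration and not something the source states.

```python
def fold_improvement_kd(kd_before_nm: float, kd_after_nm: float) -> float:
    """Fold improvement in affinity (a lower K_D means tighter binding)."""
    return kd_before_nm / kd_after_nm


def per_round_gain(total_fold: float, rounds: int) -> float:
    """Geometric-mean gain per round, assuming gains compound multiplicatively."""
    return total_fold ** (1 / rounds)


# scFv (EGFR): template 6.64 nM vs. optimized 339 pM (= 0.339 nM)
scfv_fold = fold_improvement_kd(6.64, 0.339)

# P450: 40.6x in 3 rounds (CRADLE-1) vs. 17.9x in 8 rounds (rational design)
cradle_gain = per_round_gain(40.6, 3)
rational_gain = per_round_gain(17.9, 8)

print(round(scfv_fold, 1))      # 19.6
print(round(cradle_gain, 2))    # 3.44
print(round(rational_gain, 2))  # 1.43
```

Under that assumption, each CRADLE-1 round in the P450 campaign multiplied activity by roughly 3.4, against roughly 1.4 per round for the rational-design effort.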

Baselines Comparison:
In comparative studies against open-source ESM-2 pipelines and traditional CDR-scan/stacking methods, CRADLE-1 consistently pushed the Pareto frontier (the optimal trade-off between conflicting properties) further and faster. Baseline methods often failed to maintain trajectory in later rounds or produced sequences with poor expression.
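
For readers unfamiliar with the term, a Pareto frontier can be computed in a few lines of Python. The sketch below keeps every variant that no other variant beats or matches on all properties at once. The property values are invented purely for illustration (they are not data from the paper), and both axes are oriented so that higher is better.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p is at least as good as q everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def pareto_front(points: List[Point]) -> List[Point]:
    """Points that are not dominated by any other point (higher is better)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up variants: (melting temperature in °C, -log10 of K_D in molar)
variants = [(60.0, 8.0), (70.0, 9.5), (65.0, 10.0), (62.0, 9.0), (75.0, 7.5)]
front = pareto_front(variants)
print(front)  # [(70.0, 9.5), (65.0, 10.0), (75.0, 7.5)]
```

"Pushing the frontier" then means that a new round yields variants that dominate points on the previous front.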

5. Significance and Implications

Economic Impact: By reducing the time and cost of lead optimization, CRADLE-1 addresses a major cost driver in drug development, potentially enabling the pursuit of "harder" targets or rare diseases.
Organizational Shift: The high reliability (>90% success rate) and speed allow organizations to tolerate higher risk in other areas of drug discovery. It also enables better capital allocation by shortening timelines to fit within budget cycles.
Scientific Paradigm: The work demonstrates that sequence-function data can largely supersede structural data for optimization tasks. It validates the use of foundation models fine-tuned with preference optimization (g-DPO) for complex, multi-objective engineering problems.
Automation: CRADLE-1 represents a shift from heuristic, human-guided design to fully automated, data-driven protein engineering, capable of operating with minimal human intervention.

Conclusion

CRADLE-1 establishes a new standard for protein lead optimization. By leveraging foundation models, preference optimization, and automated wet-lab integration, it delivers a 4–7× speedup in development cycles while successfully optimizing complex, multi-property profiles across diverse protein modalities. The system's ability to learn from limited data and operate as a "black box" makes it a powerful tool for accelerating the discovery of therapeutics, industrial enzymes, and gene-editing tools.