This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are trying to find a specific key that fits into a very complex, shape-shifting lock. In the world of medicine, the "lock" is a protein in a virus (like the SARS-CoV-2 virus), and the "key" is a drug molecule designed to stop the virus from working.
For decades, scientists have used two main ways to find these keys:
- The "Physics" Method (Docking): This is like trying to force keys into the lock by calculating how well their teeth fit the grooves based on the laws of physics. It's fast and can test millions of keys, but it's often a bit clumsy and misses the subtle ways the lock might wiggle to accept a key.
- The "AI" Method (Co-folding): This is a new, super-smart AI that has "studied" a huge library of pictures of locks and keys (known protein structures). It tries to guess what the lock and key look like when they are holding hands. It's incredibly good at guessing shapes, but we didn't know if it was just memorizing the pictures or actually understanding how the lock works.
This paper is a massive "stress test" for these new AI methods. The researchers gathered 557 brand-new keys (drug molecules) that were designed after the AI had finished its training. They wanted to see: Did the AI actually learn the rules, or did it just cheat by memorizing the answers?
Here is the breakdown of what they found, using some everyday analogies:
1. The "Shape" Test: Can the AI guess where the key goes?
The Setup: They took 557 new drug molecules and asked three different AIs (AlphaFold3, Chai-1, and Boltz-2) to predict exactly how they would sit inside the virus protein. They compared the AI's guess to the real, crystal-clear photo of the molecule sitting in the protein.
The Result: The AI was amazing.
- The Analogy: Imagine a blindfolded chef trying to guess how a new, weirdly shaped vegetable fits into a specific slot in a kitchen drawer. The AI got the position right more than 50% of the time (and often much better), even for shapes it had never seen before.
- The Surprise: The old "Physics" method (Docking) was actually worse at getting the shape right. The AI was like a master sculptor who could visualize the fit perfectly, while the physics method was like a carpenter trying to hammer the piece in.
The Catch: The AI was great at the position, but it sometimes failed to notice that the lock (the protein) was stretching or twisting to let the key in. It was like the AI knew exactly where the key went, but didn't realize the doorframe had moved to let it through.
2. The "Confidence" Test: Does the AI know when it's right?
The Setup: When you ask an AI a question, it usually gives you a "confidence score" (e.g., "I'm 90% sure"). The researchers asked: Does a high confidence score mean the drug will actually work?
The Result:
- For Shape: Yes! If the AI said, "I'm super confident this fits," it usually did fit perfectly.
- For Strength (Potency): This is where it got tricky. One AI (Boltz-2) was surprisingly good at guessing how strong the drug would be (like guessing if a key turns the lock easily or with a struggle). It was better than the old physics methods at this.
- The Analogy: The AI is like a weather forecaster. It's great at saying, "It will rain tomorrow" (predicting the shape). It's okay at saying, "It will rain hard" (predicting strength). But it's not perfect.
3. The "Needle in a Haystack" Test: Can the AI find the good keys in a pile of junk?
The Setup: In real drug discovery, you don't just have 500 good keys; you have billions of keys, and 99.9% of them are junk (fake keys that look good but don't work). The researchers took lists of "top candidates" found by the old physics method and asked the AI to re-rank them to separate the real winners from the fakes.
The Result: The AI struggled.
- The Analogy: Imagine the physics method is a fast metal detector that scans a beach and finds 1,000 shiny objects. The AI is a slow, expensive expert who looks at those 1,000 objects. You'd expect the expert to be better at spotting the real gold.
- What happened: The expert (AI) was actually worse at spotting the gold than the metal detector (physics) in this specific scenario. The AI got confused by the sheer variety of "junk" keys. It seemed to rely too much on patterns it had seen before, and when faced with totally new, weird junk, it couldn't tell the difference between a real drug and a fake one.
The Big Takeaway: Teamwork is Key
The paper concludes that neither method is perfect on its own. They are like two different tools in a toolbox:
- The Physics Method (Docking) is the Speedster. It's fast, cheap, and great for scanning millions of options to find a shortlist of "maybe" candidates.
- The AI Method (Co-folding) is the Refiner. It's slower and more expensive, but once you have a shortlist, it's incredible at figuring out the exact 3D shape and reasonably good at estimating how strong the bond will be.
The Verdict:
Don't throw away the old tools. Instead, use the fast physics method to narrow down the billions of options, and then use the smart AI to fine-tune the best candidates. Together, they can help us design better drugs faster than ever before.
In short: The new AI is a brilliant artist who can draw the perfect picture of a drug fitting into a virus, but it still needs the old-school detective to help it find the right suspects in a crowd of millions.