CA-DEL: An Open Multi-Target, Multi-Modal Benchmark for… — Plain-Language Explanation

Original authors: Mutian He, Hanqun Cao, Cheng Tan, Zijun Gao, Xiaojun Yao, Chunbin Gu, Pheng-Ann Heng

Published 2026-05-11



Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a treasure hunter trying to find a specific type of gold nugget hidden inside a massive, chaotic pile of rocks. This is essentially what drug discovery looks like: scientists need to find a single molecule that sticks perfectly to a disease-causing protein, out of billions of possibilities.

This paper introduces a new tool called CA-DEL, which is like a "training gym" for computer programs (AI) to learn how to find these gold nuggets more accurately. Here is a simple breakdown of what they did and why it matters, using everyday analogies.

1. The Problem: The "Noisy" Signal

In modern drug discovery, scientists use a technology called DNA-Encoded Libraries (DEL). Think of this as a giant library where every book (molecule) has a tiny barcode (DNA) attached to it.

The Process: They dump billions of these books into a pool to see which ones stick to a target protein.
The Catch: To count how many books stuck, they use a machine that reads the barcodes. However, this reading isn't perfect. It's like trying to count people in a crowded stadium by listening to the noise they make. The signal is noisy. Some books stick because they are sticky, not because they are the "right" fit. Some stick because of static electricity (impurities).
The Goal: The AI needs to ignore the static and figure out which books are actually the best fits.

2. The New Benchmark: CA-DEL

Previously, there wasn't a good "test" to see if AI programs were actually getting smarter or just memorizing the noise. The authors created CA-DEL, a new benchmark with three special features:

The "Twin" Challenge (Selectivity):
Imagine you are looking for a key that fits a specific lock. The problem is, there are three locks that look almost identical (called Carbonic Anhydrase isoforms: CAII, CAIX, and CAXII). They are so similar that a key might fit all three, but you only want the one that fits the "cancer" lock (CAIX/CAXII) and not the "healthy body" lock (CAII).
- The Test: Can the AI tell the difference between these nearly identical twins? Most previous tests didn't focus on this fine-grained difference.
The "Sim-to-Real" Gap (The Hard Mode):
This is the most distinctive part of the paper.
- Training Phase: The AI is trained on the "noisy" library data (the stadium noise).
- Testing Phase: The AI is tested on "perfect" data from a different source (ChEMBL), which contains precise measurements of how tightly molecules stick (called $K_i$).
- The Analogy: It's like training a student on a noisy, blurry textbook, and then testing them on a crystal-clear, high-definition exam. If the student passes, it means they actually learned the principles of the subject, not just the blurry pictures.
3D Vision:
Molecules aren't flat; they are 3D shapes. Previous AI models often looked at molecules like flat drawings (2D). CA-DEL forces the AI to look at the molecules in 3D, like rotating a puzzle piece to see how it fits.

3. What They Found (The Results)

The authors ran various AI models through this new gym to see who performed best.

Simple Models Failed: Models that just looked at basic properties (like "does it have a benzene ring?" or "how heavy is it?") were terrible. They couldn't handle the noise or the 3D complexity.
Old School Docking Failed: Traditional methods that try to mathematically fit the pieces together without learning from data also struggled.
3D Deep Learning Won: The winners were advanced AI models that used 3D structures (specifically models named DEL-Dock and DEL-Ranking).
- These models were much better at ignoring the noise and finding the true "gold nuggets."
- They were also better at picking the top candidates (the "Top-N" hits), which is what matters most in real drug discovery.

4. The Limitations: The "Uncanny Valley" of AI

Even the best models had trouble in a specific scenario called Zero-Shot Generalization.

The Scenario: The AI was trained on one specific 3D shape of a protein (like a photo of a person smiling). Then, it was tested on a slightly different shape of the same protein (the same person frowning) or a very similar protein (the person's twin).
The Result: The AI often got confused. It struggled to realize that the "frowning" version was still the same person, or that the "twin" was different enough to require a different key.
The Takeaway: Current AI is still too sensitive to tiny changes in how the protein looks. It hasn't fully learned the underlying "physics" of how molecules stick; it's still relying too much on the specific patterns of the training data.

Summary

CA-DEL is a new, tougher test for AI in drug discovery. It forces computers to learn from messy, real-world screening data and prove they can find the right drug candidates by understanding 3D shapes and telling apart very similar proteins.

The paper concludes that while 3D-aware AI is much better than older methods, it still needs to get better at handling different shapes of the same protein, and closely related proteins, before it can be relied on to design selective drugs.

Technical Summary: CA-DEL Benchmark for DNA-Encoded Library Screens

Problem Statement
The application of Machine Learning (ML) in drug discovery is fundamentally constrained by the nature of DNA-Encoded Library (DEL) data. While DEL technology enables the screening of billions of molecules, the primary experimental output—sequencing read counts—is an indirect, noisy proxy for true molecular binding affinity. This signal is confounded by non-specific binding, synthesis impurities, and PCR amplification biases. Furthermore, existing public benchmarks for DEL analysis suffer from critical limitations: they often rely solely on noisy enrichment scores without ground-truth validation, lack fine-grained 3D structural information, and fail to address the specific challenge of predicting selectivity among highly homologous protein isoforms. Consequently, there is a scarcity of robust benchmarks to evaluate models capable of denoising these signals and generalizing from high-throughput screening data to lead-optimized compounds.
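To make the "read counts as a noisy proxy" point concrete, here is a minimal sketch of one common way to turn raw counts into an enrichment score: normalize a compound's count in the target selection and in a no-protein control by sequencing depth, then take the ratio with a pseudocount. The function name, the pseudocount, and the toy numbers are illustrative assumptions; this summary does not give the paper's exact enrichment definition.

```python
def enrichment(target_count, control_count, target_total, control_total,
               pseudocount=1.0):
    """Depth-normalized ratio of target reads to control reads.

    Assumed definition for illustration only. The pseudocount keeps the
    ratio finite when a compound has zero control reads.
    """
    target_freq = (target_count + pseudocount) / target_total
    control_freq = (control_count + pseudocount) / control_total
    return target_freq / control_freq


# Toy read counts (hypothetical), one million reads per selection.
depth = 1_000_000
specific = enrichment(50, 2, depth, depth)    # about 17: enriched on target only
sticky = enrichment(50, 40, depth, depth)     # about 1.24: binds everywhere
low_count = enrichment(1, 0, depth, depth)    # about 2: a single read doubles the score
print(specific, sticky, low_count)
```

The third case shows why low counts are treacherous: one stray read is enough to make a compound look twice as enriched as background, which is the kind of noise a model trained on these scores has to see through.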

Methodology and Dataset Construction
To address these gaps, the authors introduce CA-DEL, a multi-dimensional, open benchmark dataset focused on Carbonic Anhydrase (CA) isoforms. The dataset construction involves three primary components:

Multi-Target Selectivity Challenge: The benchmark targets three human CA isoforms: CAII (a ubiquitous anti-target), and CAIX and CAXII (cancer-specific targets). Because the three isoforms share high active-site homology, models must learn fine-grained structural features to distinguish isoform-specific binding, which mirrors the clinical need for selective inhibitors.
Sim-to-Real Evaluation Paradigm: The dataset is structured to bridge the gap between screening and validation.
- Training Sets: Derived from two distinct DEL libraries (CAS-DEL and DOS-DEL-1), containing approximately 360,000 compounds with enrichment factor data.
- Test Sets: Sourced entirely from ChEMBL, comprising drug-like molecules with precise, experimentally determined binding affinities ($K_i$).
- Distributional Shift: A deliberate Out-of-Distribution (OOD) challenge is engineered where the training data (combinatorial libraries) and test data (ChEMBL) occupy distinct, non-overlapping regions of chemical space, differing in physicochemical properties (e.g., QED, molecular weight).
Multi-Modal Representation Generation: To incorporate 3D structural context, the authors established a pipeline generating protein-ligand conformations. Using high-resolution PDB structures and SMINA for docking, they generated ensembles of up to nine plausible binding poses for each ligand-protein pair. This approach decouples model performance from the unreliability of single docking scores by providing robust discrete approximations of the conformational posterior distribution.
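The pose-ensemble idea can be sketched as follows: rather than trusting one docked pose, a model scores every pose of a ligand and pools the results. The softmax-weighted average below is one simple pooling choice, picked here for illustration; it is not claimed to be what DEL-Dock or DEL-Ranking do. The per-pose scores, the `temperature` parameter, and the convention that a higher score means stronger predicted binding are all assumptions.

```python
import math

MAX_POSES = 9  # the summary states "up to nine" poses per ligand-protein pair


def pool_pose_scores(pose_scores, temperature=1.0):
    """Softmax-weighted average of per-pose scores for one ligand.

    Poses with higher scores get more weight, but no single pose decides
    the outcome. Subtracting the maximum keeps exp() numerically stable.
    """
    if not pose_scores:
        raise ValueError("need at least one pose")
    if len(pose_scores) > MAX_POSES:
        raise ValueError("more poses than the ensemble size")
    top = max(pose_scores)
    weights = [math.exp((s - top) / temperature) for s in pose_scores]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, pose_scores)) / total


# Hypothetical per-pose scores for one ligand with four docked poses.
scores = [0.2, 1.5, 1.4, -0.3]
pooled = pool_pose_scores(scores)
print(pooled)  # lies between the plain mean and the best single pose
```

A large `temperature` moves the result toward the plain mean of the poses, while a small one moves it toward the single best pose, so the parameter controls how much the ensemble is trusted over the top-scoring pose.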

Key Contributions
The paper outlines four primary contributions:

Multi-Target Selectivity Benchmark: A specialized testbed for evaluating models on highly homologous CA isoforms, addressing the challenge of selectivity in conserved catalytic mechanisms.
Multi-Modal Molecular Representations: Integration of traditional 2D topology with systematically generated 3D protein-ligand conformations for over 200,000 compounds, enabling large-scale evaluation of geometric deep learning.
Ground-Truth Validated OOD Challenge: A task design that tests generalization from noisy DEL enrichment signals to precise ChEMBL $K_i$ values, mirroring real-world hit-to-lead optimization.
Practical Evaluation Metrics: Introduction of Top-N hit rate analysis to assess model utility in resource-constrained discovery, moving beyond traditional correlation metrics to measure the percentage of true hits among top-ranked compounds.
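A Top-N hit rate of the kind described above can be sketched in a few lines: rank compounds by predicted score, keep the first N, and count how many are true hits. The 100 nM potency cutoff used to define a "hit", the function name, and the toy data are assumptions made for illustration; the paper's own hit definition is not stated in this summary.

```python
def top_n_hit_rate(predicted_scores, ki_nm, n, ki_threshold_nm=100.0):
    """Fraction of the n highest-scored compounds with K_i at or below the cutoff.

    Assumes a higher predicted score means a stronger predicted binder and
    that K_i values are given in nanomolar.
    """
    if len(predicted_scores) != len(ki_nm):
        raise ValueError("scores and K_i values must align")
    if not 0 < n <= len(predicted_scores):
        raise ValueError("n must be between 1 and the number of compounds")
    ranked = sorted(range(len(predicted_scores)),
                    key=lambda i: predicted_scores[i], reverse=True)
    hits = sum(1 for i in ranked[:n] if ki_nm[i] <= ki_threshold_nm)
    return hits / n


# Hypothetical model scores and measured K_i values (nM) for four compounds.
scores = [0.9, 0.8, 0.1, 0.7]
ki = [5.0, 500.0, 20.0, 50.0]
print(top_n_hit_rate(scores, ki, n=2))  # 0.5: one of the top two is a hit
```

Unlike a correlation over the whole test set, this number only depends on the head of the ranking, which matches a setting where only a handful of compounds can be synthesized and assayed.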

Results and Analysis
The authors evaluated a range of models, from physicochemical baselines to advanced deep learning architectures, using Spearman's rank correlation ($\rho$) and Top-N hit rates.

Performance Hierarchy: Results demonstrate a clear hierarchy where 3D-aware deep learning models (specifically DEL-Dock and DEL-Ranking) significantly outperform 2D-based models and classical docking methods. On the critical SubSp metric (correlation with true $K_i$), 3D models achieved values as strong as -0.323, whereas 2D baselines and classical docking showed near-zero or positive correlations. A more negative value is better here: a lower $K_i$ means tighter binding, so a useful model's scores should fall as $K_i$ rises.
Denoising Capability: The success of 3D models suggests they capture biophysical interaction features rather than only library-specific statistical artifacts, which helps denoise the DEL signals.
Top-N Enrichment: In practical Top-N hit rate analysis, 3D models consistently dominated the top rankings, successfully concentrating potent chemical matter at the head of the list, whereas 2D models showed sharp performance degradation.
Generalization Limits: Zero-shot experiments revealed significant limitations. Models trained on one crystal structure or target isoform struggled to generalize to unseen conformations or homologous targets (e.g., CAIX to CAXII), often resulting in inverted rankings. This indicates current 3D models remain sensitive to conformational variations and may overfit to specific rigid states or library-specific features.
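For readers who want to see how a rank correlation like the one reported above is computed, here is a minimal pure-Python sketch of Spearman's $\rho$ with average ranks for ties. The toy scores and $K_i$ values are made up, and the convention that a higher model score means a stronger predicted binder is an assumption; under that convention a good model yields a negative $\rho$ against $K_i$, and an "inverted ranking" shows up as a positive one.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical model scores and measured K_i values (nM).
scores = [3.0, 2.0, 1.0]
ki = [1.0, 10.0, 100.0]
print(spearman_rho(scores, ki))       # -1.0: best-scored compound binds tightest
print(spearman_rho(scores, ki[::-1])) # 1.0: a fully inverted ranking
```

Because only ranks matter, the metric is indifferent to the scale of the model's output, which is convenient when the model was trained on enrichment scores but is evaluated against $K_i$.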

Significance and Claims
The paper claims that CA-DEL provides a comprehensive platform for developing and evaluating 3D-aware, selectivity-focused machine learning models. By establishing a rigorous "Sim-to-Real" evaluation paradigm, the benchmark validates the hypothesis that explicit geometric modeling is essential for distilling true binding signals from noisy DEL data. The authors posit that while current state-of-the-art approaches show promise, the observed limitations in zero-shot generalization and fine-grained selectivity highlight the need for future research into physics-informed architectures, uncertainty quantification, and explainable AI to fully realize the potential of rational drug design. The work aims to catalyze advances in geometric deep learning for structure-based drug design by providing a standardized, multi-modal resource.

CA-DEL: An Open Multi-Target, Multi-Modal Benchmark for Learning from DNA-Encoded Library Screens