Imagine you are a master architect trying to design a custom key that fits perfectly into a specific lock. For decades, the best architects (AI models) have been trained to build keys for deep, cave-like locks. These caves are easy to work with because the walls surround the key on all sides, giving the architect clear boundaries to follow. The AI learns to "snuggle" the key into these deep holes, creating a tight, secure fit.

However, in the real world of medicine, many of the most dangerous "locks" (disease targets like KRAS and MYC) aren't deep caves at all. They are flat, open surfaces, like a tabletop or a smooth wall. These are the so-called "undruggable" targets, which have historically resisted conventional drug design.

This paper introduces a new testing ground called ShallowBench to see how well our AI architects can design keys for these flat surfaces.

The Problem: The "Flat Surface" Struggle

The authors found that current AI models are like architects who have only ever built keys for caves. When you ask them to design a key for a flat table:

They get lost: Without deep walls to guide them, the AI doesn't know where to place the key. The key might just "float" in the air above the table instead of sticking to it.
They make mistakes: The AI struggles to hold the key together properly, sometimes creating shapes that don't make chemical sense.
They lose their grip: Even when they try, the key doesn't stick as well as it does in a cave.

How They Built the Test (ShallowBench)

To prove this, the researchers needed a fair test. They couldn't just use the old datasets because those were full of deep caves. So, they created a new dataset called ShallowBench from a massive library of 166,500 protein structures.

They used a clever "volume measurement" trick to find the flat ones:

Imagine placing a clear, domed lid over a protein surface.
They calculated the space inside the lid versus the space taken up by the protein atoms themselves.
If the difference (the "empty space" under the lid) was small, it meant the surface was flat and shallow.
They filtered out the deep caves and kept 5,780 flat targets that still had enough surface area to hold a drug.

They then split this into a "training" set and a "testing" set, making sure the AI couldn't cheat by memorizing similar proteins.

The Results: The AI Stumbles

The researchers tested three top-tier AI models on this new flat-surface test. Here is what happened:

The "Cave" Models Failed: Every single model performed worse on the flat surfaces than on the deep caves. Their predicted ability to "stick" to the target dropped significantly.
The "Broken Key" Problem: One model (TargetDiff) hugged the flat surface more closely than the others but produced chemically broken keys more often: about one in five of its molecules failed basic chemistry checks. In trying to fit the shape, it lost track of the rules of chemistry.
The "Valid but Loose" Problem: Another model (DiffSBDD) made chemically valid keys almost every time, but their shapes barely matched the flat surface. It was like cutting a well-made key that has nothing to do with the lock in front of it.
The "Scoring" Model: A third model (SimpleSBDD) did the best at sticking, but its results suggest it behaves less like a designer building new keys from scratch and more like a filter that scores drug-like candidates and keeps the ones that happen to fit okay.

The Takeaway

The paper concludes that while AI is amazing at designing drugs for deep, cave-like pockets, it performs consistently worse on flat surfaces.

The authors suggest that to fix this, we can't just keep training the same way. We need to:

Teach the AI differently: Show it more examples of flat surfaces during training.
Change the rules: Create new "loss functions" (rules the AI tries to minimize) that punish it for letting the key "float" away from the flat surface.
Build new tools: Maybe the AI needs to learn to look at the whole protein landscape, not just the immediate hole, to understand how to anchor a drug to a flat wall.

In short: Our drug-design AI is a great cave explorer, but it's currently terrible at building on flat ground. ShallowBench is the map that shows us exactly where it's failing, so we can build better tools to tackle the "undruggable" diseases.

Technical Summary: ShallowBench

Problem Statement

Generative AI models have achieved significant success in structure-based drug design (SBDD), particularly in generating chemically valid, high-affinity ligands for targets with deep, structurally defined binding pockets. These deep cavities provide clear geometric constraints and extensive surface area for van der Waals interactions, effectively anchoring generated coordinates. However, a critical vulnerability remains: these models struggle to generate effective ligands for shallow or intrinsically disordered protein surfaces.

Many high-priority therapeutic targets, such as the oncology targets KRAS and MYC, lack traditional high-concavity binding pockets. Ligands attempting to bind these flat interfaces face increased competition from bulk solvent, lack well-defined structural enclosures, and suffer from sparse contact areas. Furthermore, standard benchmark datasets (e.g., CrossDocked2020, PDBbind) are dominated by deep-pocket targets, leading to a training and evaluation bias where models learn skewed distributions. Consequently, the performance degradation of state-of-the-art SBDD models on flat surfaces remains unquantified, hindering the development of architectures capable of addressing these "undruggable" targets.

Methodology

Dataset Curation: ShallowBench

To address the lack of a dedicated benchmark for shallow targets, the authors curated ShallowBench, a dataset of 5,780 shallow-pocket targets extracted from the CrossDocked2020 dataset. The curation process involved a rigorous two-step volumetric approach to isolate interfaces with low concavity while ensuring sufficient surface area for binding:

Volume Calculation: For each protein-ligand complex, the local binding environment was defined by extracting protein atoms within an 8.0Å radius of the native ligand's center of mass (COM).
Concavity Metric: The authors defined concavity as the difference between two volumes:
- $V_{atom}$ : The volume strictly occupied by protein atoms, mapped to a 3D voxel grid (1.0Å voxel size).
- $V_{lid}$ : A bounding volume generated by an Alpha Shape mesh (using $\alpha = 0.15$ ) acting as a simulated "lid" over the interface.
- Formula: $\text{Concavity} = V_{lid} - V_{atom}$ .
Filtering Criteria: Targets were selected if they met a strict upper bound of Concavity < 500.0 Å³ and a lower bound of Surface Area > 50.0 Å² (calculated using a 2.0Å voxel grid). This process reduced the initial 166,500 targets to 5,780.
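The curation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the small mass table, the 1.7 Å atom radius, the rule that a voxel counts as occupied when its centre falls inside an atom sphere, and all function names are hypothetical. The lid volume and surface area are taken as inputs, since the Alpha Shape meshing is outside the scope of this sketch.

```python
import math

# Approximate atomic masses (assumption: only these elements occur in the ligand).
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def center_of_mass(atoms):
    """atoms: list of (element, (x, y, z)). Returns the mass-weighted centroid."""
    total = sum(ATOMIC_MASS[el] for el, _ in atoms)
    return tuple(
        sum(ATOMIC_MASS[el] * xyz[i] for el, xyz in atoms) / total
        for i in range(3)
    )


def local_environment(protein_xyz, ligand_com, radius=8.0):
    """Keep protein atom coordinates within `radius` angstroms of the ligand COM."""
    return [p for p in protein_xyz if math.dist(p, ligand_com) <= radius]


def occupied_volume(atom_xyz, atom_radius=1.7, voxel=1.0):
    """Estimate V_atom by counting voxels whose centre lies inside an atom sphere.

    atom_radius=1.7 (roughly carbon's van der Waals radius) is an assumption.
    """
    occupied = set()
    reach = math.ceil(atom_radius / voxel) + 1
    for x, y, z in atom_xyz:
        ix, iy, iz = (math.floor(c / voxel) for c in (x, y, z))
        for i in range(ix - reach, ix + reach + 1):
            for j in range(iy - reach, iy + reach + 1):
                for k in range(iz - reach, iz + reach + 1):
                    centre = ((i + 0.5) * voxel, (j + 0.5) * voxel, (k + 0.5) * voxel)
                    if math.dist(centre, (x, y, z)) <= atom_radius:
                        occupied.add((i, j, k))
    return len(occupied) * voxel ** 3


def is_shallow(v_lid, v_atom, surface_area, max_concavity=500.0, min_area=50.0):
    """Apply both thresholds: Concavity < 500 cubic angstroms, area > 50 square angstroms."""
    concavity = v_lid - v_atom
    return concavity < max_concavity and surface_area > min_area
```

For example, a target with a lid volume of 1200 Å³, an occupied volume of 800 Å³, and a surface area of 120 Å² has a concavity of 400 Å³ and passes both thresholds, while the same target with a 1500 Å³ lid (concavity 700 Å³) is rejected as too deep.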

Data Splitting and Control

Train/Test Split: To prevent data leakage, the dataset was split based on 30% sequence identity clustering. This resulted in 4,995 training targets and 785 test targets, ensuring no test target shares significant homology with the training set.
Control Dataset: A structurally diverse control set of 5,780 targets was created from CrossDocked2020 using round-robin stratified sampling. This set mirrors the size of ShallowBench but represents the standard distribution of deep-pocket targets, allowing for a one-to-one comparative evaluation.
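Both procedures can be illustrated with a short Python sketch. It assumes cluster assignments at 30% sequence identity have already been computed by an external tool, and it treats the strata used for the control set as an arbitrary grouping, since the summary does not say what the strata were. The function names and the greedy fill strategy are hypothetical, not the authors' implementation.

```python
import random
from collections import defaultdict


def cluster_split(cluster_of, test_fraction=785 / 5780, seed=0):
    """Split targets so that no sequence cluster spans both train and test.

    cluster_of: dict mapping target id -> cluster id (assumed precomputed).
    The default fraction matches the reported 785 test targets out of 5,780.
    """
    members = defaultdict(list)
    for target, cluster in cluster_of.items():
        members[cluster].append(target)

    clusters = sorted(members)  # fixed order before shuffling, for reproducibility
    random.Random(seed).shuffle(clusters)

    test_goal = round(test_fraction * len(cluster_of))
    train, test = [], []
    for cluster in clusters:
        # Whole clusters go to one side; fill the test set while it still fits.
        if len(test) + len(members[cluster]) <= test_goal:
            test.extend(members[cluster])
        else:
            train.extend(members[cluster])
    return train, test


def round_robin_sample(strata, n):
    """Draw up to n items by cycling over strata, one item per stratum per pass.

    strata: dict mapping a stratum label (hypothetical, e.g. a protein family)
    to a list of candidate targets.
    """
    queues = [list(items) for items in strata.values()]
    picked, depth = [], 0
    while len(picked) < n and any(depth < len(q) for q in queues):
        for q in queues:
            if depth < len(q) and len(picked) < n:
                picked.append(q[depth])
        depth += 1
    return picked
```

Assigning whole clusters to one side is what prevents homology leakage: two proteins above the identity threshold always share a cluster, so they can never end up on opposite sides of the split. Cycling over strata keeps any single large group from dominating the control set.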

Evaluation Framework

The authors evaluated three state-of-the-art generative SBDD models—DiffSBDD, SimpleSBDD, and TargetDiff—on both the ShallowBench and control datasets without fine-tuning. The evaluation utilized the following metrics:

Chemical Validity: Proportion of molecules passing valence, aromaticity, and sanitization checks (RDKit).
Mean QED: Quantitative Estimate of Druglikeness.
Vina Affinity: Predicted binding energy (kcal/mol) via AutoDock Vina.
Shape Complementarity (Sc): Geometric fit between ligand and protein (SCASA algorithm), serving as a negative control to confirm the success of the shallow surface filtering.

Key Results

The evaluation revealed a systematic decline in model performance when transitioning from deep-pocket controls to shallow-pocket targets:

Systematic Decline in Binding Affinity: All models exhibited weaker (less negative) predicted binding affinities on ShallowBench. For instance, TargetDiff's mean Vina affinity weakened from -7.33 kcal/mol (control) to -5.26 kcal/mol (ShallowBench). SimpleSBDD's similarly weakened from -7.52 to -6.47 kcal/mol.
Trade-offs in SimpleSBDD: SimpleSBDD achieved the strongest Vina affinities across both datasets but at the cost of lower Shape Complementarity (Sc) and chemical validity (~85%). Its performance suggests it functions more as a scoring filter for drug-like libraries than as a pure de novo 3D generator.
DiffSBDD's Weak Conditional Signal: While DiffSBDD maintained high chemical validity (~98%), it produced molecules with near-zero Shape Complementarity and low QED scores (~0.25). This indicates a weak conditional signal, where the model generates valid molecules that fail to anchor effectively to the shallow pocket geometry.
Degradation of TargetDiff's Validity: TargetDiff showed a sharp decline in chemical validity, dropping from 87.51% on the control dataset to 79.71% on ShallowBench. This suggests the diffusion process struggles to enforce fundamental molecular assembly when 3D protein constraints are sparse.
Shape Complementarity Anomaly: While DiffSBDD and SimpleSBDD dropped to near-zero Sc on shallow surfaces, TargetDiff maintained a relatively high Sc (0.6088). This implies TargetDiff attempts to conform more aggressively to surface topology, even if it compromises chemical validity.
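To make the size of these shifts explicit, the reported control and ShallowBench values can be differenced directly. The snippet below uses only the numbers quoted in this summary; the variable and function names are illustrative.

```python
# (control, ShallowBench) values as reported in the summary.
vina = {
    "TargetDiff": (-7.33, -5.26),
    "SimpleSBDD": (-7.52, -6.47),
}
targetdiff_validity = (87.51, 79.71)  # percent of chemically valid molecules


def shift(control, shallow):
    """ShallowBench value minus control value, rounded to two decimals."""
    return round(shallow - control, 2)


vina_shift = {model: shift(*pair) for model, pair in vina.items()}
validity_shift = shift(*targetdiff_validity)

for model, delta in vina_shift.items():
    # A positive shift means a less negative Vina score, i.e. weaker predicted binding.
    print(f"{model}: Vina shift {delta:+.2f} kcal/mol")
print(f"TargetDiff: validity shift {validity_shift:+.2f} percentage points")
```

The shifts come out to +2.07 kcal/mol for TargetDiff and +1.05 kcal/mol for SimpleSBDD, with TargetDiff's validity falling by 7.80 percentage points. TargetDiff's predicted affinity therefore weakens roughly twice as much as SimpleSBDD's.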

Significance and Contributions

The paper positions ShallowBench as a necessary tool to expose the limitations of current generative biology models. The authors claim the following contributions:

Benchmark Creation: The introduction of a strictly curated benchmark of 5,780 shallow-pocket targets, filling a critical gap in the field where no large, dedicated dataset for low-concavity targets previously existed.
Methodological Innovation: The development of a two-step volumetric approach using Alpha Shape "lid" calculations to effectively isolate low-concavity interfaces while preserving binding-relevant surface area.
Resource Provision: The release of a rigorously split training and testing dataset (via Hugging Face) designed to prevent homology leakage, enabling researchers to fine-tune models specifically for challenging targets.
Empirical Evidence of Vulnerability: The demonstration that state-of-the-art SBDD models suffer from systematic performance degradation on flat surfaces. The results highlight that current architectures, trained on deep-pocket biases, struggle to navigate non-traditional binding sites, necessitating new architectural innovations or loss functions.

The authors conclude that ShallowBench provides a rigorous baseline for evaluating generative models and underscores the urgent need for architectural advancements capable of handling the unique physical constraints of shallow-pocket targets.

ShallowBench: Benchmarking Generative Drug Design Models on Shallow-Pocket Targets