Original authors: Pawel Dabrowski-Tumanski, Bartosz Topolski, Dariusz Plewczynski, Tomasz Jetka

Published 2026-06-01

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Idea: It's Not the Mountain, It's the Map

Imagine you are a hiker trying to predict the terrain of a mountain range (the "Activity Landscape"). You know that sometimes, two hikers standing very close together might be at vastly different altitudes—one is on a sunny peak, the other in a deep, dark valley. In chemistry, this is called an Activity Cliff: two molecules that look almost identical but have very different biological effects.

For a long time, scientists thought these cliffs were just a natural feature of the molecules themselves.

This paper argues that this view is wrong. The authors claim that whether you see a cliff or a smooth slope depends heavily on how you draw the map.

If you use a map that measures distance by "walking through walls" (a specific mathematical method), two hikers might look far apart. If you use a map that measures distance by "flying in a straight line," those same hikers might look right next to each other. The paper argues that the "cliff" isn't always in the molecule; sometimes, it's an illusion created by the ruler you chose to measure it.

The Experiment: The Six-Step Detective Pipeline

To prove this, the researchers built a "six-step detective pipeline" to test 15 different types of maps (representations) and rulers (metrics) across three different biological targets (like different types of locks the molecules try to open).

Here is what they found at each step, translated into analogies:

1. The "Zero-Distance" Trap (Geometry)

The Test: Do different molecules look exactly the same on the map?
The Finding: Some maps (like "ChemBERTa") are so blurry that almost every molecule looks like it's standing in the exact same spot. It's like a map where every city is drawn on top of the same dot. Other maps (like "Morgan fingerprints") are sharp and distinct, but they treat 3D twins (stereoisomers) as identical, even though one is a left-handed glove and the other is a right-handed glove.

2. The "Cliff Hunt" (Enrichment)

The Test: If you look at the most similar-looking pairs of molecules (say, the top few percent), how many of them are actually cliffs?
The Finding: This is where the maps disagree wildly. On the same dataset (SARS-CoV-2, at 90% similarity), one map found 142 cliffs, while another found 7,903 cliffs.
The Metaphor: It's like looking for potholes in a road. One map says, "There are no potholes here, just a smooth road." Another map says, "It's a minefield!" The road didn't change; the map did.

3. The "Steepness" Check (Gradients)

The Test: How sudden are the drops in the landscape?
The Finding: Some maps show a landscape that is mostly smooth with gentle slopes. Others show a landscape full of sudden, terrifying drops. Interestingly, the "Dopamine D2" target (a specific protein) seemed to have a naturally rougher landscape than the others, no matter which map you used.

4. The "Island" Test (Topology)

The Test: Do the cliffs form distinct islands, or are they all mashed together in one big blob?
The Finding: Good maps show cliffs as distinct islands, which helps scientists understand why the cliff exists (e.g., "Oh, this whole group of molecules fails because of this specific shape"). Bad maps collapse everything into a single, confusing blob where you can't tell anything apart.

5. The "Prediction" Game (Machine Learning)

The Test: Can a computer learn to predict cliffs just by looking at the map?
The Finding: If the map is blurry (like the "ChemBERTa" map), the computer gets confused and guesses randomly. If the map has clear structure, the computer can learn the patterns. This confirmed that the "cliff" is a property of the map's geometry, not just the biology.

6. The "Real World" Check (Stereoisomers & Pairs)

The Test: They looked at two specific, real-world scenarios:
- Stereoisomers: Molecules that are mirror images (like left and right hands).
- Matched Pairs: Molecules that differ by just one tiny chemical swap.
The Finding:
- Fingerprints (old-school maps) are terrible at seeing mirror images (they think left and right hands are the same) but great at seeing tiny chemical swaps.
- Learned Embeddings (AI maps) can be good at seeing mirror images, although in this study only MolFormer did so convincingly, and they sometimes miss the tiny swaps.
- Conclusion: No single map is perfect at everything.

The Main Takeaways

1. There is no "Best" Map
The paper concludes that you cannot just pick one "best" way to measure molecules.

If you want to find cliffs between molecules that look very similar (high similarity), Morgan fingerprints are the best.
If you need to tell the difference between left-handed and right-handed molecules (stereochemistry), MolFormer is the only one that works well.
If you are looking at tiny chemical swaps, MACCS or RDKit fingerprints are best.

2. The "Cliff" is a Choice
When a scientist says, "These two molecules are an activity cliff," they are actually saying, "These two molecules are an activity cliff according to the specific map and ruler I chose." If you change the map, the cliff might disappear or appear out of nowhere.

3. The "No Free Lunch" Rule
Just like in economics, there is no "free lunch" in chemistry. You can't have a map that is perfect at seeing mirror images, perfect at seeing tiny swaps, and perfect at predicting cliffs all at once. Different maps highlight different features of the molecular world.

Summary

This paper is a warning to scientists: Don't trust the map blindly. The way you choose to visualize and measure molecules fundamentally changes the story you tell about how they work. To understand the true nature of a drug, you need to know which "lens" you are looking through, because the lens itself creates the cliffs you see.

Technical Summary: The Geometry of Activity Cliffs

Problem Statement

Activity cliffs—pairs of structurally similar compounds exhibiting large differences in biological potency—are widely regarded as intrinsic features of chemical datasets that define the boundaries of predictability in structure-activity relationships (SAR). However, the definition of an activity cliff is operational, relying on two user-defined thresholds: a potency gap (typically $\ge$ 1 log unit) and a structural similarity cutoff.

The central problem addressed in this work is that structural similarity is not an intrinsic property of a molecule pair but a property of the metric space in which molecules are embedded. Consequently, the choice of molecular representation (embedding) and similarity metric fundamentally dictates which pairs qualify as cliffs, how many exist, and whether they are predictable. The authors argue that the field has converged on Morgan fingerprints with Tanimoto similarity as a default without systematically characterizing how different representations organize the activity landscape. This lack of systematic study leads to conclusions about activity landscapes that may reflect the choice of metric rather than the underlying biology.
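To make the operational definition concrete, here is a minimal sketch of the two-threshold cliff test. The function names, the 0.9 similarity cutoff, and the toy fingerprints are illustrative assumptions, not the authors' code or data; only the 1 log unit potency gap comes from the text above.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_cliff(sim: float, pki_a: float, pki_b: float,
             sim_cutoff: float = 0.9, gap: float = 1.0) -> bool:
    """Operational cliff test: similar enough AND potency gap large enough."""
    return sim >= sim_cutoff and abs(pki_a - pki_b) >= gap


# Hypothetical toy data: the same two molecules under two made-up fingerprints.
pki_a, pki_b = 8.2, 6.5                                        # gap of 1.7 log units
fp1_a, fp1_b = frozenset(range(20)), frozenset(range(1, 20))   # 19 shared bits of 20
fp2_a, fp2_b = frozenset(range(10)), frozenset(range(4, 14))   # 6 shared bits of 14

cliff_under_fp1 = is_cliff(tanimoto(fp1_a, fp1_b), pki_a, pki_b)
cliff_under_fp2 = is_cliff(tanimoto(fp2_a, fp2_b), pki_a, pki_b)
print(cliff_under_fp1, cliff_under_fp2)
```

The potency values never change between the two calls; only the representation does, and with it the verdict on whether the pair counts as a cliff.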

Methodology

The authors propose a six-step analysis pipeline designed to systematically test the hypothesis that activity cliffs are a convolution of representation geometry and target biology. This pipeline probes geometrically distinct properties of the activity landscape, ordered by scale and logical dependence. Failure at an earlier step renders subsequent steps uninterpretable.

The pipeline was applied to fifteen (embedding, metric) configurations across three bioactivity datasets (SARS-CoV-2 Main Protease, Factor Xa, and Dopamine D2 receptor), known for their activity cliff challenges. The configurations included:

Classical Fingerprints: Morgan (radius 2, 1024 bits), RDKit topological, and MACCS keys (166 bits).
Learned Embeddings: MolFormer, ChemBERTa, and Chemeleon (MPNN trained on Mordred descriptors).
Metrics: Tanimoto, Dice, Cosine, L1, and L2 distances.
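For readers unfamiliar with the five metrics, the sketch below gives their standard textbook definitions in plain Python. It is not taken from the paper: which metric was paired with which embedding is not stated here, and the function names and toy vectors are assumptions. Tanimoto and Dice are written as similarities; the corresponding distance is commonly taken as one minus the similarity.

```python
import math


def tanimoto(a, b):
    """Shared on-bits divided by bits that are on in either vector."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def dice(a, b):
    """Twice the shared on-bits divided by the total number of on-bits."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    return 2 * both / total if total else 1.0


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v (both assumed non-zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def l1(u, v):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(u, v))


def l2(u, v):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Toy binary vectors (made up): three on-bits each, two of them shared.
a = [1, 1, 1, 0]
b = [1, 1, 0, 1]
results = {
    "tanimoto": tanimoto(a, b),
    "dice": dice(a, b),
    "cosine_distance": cosine_distance(a, b),
    "l1": l1(a, b),
    "l2": l2(a, b),
}
print(results)
```

Even on this four-bit example the five functions return five different numbers for the same pair, which is the starting point for the representation dependence the pipeline goes on to measure.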

The Six-Step Pipeline

1. Pairwise Distance Geometry: Analyzes the distribution of pairwise distances to identify fundamental limitations. Metrics include the fraction of zero-distance pairs ($p_0$), coefficient of variation (CV) for discriminative range, relative contrast (RC), and hubness skewness ($S_{Nk}$) to detect neighborhood reliability issues.
2. Activity Cliff Enrichment: Evaluates the cumulative fraction of cliffs ($F(n)$) among the top $n\%$ most similar pairs. A lower curve indicates better performance (fewer cliffs among similar pairs). The enrichment coefficient $G$ quantifies the magnitude of cliff depletion.
3. Activity Gradient Distribution: Computes the Structure-Activity Landscape Index (SALI), $L(i,j) = |\Delta pK_i| / d(x_i, x_j)$, for all pairs. The distribution of these gradients is fitted to a Kohlrausch–Williams–Watts (KWW) survival function to determine the shape parameter $b$. $b=2$ indicates a smooth, light-tailed landscape (Rayleigh ceiling), while $b<2$ indicates heavy tails and frequent extreme gradients.
4. Persistent Homology of the Cliff Subspace: Uses Vietoris–Rips filtration on cliff-involved molecules to track connected components ($H_0$). Mean persistence ($\mu_{pers}$) and maximum persistence ($p_{max}$) measure the topological separation of cliff-prone clusters.
5. Geometric Probes of Representational Structure: Trains classifiers (Logistic Regression, XGBoost, Siamese networks) on the absolute embedding difference $|e_i - e_j|$ to predict cliff existence. Gap statistics ($\Delta_{lin}$ and $\Delta_{arch}$) characterize the linear vs. non-linear and feature-interaction richness of the embedding space.
6. Chemical Ground Truth Benchmarking: Validates representations against two structurally defined sub-populations independent of the pipeline's own similarity measure:
- Stereoisomers: Pairs with identical graphs but different 3D configurations.
- Matched Molecular Pairs (MMPs): Pairs related by a single chemical transformation.
Performance is ranked by the coefficient of variation (CV) of the distance distribution among cliff pairs within these sub-populations.
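Two of the quantities in the pipeline are simple enough to sketch directly: the Step 1 statistics $p_0$ and CV, and the SALI gradient used in the gradient-distribution step. The code below is an illustration under stated assumptions, not the authors' implementation: the CV uses the population standard deviation, zero-distance pairs are skipped when computing SALI (the ratio is undefined there), and the embeddings and potencies are made up.

```python
from itertools import combinations
from statistics import mean, pstdev


def pairwise(items, dist):
    """All unordered pairs as (i, j, distance) tuples."""
    return [(i, j, dist(items[i], items[j]))
            for i, j in combinations(range(len(items)), 2)]


def zero_fraction_and_cv(pairs):
    """Fraction of zero-distance pairs (p0) and coefficient of variation of distances."""
    d = [dij for _, _, dij in pairs]
    p0 = sum(1 for x in d if x == 0) / len(d)
    cv = pstdev(d) / mean(d)          # assumption: population standard deviation
    return p0, cv


def sali(pairs, pki):
    """|delta pKi| / distance per pair; zero-distance pairs are skipped (assumption)."""
    return [abs(pki[i] - pki[j]) / dij for i, j, dij in pairs if dij > 0]


def l1(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))


# Hypothetical toy embeddings and potencies; the first two molecules coincide.
embeddings = [(0, 0), (0, 0), (1, 0), (1, 2)]
pki = [5.0, 7.0, 6.0, 8.0]

pairs = pairwise(embeddings, l1)
p0, cv = zero_fraction_and_cv(pairs)
gradients = sali(pairs, pki)
print(p0, cv, gradients)
```

The coinciding pair here has a potency gap of 2 log units yet contributes no finite gradient at all, which illustrates why a representation with many zero-distance pairs makes the later steps hard to interpret.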

Key Results

1. Representation Dependence of Cliff Counts

The choice of representation drastically alters the observed number of activity cliffs. On the SARS-CoV-2 dataset at 90% similarity, the number of identified cliff pairs varied by a factor of 55 across configurations:

Morgan Tanimoto: 142 pairs.
Chemeleon Cosine: 752 pairs.
RDKit Dice: 7,903 pairs.
This demonstrates that the "cliffiness" of a dataset is largely a geometric artifact of the chosen representation.
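Part of this sensitivity can be seen with pencil and paper. As a side calculation (not taken from the paper), take two binary fingerprints with on-bit sets $A$ and $B$. The Tanimoto and Dice similarities are then linked by a fixed identity:

$$
T = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}, \qquad
D = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad
D = \frac{2T}{1 + T}, \qquad
T = \frac{D}{2 - D}.
$$

Since $D \ge T$ on $[0, 1]$, a fixed cutoff admits more pairs under Dice than under Tanimoto on the same fingerprint. If the 90% figure is read as a fixed similarity cutoff, which is an assumption about how the threshold was applied, then

$$
D \ge 0.9 \iff T \ge \frac{0.9}{2 - 0.9} \approx 0.818 .
$$

This accounts only for the metric half of each configuration. The counts listed above also differ in the fingerprint itself, so the identity should not be taken as the full explanation of the spread.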

2. Performance by Representation Type

Morgan Tanimoto: Exhibits the strongest cliff enrichment ($G$) and cross-scaffold generalization. Its geometry is bimodal (Beta-distributed), organizing space around scaffold identity. However, it suffers from complete stereochemical blindness ($p_{0,stereo} = 100\%$).
MolFormer Cosine: The only configuration demonstrating meaningful stereochemical sensitivity (high CV for stereoisomers, $p_{0,stereo} = 0$). It encodes stereocenter information as directional variation, making cosine distance (sensitive to angular differences) superior to L1/L2.
MACCS and RDKit Dice: Most sensitive to matched-molecular-pair (MMP) transformations, achieving the highest CV for MMPs. They encode fragment-level patterns effectively but share the stereochemical blindness of other fingerprints.
ChemBERTa: Fails uniformly across all criteria due to "embedding collapse." It produces severely concentrated distances (low CV, high hubness), resulting in a geometrically degenerate space where most molecules appear similar regardless of activity.
Chemeleon: Produces the richest topological cliff structure (high persistence) but shows dramatic metric dependence; L1/L2 distances collapse topologically on the Dopamine D2 target, while Cosine retains structure.
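The stereochemical blindness figure $p_{0,stereo}$ is the zero-distance fraction restricted to stereoisomer pairs. A minimal sketch of that computation follows; the bit sets are invented stand-ins rather than real fingerprints, and the "chirality bits" 100 and 101 are purely hypothetical.

```python
def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """One minus Tanimoto similarity on sets of on-bit indices."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)


def zero_distance_fraction(pairs, dist) -> float:
    """Fraction of pairs the representation cannot tell apart (distance exactly 0)."""
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(1 for a, b in pairs if dist(a, b) == 0) / len(pairs)


# Hypothetical stereoisomer pairs under a representation with no chirality
# information: both isomers receive exactly the same bits.
achiral_pairs = [
    (frozenset({1, 5, 9}), frozenset({1, 5, 9})),
    (frozenset({2, 3}), frozenset({2, 3})),
]
# The same pairs under a made-up chirality-aware encoding, where one extra bit
# (100 or 101) records the configuration.
chiral_pairs = [
    (frozenset({1, 5, 9, 100}), frozenset({1, 5, 9, 101})),
    (frozenset({2, 3, 100}), frozenset({2, 3, 101})),
]

p0_achiral = zero_distance_fraction(achiral_pairs, tanimoto_distance)
p0_chiral = zero_distance_fraction(chiral_pairs, tanimoto_distance)
print(p0_achiral, p0_chiral)
```

A value of 1.0 corresponds to the 100% blindness reported for Morgan Tanimoto, and 0.0 to the behavior reported for MolFormer Cosine.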

3. Target-Level Landscape Roughness

The analysis reveals intrinsic differences in target landscapes independent of representation:

SARS-CoV-2: The smoothest landscape (highest $b$ values, approaching the Rayleigh ceiling $b=2$).
Factor Xa: Intermediate roughness.
Dopamine D2: The roughest landscape. No configuration reached $b=2$ on this target, indicating structured discontinuities persist regardless of embedding. The authors attribute this to the conformational flexibility of GPCRs and the aggregation of heterogeneous assay data in ChEMBL.
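To show how a shape parameter $b$ can be read off a set of gradients, the sketch below fits the standard stretched-exponential (KWW) survival form $S(x) = \exp(-(x/\tau)^b)$ through the linearization $\ln(-\ln S) = b \ln x - b \ln \tau$. The fitting procedure used in the paper is not described here, so this least-squares approach, the midpoint plotting positions, and the scale symbol $\tau$ are all assumptions. With $b = 2$ the form reduces to a Rayleigh survival function, which is why that value is described as the ceiling.

```python
import math


def fit_kww(values):
    """Least-squares fit of S(x) = exp(-(x / tau)**b); returns (b, tau).

    Regresses ln(-ln S) on ln(x), where S is the empirical survival function
    evaluated at midpoint plotting positions.
    """
    xs = sorted(v for v in values if v > 0)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two positive values")
    lx, ly = [], []
    for rank, x in enumerate(xs):
        surv = 1.0 - (rank + 0.5) / n        # strictly between 0 and 1
        lx.append(math.log(x))
        ly.append(math.log(-math.log(surv)))
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (w - my) for u, w in zip(lx, ly))
    b = sxy / sxx                            # slope of the regression line
    tau = math.exp(mx - my / b)              # intercept equals -b * ln(tau)
    return b, tau


def kww_quantiles(b, tau, n):
    """Synthetic sample: exact quantiles of the KWW form (for checking the fit)."""
    return [tau * (-math.log(1.0 - (i + 0.5) / n)) ** (1.0 / b) for i in range(n)]


b_smooth, tau_smooth = fit_kww(kww_quantiles(b=2.0, tau=1.5, n=200))
b_rough, tau_rough = fit_kww(kww_quantiles(b=1.0, tau=1.5, n=200))
print(round(b_smooth, 6), round(b_rough, 6))
```

On real SALI values the points will not fall exactly on a line, so a robust or maximum-likelihood fit may be preferable. The synthetic check only confirms that the linearization recovers a known $b$.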

4. Non-Redundancy of Pipeline Steps

Each step revealed failure modes invisible to others. For instance, RDKit showed high discriminative range (Step 1) but poor cliff enrichment (Step 2) and heavy gradient tails (Step 3). Persistent homology (Step 4) revealed topological collapses in RDKit and Chemeleon that were not fully captured by pairwise statistics.
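For the persistent homology step, the $H_0$ part of a Vietoris–Rips filtration can be computed without a topology library: every point starts as its own component at scale 0, and the scales at which components merge are the edge weights of a minimum spanning tree. The sketch below uses Kruskal's algorithm with a union-find structure. It assumes the convention that an edge enters the filtration at a scale equal to the pairwise distance, and it reports a plain mean and maximum of the finite merge scales; any normalization the authors apply to $\mu_{pers}$ and $p_{max}$ is unknown here.

```python
from itertools import combinations


def h0_persistence(points, dist):
    """Finite H0 death scales of a Vietoris-Rips filtration (MST edge weights)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # this edge merges two components
            parent[ri] = rj
            deaths.append(d)         # one component dies at scale d
    return deaths


# Hypothetical 1-D "embedding": two tight clusters far from each other.
points = [0, 1, 2, 50, 51]
deaths = h0_persistence(points, lambda a, b: abs(a - b))
mean_persistence = sum(deaths) / len(deaths)
max_persistence = max(deaths)
print(deaths, mean_persistence, max_persistence)
```

Well-separated clusters show up as one merge scale much larger than the rest. A collapsed space, where all merge scales are similar and small, would give low mean and maximum persistence.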

Significance and Claims

The paper claims that activity cliffs are not intrinsic properties of molecule pairs but are emergent properties of the chosen (embedding, metric) pair. The authors do not propose a single "best" representation; rather, they argue that different representations encode different, partially non-overlapping aspects of molecular recognition:

Fingerprints excel at scaffold and fragment-level transformations but fail at stereochemistry.
Learned embeddings (specifically with Cosine distance) excel at stereochemical sensitivity but may lack the fragment-level specificity of fingerprints for MMPs.
No "Free Lunch": No single configuration excels on all criteria simultaneously.

The significance of this work lies in providing a framework to diagnose the geometric properties of activity landscapes. It suggests that selecting a representation without characterizing its geometry leads to conclusions that reflect the metric rather than biology. The authors propose that the field should move away from a universal default (Morgan/Tanimoto) toward task-specific selection:

Use Morgan Tanimoto for SAR analysis within structural series.
Use MolFormer Cosine for stereochemistry-sensitive tasks.
Use MACCS/RDKit Dice for MMP transformation annotation.
Use Chemeleon Cosine for global topological exploration.

Finally, the paper suggests that the "roughness" of a target's activity landscape (e.g., the intrinsic difficulty of predicting Dopamine D2 activity) can be identified through consensus across multiple representations, distinguishing biological complexity from representation artifacts.

The Geometry of Activity Cliffs: Representation Dependence and Multi-Scale Characterization of Activity Landscapes