This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
The Great Enzyme Illusion: Why AI Predictions Were Cheating
Imagine you are trying to teach a robot how to match keys to locks. In the world of biology, enzymes are the locks, and small molecules (like drugs or nutrients) are the keys. Scientists want to build an AI that can look at a new lock and a new key and instantly say, "Yes, this key fits!" or "No, this key won't work."
For a while, it looked like we had built a genius AI. Several computer models claimed they could predict these matches with 95% accuracy. They were hailed as breakthroughs that could revolutionize drug discovery.
But this paper, written by researchers Vahid Atabaigi Elmi, Roman Joeres, and Olga Kalinina, pulls back the curtain to reveal a dirty secret: The AI wasn't actually learning how locks and keys work. It was just memorizing the answers from a cheat sheet.
Here is the story of how they caught the models cheating, explained simply.
1. The Setup: The "Cheat Sheet" Problem
To train an AI, you give it a huge list of examples: "Lock A fits Key B," "Lock C does not fit Key D."
The problem arises when you split this list into a Training Set (for the AI to study) and a Test Set (to check if it learned).
In the popular dataset used by these famous models (called the ESP dataset), the scientists made a mistake in how they shuffled the cards. They made sure the locks (enzymes) in the test set were different from the ones in the training set. They thought, "Great! The AI has never seen these specific locks before, so if it gets them right, it's truly smart."
But they forgot about the keys.
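To make the flaw concrete, here is a tiny sketch in Python. The data is hypothetical (the enzyme and molecule names are made up for illustration), but the split mimics the ESP dataset's mistake: test enzymes never appear in training, yet nothing stops the same molecules from showing up on both sides.

```python
# Toy illustration (hypothetical data) of how splitting only by enzyme
# still leaks small molecules between the training and test sets.
pairs = [
    ("enzyme_A", "mol_X"), ("enzyme_B", "mol_X"),  # same key, two different locks
    ("enzyme_C", "mol_Y"), ("enzyme_D", "mol_Z"),
]

# Enzyme-disjoint split: test enzymes never appear in training.
train = [p for p in pairs if p[0] in {"enzyme_A", "enzyme_C"}]
test = [p for p in pairs if p[0] in {"enzyme_B", "enzyme_D"}]

train_mols = {mol for _, mol in train}
test_mols = {mol for _, mol in test}

# The enzymes are disjoint, but the molecules are not:
leaked = train_mols & test_mols
print(leaked)  # mol_X shows up on both sides -> information leakage
```

Even though every test enzyme is brand new, `mol_X` sits in both sets, which is exactly the "familiar key" the models latched onto.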
2. The Analogy: The "Famous Key"
Imagine you are taking a math test.
- The Training Set: You study a list of problems. One problem asks: "What is 2 + 2?" The answer is 4.
- The Test Set: You are given a new problem: "What is 5 + 5?" But wait! The test also includes a problem you already studied, word for word: "What is 2 + 2?"
If your AI sees "2 + 2" in the test set, it doesn't need to know math. It just remembers, "Oh, I saw this exact question in my homework! The answer is 4!"
In the enzyme world, the "2 + 2" is a small molecule (the key).
- The AI was trained on a specific key (let's call it "Key X") interacting with Lock A.
- In the test set, they gave the AI "Key X" again, but this time paired with a new Lock B.
- The AI didn't figure out how Lock B works. It just said, "I know Key X! It works!"
Because the same keys kept showing up in both the study and the test, the AI looked like a genius. It was actually just cheating by recognizing familiar keys rather than understanding the biology.
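The cheating strategy described above can itself be written as a trivial "model". This is a hypothetical sketch (the function names and toy labels are invented for illustration): a baseline that ignores the enzyme entirely and just remembers what label each molecule had during training.

```python
# Hypothetical "key-memorization" baseline: it never looks at the lock
# (enzyme); it only remembers the label it saw for each key (molecule).
from collections import defaultdict

def fit(train_pairs):
    """train_pairs: list of ((enzyme, molecule), label) tuples."""
    memory = defaultdict(list)
    for (_, mol), label in train_pairs:
        memory[mol].append(label)
    # Majority vote per molecule; the enzyme plays no role at all.
    return {mol: max(set(labels), key=labels.count) for mol, labels in memory.items()}

def predict(memory, enzyme, mol):
    # An unseen molecule leaves it with nothing better than a default guess.
    return memory.get(mol, 0)

# Train on (Lock A, Key X) fitting; test on the NEW Lock B with the SAME Key X.
memory = fit([(("lock_A", "key_X"), 1), (("lock_C", "key_Y"), 0)])
print(predict(memory, "lock_B", "key_X"))  # 1 -- "correct" without any biology
```

On a leaky split, a lookup table like this scores well; on a split with truly unseen molecules, it collapses to guessing.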
3. The Investigation: Removing the Cheat Sheet
The authors of this paper decided to fix the test. They used a new tool called DataSAIL to reshuffle the data.
Think of DataSAIL as a strict proctor who ensures that no key used in the test set was ever seen in the training set, and no lock was either. They created a "True Out-of-Distribution" test.
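Here is a simplified stand-in for that idea (this is not DataSAIL's actual API, just a minimal sketch of a two-dimensional split with hypothetical names): a pair only lands in the test set if both its enzyme and its molecule were never seen in training, and mixed pairs are discarded.

```python
# Simplified sketch (not DataSAIL's real interface) of a strict two-dimensional
# split: test pairs must contain an unseen enzyme AND an unseen molecule.
def strict_split(pairs, train_enzymes, train_mols):
    train = [(e, m) for e, m in pairs if e in train_enzymes and m in train_mols]
    test = [(e, m) for e, m in pairs if e not in train_enzymes and m not in train_mols]
    # Pairs mixing a seen entity with an unseen one are dropped entirely,
    # so nothing familiar can leak into the test set.
    return train, test

pairs = [("enz_A", "mol_X"), ("enz_B", "mol_X"), ("enz_C", "mol_Y"), ("enz_D", "mol_Z")]
train, test = strict_split(pairs, {"enz_A", "enz_C"}, {"mol_X", "mol_Y"})
print(test)  # [('enz_D', 'mol_Z')] -- enz_B/mol_X is dropped because mol_X was seen
```

The price of this strictness is a smaller test set, but what remains actually measures generalization rather than memory.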
- Old Test: The AI saw familiar keys. Score: 95% (Looks amazing!).
- New Strict Test: The AI saw only brand new keys and brand new locks. Score: ~50% (This is basically a random guess, like flipping a coin).
When they removed the "familiar keys" (the information leakage), the models' performance crashed. They went from being "super-AIs" to being barely better than a coin toss.
4. The Results: A Reality Check
The paper tested three famous models: ESP, ProSmith, and FusionESP.
- On the old, leaky test: They all looked incredible, with accuracy scores near 0.95.
- On the new, strict test:
- FusionESP (the "best" model) dropped to an accuracy of roughly 0.55.
- ProSmith dropped to 0.58.
- ESP dropped to 0.54.
In the world of binary predictions (Yes/No), a score of 0.5 is random guessing. The models had lost their "magic." The authors concluded that these models were excellent at memorizing patterns but terrible at actually understanding how enzymes and molecules interact.
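Why does 0.5 mean "random guessing"? A quick simulation makes the baseline concrete: a coin-flip predictor on balanced Yes/No labels lands at about 50% accuracy, which is the floor the reshuffled models barely cleared.

```python
# A coin-flip predictor on balanced binary labels scores ~0.5 accuracy,
# which is the baseline the strictly-split models barely beat.
import random

random.seed(0)
labels = [random.randint(0, 1) for _ in range(10_000)]   # balanced yes/no ground truth
guesses = [random.randint(0, 1) for _ in range(10_000)]  # coin-flip "model"
accuracy = sum(g == y for g, y in zip(guesses, labels)) / len(labels)
print(round(accuracy, 2))  # approximately 0.5
```

Against this yardstick, scores of 0.54-0.58 mean the models retain only a sliver of real signal once the leakage is gone.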
5. Why Does This Matter?
This is a huge wake-up call for the field of drug discovery.
If we rely on these models to find new medicines, we might waste millions of dollars testing drugs that the AI thinks will work, simply because it "remembers" similar molecules, when in reality they fail on the new biological targets.
The Takeaway:
The paper doesn't say AI is useless. It says we have been too easy on ourselves. We have been testing our AI on easy questions whose answers were hidden in the room. Now that we've locked the door and handed the AI a genuinely new exam, we see that it still has a lot of learning to do.
In short: The models weren't smart; they just had a really good memory. And in science, a good memory isn't the same as understanding.