This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine a massive, high-stakes Olympics for Drug Discovery.
In this Olympics, scientists from all over the world bring their best computer programs (AI models) to a stadium called TDC (Therapeutics Data Commons). The goal? To predict how new drugs will behave in the human body—will they be absorbed? Will they be toxic? Will they work?
The TDC provides a giant scoreboard (the "Leaderboard") where the top-performing AI models get gold, silver, and bronze medals. Everyone assumes these winners are the smartest, most reliable tools for saving lives.
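For the technically curious, entering this stadium takes only a few lines of Python. Here is a minimal sketch using the public PyTDC package; the benchmark name Caco2_Wang is just one example event, and the exact API may vary between package versions.

```python
# Minimal sketch of pulling one TDC ADMET benchmark.
# Assumes the PyTDC package is installed (pip install PyTDC);
# "Caco2_Wang" is one example absorption benchmark.
from tdc.benchmark_group import admet_group

group = admet_group(path="data/")      # downloads the benchmark files locally
benchmark = group.get("Caco2_Wang")    # one "event" in the Olympics

train_val = benchmark["train_val"]     # molecules a model may study
test = benchmark["test"]               # held-out molecules it is judged on

print(len(train_val), "training molecules,", len(test), "test molecules")
```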
But Ihor Koleiev and his team from Receptor.AI decided to act like the ultimate "Game Inspectors." They didn't just look at the scores; they went behind the scenes to check if the games were actually being played fairly. They found that while the scoreboard looks impressive, the reality is a bit messy.
Here is what they discovered, broken down into simple stories:
1. The "Ghost" Athletes (Reproducibility Issues)
Imagine a runner crossing the finish line first, but when you ask to see their shoes or hear their race strategy, they say, "Oh, I lost my shoes," or "My strategy is a secret, and also, my shoes only work on Tuesdays."
The researchers tried to run the code for the top 10 models on the leaderboard.
- The Problem: For many of the "winners," the code was broken, missing, or impossible to install. It was like buying a ticket to a movie only to find the projector doesn't work.
- The Result: Only 3 of the top 10 models could actually be run and verified. The rest were "ghosts": they existed on paper, but you couldn't actually use them.
2. The "Cheating" Athletes (Data Leakage)
Imagine a math test where the teacher accidentally leaves the answer key on the desk in the classroom before the exam starts. If a student memorizes the answers, they get a perfect score, but they aren't actually smart; they just cheated.
The researchers found that several top models had done exactly this.
- The Problem: The models were trained on data that secretly included the "test questions" (the test set).
- MiniMol: This model was trained on a massive library of molecules. The creators said, "We removed the test molecules!" But the researchers found that because the same molecule can be written in several different ways (like writing a name with or without a middle initial), the model still "saw" the test answers during training. (See the sketch after this list.)
- GradientBoost & XGBoost: These models accidentally split their data the wrong way, so questions from the final "test" ended up in the material they studied from.
- The Result: These models got inflated scores. They looked like geniuses, but they were just memorizing the test. When the researchers fixed the cheating, these models dropped from 1st place to 3rd or even 8th place.
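To make MiniMol's "middle initial" problem concrete, here is a minimal sketch of the kind of check that catches it. This is an illustration with made-up toy molecules, not the authors' actual audit code: it uses RDKit to convert every molecule string to one canonical form before comparing the training and test sets.

```python
# Minimal leakage check: the same molecule can be written as several
# different SMILES strings, so naive string comparison misses duplicates.
# Illustrative sketch only; toy data, not the paper's audit code.
from rdkit import Chem

def canonical(smiles: str) -> str:
    """Return RDKit's canonical SMILES so identical molecules compare equal."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

# Toy data: "OCC" and "CCO" both encode ethanol, written two different ways.
train_smiles = ["OCC", "c1ccccc1"]   # pretraining library
test_smiles = ["CCO"]                # benchmark test set

train_canon = {canonical(s) for s in train_smiles}
leaked = [s for s in test_smiles if canonical(s) in train_canon]

print("Leaked test molecules:", leaked)  # -> ['CCO']: a hidden duplicate
```

A naive string comparison would report no overlap here, because "OCC" and "CCO" look different even though they are the same molecule.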
3. The "Perfect Score" Trap (Overfitting)
Imagine a student who studies only for the specific questions on last year's test. They get 100% on that test. But if you give them a new test with different questions, they fail miserably.
The researchers built their own "honest" AI models and then created a "cheating" version that was allowed to peek at the test answers while studying.
- The Experiment: They let the "cheating" model tune itself specifically for the TDC test set.
- The Result: The cheating model shot up the leaderboard, jumping from the middle of the pack to the top 3 in nearly half of the categories.
- The Lesson: This proves that the leaderboard is fragile. If you tweak your model just enough to fit the specific "test set" (even by accident), you can fake a gold medal.
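Here is a toy version of that experiment, sketched with scikit-learn on synthetic data (a stand-in for the real benchmark and models, not the paper's code). The only difference between the "honest" and "cheating" pipelines is which dataset the hyperparameter is chosen on.

```python
# Toy illustration of overfitting model selection to the test set.
# Synthetic data and scikit-learn stand in for the real benchmark.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=20, noise=10.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_honest, best_cheat = None, None
for depth in [2, 4, 8, None]:
    model = RandomForestRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    valid_score = model.score(X_valid, y_valid)  # honest: tune on validation
    test_score = model.score(X_test, y_test)     # cheating: tune on the test set
    if best_honest is None or valid_score > best_honest[0]:
        best_honest = (valid_score, depth, test_score)
    if best_cheat is None or test_score > best_cheat[0]:
        best_cheat = (test_score, depth)

print(f"Honest pick (max_depth={best_honest[1]}): test R^2 = {best_honest[2]:.3f}")
print(f"Cheating pick (max_depth={best_cheat[1]}): test R^2 = {best_cheat[0]:.3f}")
# The "cheating" number can never be worse than the honest one: it was
# selected to look good on the very questions it is graded on.
```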
4. The Few Honest Winners
Out of all the chaos, only three models (CaliciBoost, MapLight, and MapLight+GNN) passed every single check.
- They had working code.
- They didn't cheat.
- Their scores held up when re-tested.
- The Takeaway: These are the only ones we can truly trust right now. The others might be great, but we can't prove it yet.
5. The Missing Rulebook (Versioning)
Finally, the researchers pointed out a flaw in the stadium itself. The TDC changes its datasets (the "questions") over time but doesn't tell anyone exactly when or what changed.
- The Analogy: It's like a cooking competition where the ingredients keep changing, but the judges don't announce it. A chef might win today with "Tomatoes," but tomorrow the judge swaps them for "Potatoes," and the chef's recipe fails.
- The Consequence: Because the data isn't locked down with a specific "version number," it's hard to know if a model is actually good or if it just happened to match a specific, temporary version of the data.
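One common fix, sketched below with only Python's standard library (a generic illustration, not something TDC currently provides; the file path is hypothetical): publish a cryptographic fingerprint of the exact data file a score was earned on, so anyone can tell instantly whether the "ingredients" have changed.

```python
# Sketch of dataset "version locking" with a cryptographic fingerprint.
# Generic illustration using the standard library; the path is hypothetical.
import hashlib
from pathlib import Path

def dataset_fingerprint(path: str) -> str:
    """SHA-256 of the raw file bytes: any edit to the data changes this."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Publish this alongside your leaderboard score...
print(dataset_fingerprint("data/caco2_wang.csv"))

# ...so others can verify they were tested on the exact same questions:
# assert dataset_fingerprint("data/caco2_wang.csv") == PUBLISHED_HASH
```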
The Big Picture: What Should We Do?
The paper concludes that while the TDC leaderboard is a great starting point, we shouldn't blindly trust the top ranks.
To fix this, the authors suggest:
- Hide the Test Answers: Don't let the public see the test set until the very end (like a real exam).
- Lock the Rules: Use strict "version numbers" for datasets so everyone is tested on the exact same questions.
- Submit the Whole Car: Instead of just submitting a score, researchers should submit their entire "car" (code + environment) so judges can drive it themselves to see if it actually works.
In short: The current leaderboard is like a race where some runners memorized the track, some lost their shoes, and the finish line keeps moving. We need to clean up the track and check the shoes before we crown the winners.