This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine a massive, high-stakes Olympics for Drug Discovery.
In this Olympics, scientists from all over the world bring their best computer programs (AI models) to a stadium called TDC (Therapeutics Data Commons). The goal? To predict how new drugs will behave in the human body—will they be absorbed? Will they be toxic? Will they work?
The TDC provides a giant scoreboard (the "Leaderboard") where the top-performing AI models get gold, silver, and bronze medals. Everyone assumes these winners are the smartest, most reliable tools for saving lives.
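For the technically curious, entering this stadium takes only a few lines of Python. Here is a minimal sketch using the public PyTDC package; the benchmark name Caco2_Wang is just one example event, and the exact API may vary between package versions.

```python
# Minimal sketch of pulling one TDC ADMET benchmark.
# Assumes the PyTDC package is installed (pip install PyTDC);
# "Caco2_Wang" is one example absorption benchmark.
from tdc.benchmark_group import admet_group

group = admet_group(path="data/")      # downloads the benchmark files locally
benchmark = group.get("Caco2_Wang")    # one "event" in the Olympics

train_val = benchmark["train_val"]     # molecules a model may study
test = benchmark["test"]               # held-out molecules it is judged on

print(len(train_val), "training molecules,", len(test), "test molecules")
```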
But Ihor Koleiev and his team from Receptor.AI decided to act like the ultimate "Game Inspectors." They didn't just look at the scores; they went behind the scenes to check if the games were actually being played fairly. They found that while the scoreboard looks impressive, the reality is a bit messy.
Here is what they discovered, broken down into simple stories:
1. The "Ghost" Athletes (Reproducibility Issues)
Imagine a runner crossing the finish line first, but when you ask to see their shoes or hear their race strategy, they say, "Oh, I lost my shoes," or "My strategy is a secret, and also, my shoes only work on Tuesdays."
The researchers tried to run the code for the top 10 models on the leaderboard.
- The Problem: For many of the "winners," the code was broken, missing, or impossible to install. It was like buying a ticket to a movie only to find the projector doesn't work.
- The Result: Only 3 of the top 10 models could actually be run and verified. The rest were "ghosts": they existed on paper, but you couldn't actually use them.
2. The "Cheating" Athletes (Data Leakage)
Imagine a math test where the teacher accidentally leaves the answer key on the desk in the classroom before the exam starts. If a student memorizes the answers, they get a perfect score, but they aren't actually smart; they just cheated.
The researchers found that several top models had done exactly this.
- The Problem: The models were trained on data that secretly included the "test questions" (the test set).
- MiniMol: This model was trained on a massive library of molecules. The creators said, "We removed the test molecules!" But the researchers found that because the same molecule can be written in several different ways (like writing a name with or without a middle initial), the model still "saw" the test answers during training. (See the sketch after this list.)
- GradientBoost & XGBoost: These models accidentally split their data the wrong way, so questions from the final "test" ended up in the material they studied from.
- The Result: These models got inflated scores. They looked like geniuses, but they were just memorizing the test. When the researchers fixed the cheating, these models dropped from 1st place to 3rd or even 8th place.
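To make MiniMol's "middle initial" problem concrete, here is a minimal sketch of the kind of check that catches it. This is an illustration with made-up toy molecules, not the authors' actual audit code: it uses RDKit to convert every molecule string to one canonical form before comparing the training and test sets.

```python
# Minimal leakage check: the same molecule can be written as several
# different SMILES strings, so naive string comparison misses duplicates.
# Illustrative sketch only; toy data, not the paper's audit code.
from rdkit import Chem

def canonical(smiles: str) -> str:
    """Return RDKit's canonical SMILES so identical molecules compare equal."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))

# Toy data: "OCC" and "CCO" both encode ethanol, written two different ways.
train_smiles = ["OCC", "c1ccccc1"]   # pretraining library
test_smiles = ["CCO"]                # benchmark test set

train_canon = {canonical(s) for s in train_smiles}
leaked = [s for s in test_smiles if canonical(s) in train_canon]

print("Leaked test molecules:", leaked)  # -> ['CCO']: a hidden duplicate
```

A naive string comparison would report no overlap here, because "OCC" and "CCO" look different even though they are the same molecule.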
3. The "Perfect Score" Trap (Overfitting)
Imagine a student who studies only for the specific questions on last year's test. They get 100% on that test. But if you give them a new test with different questions, they fail miserably.
The researchers built their own "honest" AI models and then created a "cheating" version that was allowed to peek at the test answers while studying.
- The Experiment: They let the "cheating" model tune itself specifically for the TDC test set.
- The Result: The cheating model shot up the leaderboard, jumping from the middle of the pack to the top 3 in nearly half of the categories.
- The Lesson: This proves that the leaderboard is fragile. If you tweak your model just enough to fit the specific "test set" (even by accident), you can fake a gold medal.
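Here is a toy version of that experiment, sketched with scikit-learn on synthetic data (a stand-in for the real benchmark and models, not the paper's code). The only difference between the "honest" and "cheating" pipelines is which dataset the hyperparameter is chosen on.

```python
# Toy illustration of overfitting model selection to the test set.
# Synthetic data and scikit-learn stand in for the real benchmark.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=20, noise=10.0, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_honest, best_cheat = None, None
for depth in [2, 4, 8, None]:
    model = RandomForestRegressor(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    valid_score = model.score(X_valid, y_valid)  # honest: tune on validation
    test_score = model.score(X_test, y_test)     # cheating: tune on the test set
    if best_honest is None or valid_score > best_honest[0]:
        best_honest = (valid_score, depth, test_score)
    if best_cheat is None or test_score > best_cheat[0]:
        best_cheat = (test_score, depth)

print(f"Honest pick (max_depth={best_honest[1]}): test R^2 = {best_honest[2]:.3f}")
print(f"Cheating pick (max_depth={best_cheat[1]}): test R^2 = {best_cheat[0]:.3f}")
# The "cheating" number can never be worse than the honest one: it was
# selected to look good on the very questions it is graded on.
```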
4. The Few Honest Winners
Out of all the chaos, only three models (CaliciBoost, MapLight, and MapLight+GNN) passed every single check.
- They had working code.
- They didn't cheat.
- Their scores held up when re-tested.
- The Takeaway: These are the only ones we can truly trust right now. The others might be great, but we can't prove it yet.
5. The Missing Rulebook (Versioning)
Finally, the researchers pointed out a flaw in the stadium itself. The TDC changes its datasets (the "questions") over time but doesn't tell anyone exactly when or what changed.
- The Analogy: It's like a cooking competition where the ingredients keep changing, but the judges don't announce it. A chef might win today with "Tomatoes," but tomorrow the judge swaps them for "Potatoes," and the chef's recipe fails.
- The Consequence: Because the data isn't locked down with a specific "version number," it's hard to know if a model is actually good or if it just happened to match a specific, temporary version of the data.
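One common fix, sketched below with only Python's standard library (a generic illustration, not something TDC currently provides; the file path is hypothetical): publish a cryptographic fingerprint of the exact data file a score was earned on, so anyone can tell instantly whether the "ingredients" have changed.

```python
# Sketch of dataset "version locking" with a cryptographic fingerprint.
# Generic illustration using the standard library; the path is hypothetical.
import hashlib
from pathlib import Path

def dataset_fingerprint(path: str) -> str:
    """SHA-256 of the raw file bytes: any edit to the data changes this."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Publish this alongside your leaderboard score...
print(dataset_fingerprint("data/caco2_wang.csv"))

# ...so others can verify they were tested on the exact same questions:
# assert dataset_fingerprint("data/caco2_wang.csv") == PUBLISHED_HASH
```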
The Big Picture: What Should We Do?
The paper concludes that while the TDC leaderboard is a great starting point, we shouldn't blindly trust the top ranks.
To fix this, the authors suggest:
- Hide the Test Answers: Don't let the public see the test set until the very end (like a real exam).
- Lock the Rules: Use strict "version numbers" for datasets so everyone is tested on the exact same questions.
- Submit the Whole Car: Instead of just submitting a score, researchers should submit their entire "car" (code + environment) so judges can drive it themselves to see if it actually works.
In short: The current leaderboard is like a race where some runners memorized the track, some lost their shoes, and the finish line keeps moving. We need to clean up the track and check the shoes before we crown the winners.