This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are trying to build a massive, life-saving library of knowledge about how different medicines work. For years, scientists have been writing these books by hand, running expensive lab tests, and doing complex math. It's slow, costly, and exhausting.
Recently, a new tool has arrived: Large Language Models (LLMs). Think of these as super-smart, hyper-fast librarians who have read almost every book, paper, and website ever written. They can summarize information, guess connections, and even write new stories in seconds.
But here's the problem: We don't know if these AI librarians are actually good at the job, or if they are just making things up. In the world of medicine, a "made-up" fact could be dangerous.
This paper, titled "DrugPlayGround," is like a giant, rigorous test drive for these AI librarians. The authors built a playground to see which AI models are the best at understanding drugs, which ones are reliable, and where they might trip up.
Here is a breakdown of what they did, using some everyday analogies:
1. The "Description" Test: Can the AI Write a Good Resume?
First, the researchers asked the AI to write a "resume" (a text description) for various drugs. They wanted to see if the AI could accurately list a drug's weight, how it works, and its chemical makeup.
- The Analogy: Imagine asking five different students to write a biography of a famous actor. You want to know: Did they get the facts right? Did they miss the key points? Did they invent a fake movie the actor never starred in?
- The Findings:
- The Star Student: GPT-4o was the best at writing accurate, high-quality descriptions. It was like the student who actually studied the source material.
- The "Temperature" Knob: The researchers turned a dial called "temperature" (which controls how creative or random the AI is). They found that for writing facts, low creativity (low temperature) is best. If you turn the dial up too high, the AI starts getting "drunk" on its own creativity and makes up facts.
- The Prompt Matters: How you ask the question changes the answer. Giving the AI a specific "persona" (e.g., "You are a pharmaceutical expert") worked better than just asking "Tell me about this drug."
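The "temperature" knob has a precise meaning: it rescales the model's raw scores before they are turned into probabilities for the next word. The toy Python sketch below (not from the paper; the scores are invented for illustration) shows why a low temperature makes the model stick to its most confident answer, while a high one spreads probability onto unlikely, possibly made-up options.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw model scores (logits) into probabilities.

    Lower temperature sharpens the distribution (the model almost
    always picks its top choice); higher temperature flattens it,
    so riskier, more 'creative' picks become likely.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next words
logits = [4.0, 2.0, 1.0]

print(softmax_with_temperature(logits, 0.2))  # sharp: the top word dominates
print(softmax_with_temperature(logits, 2.0))  # flat: unlikely words gain probability
```

This is why "low creativity" helps with factual writing: at low temperature the model almost deterministically repeats what it is most confident about, instead of sampling from the long tail.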
2. The "Translation" Test: Can the AI Speak "Drug Language"?
AI doesn't just read words; it turns them into numbers (called embeddings) so computers can do math with them. Think of this as translating a drug's story into a secret code that a computer can understand.
- The Analogy: Imagine you have a dictionary that translates "Drug A" into a specific set of coordinates on a map. If two drugs are similar, their coordinates should be close together. The researchers tested if different AI models create accurate maps.
- The Findings:
- Some models created maps where similar drugs were far apart (bad maps).
- Others, like Gemini and Mistral, created very accurate maps.
- Crucial Insight: The best model for one job wasn't always the best for another. It's like having a Swiss Army knife: one blade is great for cutting rope, but another is better for screwing in a bolt. You need the right tool for the specific task.
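The "coordinates on a map" idea can be made concrete with cosine similarity, a standard way to measure how close two embedding vectors point. The sketch below is illustrative only: the drug names and three-dimensional vectors are made up (real embeddings have hundreds or thousands of dimensions), but the measurement is the same one used to judge whether a model's "map" places similar drugs close together.

```python
import math

def cosine_similarity(a, b):
    """How aligned two embedding vectors are (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings; the numbers are invented for illustration.
aspirin    = [0.9, 0.1, 0.2]
ibuprofen  = [0.8, 0.2, 0.3]   # a similar anti-inflammatory drug
penicillin = [0.1, 0.9, 0.1]   # a very different mechanism

print(cosine_similarity(aspirin, ibuprofen))   # high: close on the map
print(cosine_similarity(aspirin, penicillin))  # low: far apart
```

A "good map" is one where pairs like aspirin/ibuprofen score high and unrelated pairs score low; a "bad map" scrambles those distances.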
3. The "Prediction" Games: Can the AI Guess the Future?
The researchers put the AI's "maps" (embeddings) to work in three real-world scenarios:
Scenario A: The Power Couple (Drug Synergy)
- The Task: Predict if Drug A + Drug B work better together than alone.
- The Analogy: Like predicting if a specific cheese and wine pairing tastes amazing.
- The Result: The AI was surprisingly good at this, often beating traditional methods. However, it struggled when the "cell" (the environment) was messy and chaotic. If the biological system is too noisy, the AI gets confused.
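One common way to set up a pairing task like this (a generic sketch, not the paper's actual model) is to merge the two drugs' embeddings into a single feature vector and score it with a trained model. The combination below is deliberately symmetric, since "A + B" and "B + A" are the same pair; the weights in `synergy_score` are hypothetical stand-ins for whatever a real model would learn from known drug pairs.

```python
def pair_features(emb_a, emb_b, emb_cell):
    """Turn a (drug A, drug B, cell context) triple into one feature vector.

    The element-wise sum and product are both symmetric, so swapping
    the two drugs produces the same features.
    """
    summed = [a + b for a, b in zip(emb_a, emb_b)]
    product = [a * b for a, b in zip(emb_a, emb_b)]
    return summed + product + emb_cell

def synergy_score(features, weights, bias):
    """Linear score: positive suggests synergy, negative suggests not.

    In practice the weights and bias would come from training on
    drug pairs with known outcomes; here they are placeholders.
    """
    return sum(w * f for w, f in zip(weights, features)) + bias
```

The "messy cell" finding fits this picture: if the cell-context part of the feature vector is noisy, even a well-trained scorer has little reliable signal to work with.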
Scenario B: The Lock and Key (Drug-Target Interaction)
- The Task: Predict if a drug will stick to a specific protein in the body.
- The Analogy: Will this key fit into this lock?
- The Result: The AI did well, but it relied heavily on the text description of the drug. If the description was vague, the AI couldn't tell whether the key would fit the lock. This showed that good data is more important than a fancy model.
Scenario C: The Ripple Effect (Drug Perturbation)
- The Task: Predict how a drug changes the activity of thousands of genes in a cell.
- The Analogy: Dropping a stone in a pond and predicting exactly how the ripples will move.
- The Result: The AI could predict these changes better than older methods, but only if the drug description included rich biological context (e.g., "This is an antibiotic that kills bacteria"). If the description was just dry chemical facts, the AI failed to see the bigger picture.
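All three scenarios rest on the same assumption: drugs that sit close together on the embedding map should behave similarly. The sketch below (my own nearest-neighbor illustration, not the paper's method, with invented numbers) predicts a new drug's per-gene effects by averaging the measured effects of its most similar known drugs.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def predict_effect(new_embedding, known_drugs, k=2):
    """Guess a new drug's per-gene effects by averaging the measured
    effects of its k nearest neighbors in embedding space.

    known_drugs: list of (embedding, effect_vector) pairs, where the
    effect vector holds per-gene activity changes.
    """
    ranked = sorted(known_drugs,
                    key=lambda d: cosine_similarity(new_embedding, d[0]),
                    reverse=True)
    neighbors = ranked[:k]
    n_genes = len(neighbors[0][1])
    # average the neighbors' measured effects, gene by gene
    return [sum(d[1][g] for d in neighbors) / k for g in range(n_genes)]

# Toy data: two drugs with similar embeddings and effects, plus one
# unrelated drug (all numbers invented for illustration).
known = [
    ([1.0, 0.0], [2.0, -1.0]),
    ([0.9, 0.1], [1.0, -0.5]),
    ([0.0, 1.0], [-3.0, 4.0]),
]
print(predict_effect([1.0, 0.05], known))  # pulled toward the two similar drugs
```

This also explains the "rich biological context" finding: a description like "an antibiotic that kills bacteria" places the drug near other antibiotics on the map, so the neighbors it borrows predictions from are actually relevant.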
The Big Takeaways (The "Cheat Sheet")
- AI is Powerful, but Flawed: These models are amazing at summarizing vast amounts of knowledge, but they can "hallucinate" (make up facts like wrong molecular weights). You can't just trust them blindly; you need a human expert to double-check.
- One Size Does Not Fit All: There is no single "best" AI for drug discovery.
- Need a perfect description? Use GPT-4o.
- Need to predict drug combinations? Use Gemini.
- Need to predict gene changes? Use Qwen or Mistral.
- Garbage In, Garbage Out: The quality of the AI's prediction depends entirely on the quality of the text description you give it. If you feed it vague info, it gives you vague answers.
- The Human Element: The paper emphasizes that AI shouldn't replace scientists. Instead, it should be a co-pilot. The AI does the heavy lifting of sorting data, and the human chemist provides the final "sanity check" and deep reasoning.
In Summary
DrugPlayGround is a reality check for the hype around AI in medicine. It tells us: "Yes, these tools are incredibly promising and can speed up drug discovery, but we need to be smart about how we use them. We need to pick the right model for the right job, keep the temperature low to avoid lies, and always have a human expert in the loop to make sure the medicine is safe."