Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

This paper introduces the formally verified Budget-Sensitive Discovery Score (BSDS) and its summary metric DQS to evaluate AI-guided scientific selection under budget constraints, demonstrating through extensive drug discovery experiments that current LLM configurations provide no marginal value over a simple Random Forest baseline.

Abhinaba Basu, Pavan Chakraborty

Published 2026-03-16

Imagine you are a treasure hunter with a very limited budget. You have a map (a dataset of 40,000 potential locations), but you can only dig in 500 spots before your money runs out. Your goal is to find as many real gold coins (active drugs) as possible while avoiding digging up useless rocks (false alarms).

This paper is about building a better ruler to measure how good your treasure-hunting strategy is, and then using that ruler to test if AI Chatbots (LLMs) are actually helping you find gold better than your old, trusted compass.

Here is the breakdown in simple terms:

1. The Problem: The Old Rulers Were Broken

In the past, scientists used standard rulers like "Accuracy" or "Area Under the Curve" to judge their treasure hunters.

  • The Flaw: These rulers measure performance across all 40,000 spots, even though you can only dig in 500. It's like judging a chef on how well they cooked a banquet for 1,000 people when you only have money to feed 5.
  • The Real Issue: In real life, digging up a rock costs money (wasted experiment), and missing a gold coin costs opportunity (lost discovery). You need a ruler that cares about budget and mistakes.

2. The Solution: The "Budget-Sensitive Discovery Score" (BSDS)

The authors invented a new, super-strict ruler called BSDS (and its summary version, DQS).

  • How it works: It's a scorecard that penalizes you in three ways (a toy version is sketched in code after this list):
    1. Missing Gold: If you don't pick enough real treasures.
    2. Digging Rocks: If you pick spots that turn out to be empty (False Positives).
    3. Being Too Picky: If you refuse to dig anywhere unless you are 100% sure (Abstention), you miss out on potential gold.
  • The "Math Police": The coolest part? They didn't just guess this formula works. They used a computer program called Lean 4 (a digital math police officer) to prove with 20 formal theorems that this scorecard is mathematically perfect and cannot be tricked. It's "formally verified."

3. The Experiment: Do AI Chatbots Help?

The authors asked a burning question: "If we already have a smart, trained computer (a Random Forest model) that knows where to dig, does adding a fancy AI Chatbot (like ChatGPT or Claude) make us find more gold?"

They tested 39 different strategies:

  • The Old Guard: Simple, proven methods (like the Random Forest; a baseline of this shape is sketched in code after this list).
  • The New Kids: 28 different ways to use AI Chatbots (some just reading the chemical names, some trying to "rerank" the list, some using a few examples to learn).
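To ground the "Old Guard" baseline, here is a minimal sketch of budget-constrained selection with a Random Forest in scikit-learn. The synthetic features, the 500-dig budget, and the hyperparameters are placeholders for illustration; the paper's real featurization, dataset, and pipeline are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Placeholder data: rows are compounds, columns are precomputed
# chemical features; y marks known actives in the training set.
X_train = rng.random((2_000, 128))
y_train = rng.integers(0, 2, 2_000)
X_pool = rng.random((40_000, 128))   # the 40,000 candidate "spots"

BUDGET = 500                          # digs we can afford

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

# Rank every candidate by predicted probability of being active,
# then spend the whole budget on the top-ranked spots.
scores = model.predict_proba(X_pool)[:, 1]
selected = np.argsort(scores)[::-1][:BUDGET]
```

Whatever the strategy, LLM or not, it is ultimately judged the same way: which 500 spots did it pick, and how many of them held gold.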

4. The Shocking Results

The results were a bit of a "reality check" for the AI hype:

  • The Winner: The simple, old-school Random Forest (let's call it "The Reliable Compass") won every time. It found the most gold with the least wasted money.
  • The Losers:
    • AI Chatbots (Zero-Shot): When you just asked the Chatbot to look at a chemical name and guess if it was gold, it performed worse than random chance. It was like asking a tourist who has never seen a map to find gold; they just guessed wildly.
    • AI Chatbots (Reranking): Even when you gave the Chatbot the Compass's list and asked it to "re-order" the spots, it actually made things worse. It added noise and confusion, pushing good spots down the list (a toy demonstration follows this list).
    • Complex AI: Trying to make the AI "think harder" (using few-shot examples or complex reasoning) didn't help enough to beat the simple Compass.
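As a toy demonstration of why reranking a strong list can backfire at a fixed budget, here is a self-contained sketch on synthetic data (this illustrates the failure mode in general; it is not the paper's actual experiment):

```python
import numpy as np

def hits_at_budget(ranking, is_gold, budget=500):
    """Count true actives among the top-`budget` entries of a ranking."""
    return int(is_gold[np.asarray(ranking)[:budget]].sum())

rng = np.random.default_rng(1)
is_gold = rng.random(40_000) < 0.02               # ~2% of spots hold gold
rf_ranking = np.argsort(~is_gold, kind="stable")  # a deliberately strong ranking
llm_ranking = rf_ranking.copy()
rng.shuffle(llm_ranking[:2_000])                  # noisy "reranking" of the top

print(hits_at_budget(rf_ranking, is_gold))   # near-perfect: ~500 hits
print(hits_at_budget(llm_ranking, is_gold))  # noticeably fewer hits
```

When the original ranking is already close to optimal, any noise a reranker injects can only push gold below the budget cutoff.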

5. The Big Takeaway: "Don't Fix What Isn't Broken"

The paper concludes that for this specific type of scientific discovery (finding drug candidates):

  • Specialized, trained models (like the Random Forest) are currently superior to general-purpose AI Chatbots.
  • The Chatbots are great at writing poems or summarizing text, but they aren't great at the specific, high-stakes math of predicting chemical activity from scratch.
  • The Framework is the Hero: Even though the Chatbots lost, the BSDS scorecard is the real star. It proved why they lost by showing exactly where they failed (too many false alarms or missing too much gold).

The Analogy Summary

Imagine you are hiring a team to find the best apples in an orchard.

  • Old Method: You hire a team of experts who have picked apples in this orchard for 10 years. They know exactly which trees have the best fruit.
  • New Method: You hire a famous, smart generalist (the AI Chatbot) who has read every book about apples but has never been to this orchard.
  • The Test: You give the generalist a list of trees the experts picked and ask them to reorder the list.
  • The Result: The generalist messes it up. The experts' original list was already the best. The new ruler (BSDS) proved that the generalist's "confidence" was actually just noise.

In short: We built a mathematically perfect ruler to measure scientific discovery. We used it to test AI, and found that for now, simple, specialized tools are still beating the fancy, general-purpose AI chatbots in the high-stakes game of drug discovery.
