Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

This paper introduces the formally verified Budget-Sensitive Discovery Score (BSDS) and its summary metric DQS to evaluate AI-guided scientific selection under budget constraints, demonstrating through extensive drug discovery experiments that current LLM configurations provide no marginal value over a simple Random Forest baseline.

Abhinaba Basu, Pavan Chakraborty

Published 2026-03-16

Imagine you are a treasure hunter with a very limited budget. You have a map (a dataset of 40,000 potential locations), but you can only dig in 500 spots before your money runs out. Your goal is to find as many real gold coins (active drugs) as possible while avoiding digging up useless rocks (false alarms).

This paper is about building a better ruler to measure how good your treasure-hunting strategy is, and then using that ruler to test if AI Chatbots (LLMs) are actually helping you find gold better than your old, trusted compass.

Here is the breakdown in simple terms:

1. The Problem: The Old Rulers Were Broken

In the past, scientists used standard rulers like "Accuracy" or "Area Under the Curve" to judge their treasure hunters.

  • The Flaw: These rulers measure performance across all 40,000 spots, even though you can only dig in 500. It's like judging a chef on how well they cooked a banquet for 1,000 people when you only have money to feed 5.
  • The Real Issue: In real life, digging up a rock costs money (wasted experiment), and missing a gold coin costs opportunity (lost discovery). You need a ruler that cares about budget and mistakes.

2. The Solution: The "Budget-Sensitive Discovery Score" (BSDS)

The authors invented a new, super-strict ruler called BSDS (and its summary version, DQS).

  • How it works: It's a scorecard that penalizes you in three ways (a toy version is sketched in code after this list):
    1. Missing Gold: If you don't pick enough real treasures.
    2. Digging Rocks: If you pick spots that turn out to be empty (False Positives).
    3. Being Too Picky: If you refuse to dig anywhere unless you are 100% sure (Abstention), you miss out on potential gold.
  • The "Math Police": The coolest part? They didn't just guess this formula works. They used a computer program called Lean 4 (a digital math police officer) to prove with 20 formal theorems that this scorecard is mathematically perfect and cannot be tricked. It's "formally verified."

3. The Experiment: Do AI Chatbots Help?

The authors asked a burning question: "If we already have a smart, trained computer (a Random Forest model) that knows where to dig, does adding a fancy AI Chatbot (like ChatGPT or Claude) make us find more gold?"

They tested 39 different strategies:

  • The Old Guard: Simple, proven methods (like the Random Forest; a baseline of this shape is sketched in code after this list).
  • The New Kids: 28 different ways to use AI Chatbots (some just reading the chemical names, some trying to "rerank" the list, some using a few examples to learn).
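To ground the "Old Guard" baseline, here is a minimal sketch of budget-constrained selection with a Random Forest in scikit-learn. The synthetic features, the 500-dig budget, and the hyperparameters are placeholders for illustration; the paper's real featurization, dataset, and pipeline are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Placeholder data: rows are compounds, columns are precomputed
# chemical features; y marks known actives in the training set.
X_train = rng.random((2_000, 128))
y_train = rng.integers(0, 2, 2_000)
X_pool = rng.random((40_000, 128))   # the 40,000 candidate "spots"

BUDGET = 500                          # digs we can afford

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(X_train, y_train)

# Rank every candidate by predicted probability of being active,
# then spend the whole budget on the top-ranked spots.
scores = model.predict_proba(X_pool)[:, 1]
selected = np.argsort(scores)[::-1][:BUDGET]
```

Whatever the strategy, LLM or not, it is ultimately judged the same way: which 500 spots did it pick, and how many of them held gold.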

4. The Shocking Results

The results were a bit of a "reality check" for the AI hype:

  • The Winner: The simple, old-school Random Forest (let's call it "The Reliable Compass") won every time. It found the most gold with the least wasted money.
  • The Losers:
    • AI Chatbots (Zero-Shot): When you just asked the Chatbot to look at a chemical name and guess if it was gold, it performed worse than random chance. It was like asking a tourist who has never seen a map to find gold; they just guessed wildly.
    • AI Chatbots (Reranking): Even when you gave the Chatbot the Compass's list and asked it to "re-order" the spots, it actually made things worse. It added noise and confusion, pushing good spots down the list (a toy demonstration follows this list).
    • Complex AI: Trying to make the AI "think harder" (using few-shot examples or complex reasoning) didn't help enough to beat the simple Compass.
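As a toy demonstration of why reranking a strong list can backfire at a fixed budget, here is a self-contained sketch on synthetic data (this illustrates the failure mode in general; it is not the paper's actual experiment):

```python
import numpy as np

def hits_at_budget(ranking, is_gold, budget=500):
    """Count true actives among the top-`budget` entries of a ranking."""
    return int(is_gold[np.asarray(ranking)[:budget]].sum())

rng = np.random.default_rng(1)
is_gold = rng.random(40_000) < 0.02               # ~2% of spots hold gold
rf_ranking = np.argsort(~is_gold, kind="stable")  # a deliberately strong ranking
llm_ranking = rf_ranking.copy()
rng.shuffle(llm_ranking[:2_000])                  # noisy "reranking" of the top

print(hits_at_budget(rf_ranking, is_gold))   # near-perfect: ~500 hits
print(hits_at_budget(llm_ranking, is_gold))  # noticeably fewer hits
```

When the original ranking is already close to optimal, any noise a reranker injects can only push gold below the budget cutoff.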

5. The Big Takeaway: "Don't Fix What Isn't Broken"

The paper concludes that for this specific type of scientific discovery (finding drug candidates):

  • Specialized, trained models (like the Random Forest) are currently superior to general-purpose AI Chatbots.
  • The Chatbots are great at writing poems or summarizing text, but they aren't great at the specific, high-stakes math of predicting chemical activity from scratch.
  • The Framework is the Hero: Even though the Chatbots lost, the BSDS scorecard is the real star. It proved why they lost by showing exactly where they failed (too many false alarms or missing too much gold).

The Analogy Summary

Imagine you are hiring a team to find the best apples in an orchard.

  • Old Method: You hire a team of experts who have picked apples in this orchard for 10 years. They know exactly which trees have the best fruit.
  • New Method: You hire a famous, smart generalist (the AI Chatbot) who has read every book about apples but has never been to this orchard.
  • The Test: You give the generalist a list of trees the experts picked and ask them to reorder the list.
  • The Result: The generalist messes it up. The experts' original list was already the best. The new ruler (BSDS) proved that the generalist's "confidence" was actually just noise.

In short: We built a mathematically perfect ruler to measure scientific discovery. We used it to test AI, and found that for now, simple, specialized tools are still beating the fancy, general-purpose AI chatbots in the high-stakes game of drug discovery.
