MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation

This paper audits and corrects errors in the MedCalc-Bench dataset, demonstrating that providing models with calculator specifications ("open-book" prompting) significantly outperforms current state-of-the-art methods, thereby revealing that the benchmark primarily measures formula memorization and arithmetic precision rather than clinical reasoning.

Artus Krohn-Grimberghe

Published 2026-03-04
📖 4 min read · ☕ Coffee break read

The Big Idea: The "Open-Book" Surprise

Imagine you are taking a difficult math test in a hospital. The test asks you to calculate a patient's risk score based on their blood work.

The Old Way (Closed-Book):
For years, researchers have been testing AI doctors by giving them these problems without letting them look up the formulas. It's like asking a student to memorize the entire periodic table and then solve complex chemistry equations on the spot.

  • The Result: Even the smartest AI models were failing miserably, getting only about 35% of the answers right. The leaderboard looked like a disaster zone.

The New Discovery:
This paper argues that the test was unfair. In real life, doctors don't memorize formulas; they use calculators or apps. They look up the rules, plug in the numbers, and get the answer.

The authors tried a simple trick: They gave the AI the "cheat sheet" (the formula and rules) right in the prompt.

  • The Result: Suddenly, the AI's score jumped from 35% to over 85%. It didn't need to be retrained or upgraded; it just needed to be allowed to "open the book."

The Three Main Findings (The "Plot Twist")

1. The Test Was Broken (The "Typos in the Answer Key")

Before testing the AI, the authors audited the test itself. They found that the "official" calculator code used to grade the answers was full of mistakes.

  • The Analogy: Imagine a teacher grading a math test, but the answer key has typos. If the student writes "10" but the key says "100" because of a typo, the student gets marked wrong.
  • What they found: They fixed over 20 errors in the test code. Some formulas had the wrong numbers, some had missing steps, and some had broken file paths. The "gold standard" wasn't actually gold; it was rusty.
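
To make the "typos in the answer key" idea concrete, here is a hypothetical example of the kind of bug such an audit catches: a single mistyped constant in the reference code. This is purely an illustration, not one of the actual errors fixed in the paper; the formula shown (Devine ideal body weight) and the specific typo are assumptions made for the sake of the example.

```python
# Hypothetical illustration of an "answer key" bug: a wrong constant in the
# grading code. NOT one of the paper's documented errors; it only shows why a
# buggy reference implementation marks correct answers as wrong.

def ideal_body_weight_buggy(height_in: float, male: bool) -> float:
    # Devine formula, but with the per-inch increment mistyped as 3.3 kg.
    base = 50.0 if male else 45.5
    return base + 3.3 * (height_in - 60)   # <-- wrong constant

def ideal_body_weight_fixed(height_in: float, male: bool) -> float:
    # Devine formula: 50 kg (men) / 45.5 kg (women) + 2.3 kg per inch over 5 ft.
    base = 50.0 if male else 45.5
    return base + 2.3 * (height_in - 60)

# A model that answers 61.5 kg for a 65-inch man is graded "wrong" against the
# buggy key (66.5 kg), even though it applied the published formula correctly.
print(ideal_body_weight_buggy(65, male=True))  # 66.5
print(ideal_body_weight_fixed(65, male=True))  # 61.5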

2. The "Open-Book" Intervention

The authors realized the AI wasn't bad at clinical reasoning (figuring out what the patient needs); it was just bad at recalling the exact formula for each specific calculator from memory.

  • The Analogy: It's like asking a chef to bake a cake without a recipe. They might know how to bake, but if they forget the exact ratio of sugar to flour, the cake fails. If you hand them the recipe card, they bake a perfect cake every time.
  • The Outcome: By simply pasting the recipe (the calculator specification) into the AI's instructions, the AI outperformed all the complex, expensive, "reinforcement learning" systems that had been trained for months.
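
Here is roughly what "pasting the recipe into the instructions" could look like in code. It is a minimal sketch: the spec text, the prompt wording, and the commented-out model call are illustrative assumptions, not the paper's exact prompt format.

```python
# Minimal sketch of "open-book" prompting: the calculator specification is
# supplied in the prompt, so the model only has to read the note and apply it.
# Spec text and prompt wording are assumptions, not the paper's exact setup.

CORRECTED_CALCIUM_SPEC = """
Calculator: Albumin-corrected calcium
Inputs: total serum calcium (mg/dL), serum albumin (g/dL)
Formula: corrected Ca = measured Ca + 0.8 * (4.0 - albumin)
Rounding: report to one decimal place
"""

def build_open_book_prompt(patient_note: str, spec: str) -> str:
    """Combine the calculator spec ("the recipe") with the patient note."""
    return (
        "You are given the full specification of a medical calculator.\n"
        f"--- CALCULATOR SPECIFICATION ---{spec}\n"
        f"--- PATIENT NOTE ---\n{patient_note}\n\n"
        "Extract the required inputs from the note, apply the formula step by "
        "step, and report the final value."
    )

prompt = build_open_book_prompt(
    "62-year-old woman admitted with ... calcium 8.2 mg/dL, albumin 2.5 g/dL ...",
    CORRECTED_CALCIUM_SPEC,
)
# answer = your_llm_client.complete(prompt)  # any chat/completions API works here
```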

3. The Ceiling is Much Higher

The authors then asked: "If we give the AI the recipe, fix the broken answer key, and use the smartest AI available, how perfect can it get?"

  • The Result: They found that the AI could get 95–97% of the answers right. The few mistakes left weren't because the AI was "dumb"; they were because the patient's story was vague or the test data itself was confusing.

Why Does This Matter? (The "So What?")

The paper concludes that MedCalc-Bench isn't actually testing if an AI can be a good doctor.

  • What it was testing: Can the AI memorize a specific math equation and do decimal math perfectly without making a tiny rounding error?
  • What it should test: Can the AI read a patient's chart, find the right numbers (like blood pressure or age), and use a tool to calculate the risk?

The Final Verdict:
The authors suggest we stop treating these calculator problems as a memory test. Instead, we should treat them as a tool-use test.

  • Old Mindset: "Show me you know the formula."
  • New Mindset: "Here is the formula. Now, look at this patient's chart, find the right numbers, and tell me the result."
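
A sketch of what this "new mindset" could look like as an evaluation harness, with the same caveats as above: the function names, the toy extraction step, and the scoring tolerance are hypothetical, and the corrected-calcium tool is just a convenient running example.

```python
# Sketch of a tool-use framing: the model's job is to read the chart, extract
# the right inputs, and hand them to a trusted calculator tool. The arithmetic
# is delegated to the tool, so recall of the formula is never tested.

def corrected_calcium(calcium_mg_dl: float, albumin_g_dl: float) -> float:
    """Trusted calculator tool: albumin-corrected calcium."""
    return calcium_mg_dl + 0.8 * (4.0 - albumin_g_dl)

def evaluate_tool_use(model_extract, patient_note: str, expected: float) -> bool:
    """Score the model on extraction + tool use, not on formula memorization."""
    inputs = model_extract(patient_note)      # e.g. an LLM call returning a dict
    result = corrected_calcium(**inputs)
    return abs(result - expected) < 0.1       # hypothetical tolerance

# Example: the model only has to find calcium 8.2 mg/dL and albumin 2.5 g/dL
# in the note; the tool produces 9.4 mg/dL.
fake_extract = lambda note: {"calcium_mg_dl": 8.2, "albumin_g_dl": 2.5}
print(evaluate_tool_use(fake_extract, "... calcium 8.2, albumin 2.5 ...", expected=9.4))  # True
```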

The "Solo Researcher" Superpower

One last interesting note: The entire study was done by one person over a few weekends.

  • The Analogy: In the past, auditing a dataset like this would have required a team of 10 people working for months.
  • The Secret: The researcher used a "swarm" of different AI models to do the heavy lifting. One AI wrote the code, another checked the medical facts, another searched the internet, and another wrote the paper. It's like having a full research lab in your pocket for the price of a coffee.

Summary

The paper says: "Stop testing AI on memory. Start testing them on how well they use tools. And by the way, the test we were using was broken, and we fixed it."
