BiomniBench: Process-level Evaluation of LLM Agents for… — Plain-Language Explanation

Published 2026-05-18

Preprint available on bioRxiv.

Original authors: Qu, Y., Lu, Y., Tu, X., Zhang, S., She, T., Shaw, A. G., Shih, J.-H., Zhao, B., Shen, M., Yang, H., Yan, J., Zhang, R., Wu, X., Li, T., Zhou, B., Wang, N., Ma, A., Cong, L., Hu, X., Jiang, Y., Dong, J., Peng, T., Leskovec, J., Huang, K.

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are hiring a team of junior scientists to solve a complex puzzle based on a famous, real-world medical discovery. In the past, to see if they did a good job, you would only look at their final answer. If they got the right number, you gave them a gold star. If they got it wrong, you gave them a red X.

The paper argues that this "final answer only" approach is broken for two main reasons:

The Lucky Guess: A student might get the right answer not because they understood the science, but because they memorized the solution, cheated, or just guessed correctly by accident.
The Wrong Path: A student might use a brilliant, valid, and creative way to solve the problem that is different from the teacher's specific method. Under the old rules, they would get a red X just because their path didn't match the textbook exactly.

To fix this, the authors created BiomniBench. Think of this not as a final exam, but as a detailed video review of the student's entire thought process. Instead of just checking the final score, they watch the whole movie of how the AI agent worked. They use a special "rubric" (a checklist) designed by real human experts to grade every step the AI took, ensuring it actually understood the biology and didn't just guess.
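To make the contrast with final-answer grading concrete, here is a minimal Python sketch of rubric-based process scoring. It is only an illustration: the `RubricItem` class, the item descriptions, the weights, and the `process_score` function are hypothetical names invented for this example, and the paper's actual rubric format and scoring rule may differ.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One expert-written check on the agent's work (hypothetical structure)."""
    description: str
    weight: float      # relative importance assigned by the expert (assumed)
    satisfied: bool    # whether a grader judged this step as met


def final_answer_score(agent_answer: str, reference_answer: str) -> float:
    """Outcome-only grading: 1.0 for an exact match, otherwise 0.0."""
    return 1.0 if agent_answer.strip() == reference_answer.strip() else 0.0


def process_score(items: list[RubricItem]) -> float:
    """Weighted fraction of rubric items the agent satisfied."""
    total = sum(item.weight for item in items)
    if total == 0:
        raise ValueError("rubric has no weighted items")
    earned = sum(item.weight for item in items if item.satisfied)
    return earned / total


# Made-up example: the final answer matches, but the process is weak.
rubric = [
    RubricItem("Chose an appropriate statistical method", 2.0, False),
    RubricItem("Applied quality control before analysis", 1.0, True),
    RubricItem("Interpreted the result in biological terms", 1.0, False),
]

print(final_answer_score("42", "42"))  # 1.0
print(process_score(rubric))           # 0.25
```

In this toy case the outcome-only score is perfect while the process score is low, which is the "lucky guess" situation described above.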

What they tested:
They built a specific version called BiomniBench-DA, which is like a gym with 100 different workout stations. These stations cover 17 different types of data analysis, 5 different disease areas, and general biology. The "workouts" are based on real, high-stakes scientific papers from top journals like Nature, Cell, and Science. Crucially, the people who wrote the original papers (or experts who know them intimately) helped design these tests to make sure they are fair and accurate.

What they found:
They tested the smartest AI models available against this new system and discovered three big things:

The Smartest are Leading, but Still Learning: The most advanced AI models are doing the best, but they still have a long way to go before they are perfect.
The Tool Matters as Much as the Brain: The model's raw capability is not the only thing that counts. The "harness" (the software wrapper or tool used to run the AI) changes the results just as much as the model itself. It's like how a great driver can still crash in a broken car.
Specific Weaknesses: The AI agents consistently stumble in three areas: picking the right method to use, understanding what the biological results actually mean, and connecting the dots with true scientific reasoning.
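One way to surface weaknesses like these is to average rubric outcomes by category for each model and harness combination. The Python sketch below assumes a hypothetical record format (model, harness, category, passed). The configuration names and pass/fail values are placeholders invented for illustration, not results from the paper.

```python
from collections import defaultdict

# Placeholder records: (model, harness, rubric category, item passed?)
# These values are invented for illustration only.
records = [
    ("model-A", "harness-X", "method selection", False),
    ("model-A", "harness-X", "method selection", True),
    ("model-A", "harness-X", "data loading", True),
    ("model-A", "harness-Y", "method selection", False),
    ("model-A", "harness-Y", "data loading", True),
]


def pass_rates(rows):
    """Return {(model, harness, category): fraction of rubric items passed}."""
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for model, harness, category, passed in rows:
        key = (model, harness, category)
        counts[key][0] += int(passed)
        counts[key][1] += 1
    return {key: p / n for key, (p, n) in counts.items()}


rates = pass_rates(records)
for key in sorted(rates):
    print(key, rates[key])
```

Keeping the harness in the grouping key means the same model can be compared across different wrappers, and a low pass rate in one category points to a specific kind of failure rather than a single overall score.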

In short, BiomniBench is the first tool that lets us watch the AI's "thinking" in real-world medical research, revealing mistakes that a simple "right or wrong" score would completely miss.

BiomniBench: Process-level Evaluation of LLM Agents for Real-world Biomedical Research

Technical Summary: BiomniBench
