Performance Characteristics of Reasoning Large Language Models for Evidence Extraction from Clinical Genomics Literature

This study demonstrates that reasoning-capable large language models can achieve high accuracy in automating guideline-constrained PS4 evidence extraction from clinical genomics literature, though performance varies by model and prompt, supporting their use in a hybrid workflow with expert oversight.

Murugan, M., Yuan, B., Stephen, J., Gijavanekar, C., Xu, S., Kadirvel, S., Rivera-Munoz, E. A., Manita, V., Delca, F., Gibbs, R. A., Venner, E.

Published 2026-02-19

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to solve a massive medical mystery: Why do some people get sick with a specific genetic condition while others with the same genetic "typo" stay perfectly healthy?

To solve this, doctors need to find proof in thousands of old research papers. They are looking for a specific type of evidence called PS4. Think of PS4 as a "Gold Star" that says, "Hey, we found this genetic typo in 10 sick people, but zero healthy people. That's a strong clue!"
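For readers who like code, the "Gold Star" idea can be sketched as a tiny check. This is a toy illustration only: the real ACMG/AMP PS4 criterion involves statistical comparison against population controls and expert judgment, and the threshold of 10 is just borrowed from the example above.

```python
# Toy sketch of the intuition behind PS4-style evidence (NOT the real
# ACMG/AMP rule): a variant seen in several sick people but zero healthy
# controls is a much stronger clue than one seen in both groups.

def looks_like_strong_clue(affected_count: int, control_count: int,
                           min_affected: int = 10) -> bool:
    """Return True if the variant shows up in enough affected people
    and in no healthy controls (hypothetical threshold)."""
    return affected_count >= min_affected and control_count == 0

print(looks_like_strong_clue(10, 0))  # True: 10 sick carriers, 0 healthy
print(looks_like_strong_clue(10, 3))  # False: healthy carriers too
```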

The Problem: The Human Bottleneck

Right now, finding these "Gold Stars" is like trying to find a needle in a haystack, but the haystack is made of 275 different books (scientific papers), and the needles are hidden in tiny paragraphs. Human experts have to read every single word, count the sick people, check their family trees, and make sure the rules are followed. It's slow, exhausting, and creates a huge backlog.

The Experiment: Can AI Do the Reading?

The researchers asked: "Can smart computer brains (AI) do this reading for us?"

They didn't just use any AI; they tested the newest, most "reasoning-capable" models (think of them as the top students in the class who are good at logic, not just memorizing facts). They gave five different AI models a test:

  1. The Search: "Find the specific genetic typo in this paper."
  2. The Count: "How many sick people with this typo are mentioned here, and does it fit the strict rules?"

They compared the AI's answers against a "Truth Set"—a list of answers already graded by a team of human experts.
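The grading step can be sketched in a few lines of Python. The paper names, variant strings, and exact-match scoring here are illustrative assumptions, not the study's actual evaluation code:

```python
# Sketch of "grading the AI against a Truth Set": compare each model's
# answer for a paper with the expert-curated answer, then report the
# fraction that match. All names and values below are made up.

def accuracy(model_answers: dict, truth_set: dict) -> float:
    """Fraction of papers where the model's answer matches the experts'."""
    correct = sum(1 for paper, truth in truth_set.items()
                  if model_answers.get(paper) == truth)
    return correct / len(truth_set)

truth = {"paper_1": "c.123A>G", "paper_2": "c.456C>T", "paper_3": "c.789G>A"}
model = {"paper_1": "c.123A>G", "paper_2": "c.456C>T", "paper_3": "c.999T>C"}

print(f"{accuracy(model, truth):.0%}")  # 67%
```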

The Results: The AI Report Card

Here is how the AI students performed:

  • Finding the Needle (Variant Detection): The AI was amazing at this. It found the genetic typos in the text with 93% to 98% accuracy. It was like a super-powered metal detector that rarely missed a signal.
  • Counting the Gold Stars (PS4 Evidence): This was the tricky part. The AI had to not just count people, but understand why they were counted (e.g., "Are these people from the same family? Do they have the right symptoms?").
    • Top Performers: Gemini 2.5 Pro and GPT-5 were the valedictorians, getting the count right 90%+ of the time.
    • Middle Pack: o3 did well (86%), while o4-mini and Claude Sonnet 4 struggled a bit more (73-79%).

Where Did They Stumble?

The AI didn't fail because it couldn't read; it failed because it sometimes got confused by the rules.

  • The Metaphor: Imagine a robot trying to follow a recipe. It can perfectly chop the onions (find the text), but if the recipe says "only use onions from the red basket," the robot might accidentally grab a green one if it doesn't understand the nuance of the family history or the specific symptoms.
  • The Fix: The researchers tried changing the instructions (prompts) given to the AI. For most models, clearer instructions improved accuracy. However, for one model (Claude), the new instructions actually made it worse, showing that every AI has a different "personality" and needs to be talked to in different ways.

The Conclusion: The "Human-in-the-Loop" Team

The big takeaway isn't that AI will replace doctors tomorrow. Instead, think of it as a super-efficient intern.

  • The Workflow: The AI does the heavy lifting first. It scans the papers, finds the typos, and does a rough count of the sick people. It highlights the most promising evidence.
  • The Safety Net: A human expert then steps in to double-check the AI's work, especially for the tricky parts where the rules are complex.
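That two-step teamwork can be sketched as a simple triage function. The field names and the 0.9 confidence cutoff are hypothetical choices for illustration; the study does not publish this code:

```python
# Minimal sketch of a "human-in-the-loop" workflow: the AI drafts candidate
# evidence, and anything below a confidence cutoff (or failing a basic
# sanity check) is routed to a human expert instead of being auto-accepted.

def triage(candidates: list[dict], review_threshold: float = 0.9):
    """Split AI-extracted evidence into auto-accept and expert-review queues."""
    accepted, needs_review = [], []
    for c in candidates:
        if c["confidence"] >= review_threshold and c["affected_count"] > 0:
            accepted.append(c)
        else:
            needs_review.append(c)  # a human expert double-checks these
    return accepted, needs_review

candidates = [
    {"variant": "c.123A>G", "affected_count": 12, "confidence": 0.97},
    {"variant": "c.456C>T", "affected_count": 4,  "confidence": 0.62},
]
auto, review = triage(candidates)
print(len(auto), len(review))  # 1 1
```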

In short: These reasoning AI models are powerful tools that can speed up the process of finding genetic clues by a huge margin, but they still need a human supervisor to catch the subtle mistakes. It's a team sport, not a solo act.
