Imagine you are a detective trying to solve a mystery: Did a specific medicine cause a patient's bad reaction, or was it just a coincidence?
In the world of medicine, this is called Causality Assessment. It's a crucial job. If a drug is dangerous, we need to know so we can warn the public. If it's safe, we don't want to panic unnecessarily.
Traditionally, this job is done by human experts (like senior doctors and pharmacists) who read through messy, incomplete patient reports and ask themselves a checklist of questions: Did the reaction happen right after taking the pill? Did it stop when they stopped the pill? Is there another explanation?
This paper is about a new experiment: Can we teach a super-smart computer (an AI) to be this detective?
The Experiment: The AI Detectives
The researchers didn't just use any computer. They used Biomedical Large Language Models (LLMs). Think of these as "detectives who have read every medical textbook in the library," rather than general AI that has read the entire internet (which includes a lot of nonsense).
They tested five different detective teams, each a combination of three choices:
- The AI Model: Three different "brains" (TinyLlama, Medicine LLaMA-3, and MedLLaMA).
- The Prompt Strategy: How they asked the AI to think. Did they say, "Just guess"? Or "Think step-by-step like a human" (Chain-of-Thought)? Or "Break the problem into tiny pieces" (Decomposition)? (See the sketch after this list.)
- The Rulebook: Two different official checklists used by real doctors (The Naranjo Scale and the WHO-UMC method).
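Before looking at the cases themselves, here is what those three prompt strategies might look like in practice. This is a minimal Python sketch, not the paper's actual prompts; the case text and wording are illustrative assumptions.

```python
# Illustrative prompt templates for the three strategies above.
# These are NOT the paper's actual prompts; the case text and wording
# are made-up assumptions to show the shape of each strategy.

CASE = ("A 54-year-old patient developed a rash two days after starting "
        "drug X; the rash faded within a week of stopping it.")

# 1. Direct: just ask for the verdict.
direct = (f"Case: {CASE}\n"
          "Classify the causality as definite, probable, possible, or doubtful.")

# 2. Chain-of-Thought: ask the model to reason step by step first.
chain_of_thought = (f"Case: {CASE}\n"
                    "Work through each Naranjo question step by step, "
                    "then give the final causality category.")

# 3. Decomposition: one small sub-prompt per checklist question,
#    answered separately and combined afterwards.
decomposition = [
    f"Case: {CASE}\nDid the reaction appear after the drug was given? (yes/no/unknown)",
    f"Case: {CASE}\nDid the reaction improve when the drug was stopped? (yes/no/unknown)",
    f"Case: {CASE}\nIs there an alternative cause for the reaction? (yes/no/unknown)",
]
```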
They gave these AI teams 150 real-life mystery cases (some involving ordinary drugs, some involving vaccines) and asked them to solve each one. Then they compared the AI's answers to those given by two real human experts.
The Results: The Good, The Bad, and The "Almost"
1. The "Medical School" Advantage
The AI models trained specifically on medical books performed much better than general AI models. It's like comparing a detective who has read the Encyclopedia of Crime to one who has only read Wikipedia. The medical AI got about 64% agreement with the human experts. That's a big jump from the 34% agreement seen with general AI in previous studies.
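For readers curious about the arithmetic behind that 64%, here is a minimal sketch of a percent-agreement calculation. The labels and data are made up, and the paper may additionally report chance-corrected statistics; this just shows the headline idea.

```python
# A made-up example of the "% agreement" arithmetic: in what fraction of
# cases did the AI pick the same causality category as the human expert?

def percent_agreement(ai_labels, expert_labels):
    matches = sum(a == e for a, e in zip(ai_labels, expert_labels))
    return matches / len(expert_labels)

expert = ["probable", "possible", "doubtful", "probable"]
ai     = ["probable", "possible", "probable", "probable"]
print(f"{percent_agreement(ai, expert):.0%}")  # 75% -- 3 of 4 verdicts match
```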
2. The "Step-by-Step" vs. The "Story"
The AI did surprisingly well when using the Naranjo Rulebook. This rulebook is like a math test: "Question 1: Yes/No. Question 2: Yes/No. Add up the points." The AI was good at following these rigid steps.
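To see why this checklist suits a computer so well, here is a minimal sketch of the Naranjo scoring logic. The question texts are abbreviated paraphrases, but the yes/no/unknown point values and the score cut-offs follow the published Naranjo algorithm.

```python
# A minimal sketch of the Naranjo "add up the points" logic. Question
# texts are abbreviated here; the (yes, no, unknown) weights and the
# score cut-offs follow the published Naranjo algorithm.

NARANJO = {                                # (yes, no, unknown)
    "previous conclusive reports":        ( 1,  0, 0),
    "event appeared after the drug":      ( 2, -1, 0),
    "improved when drug was stopped":     ( 1,  0, 0),
    "reappeared on rechallenge":          ( 2, -1, 0),
    "alternative causes exist":           (-1,  2, 0),  # "no" ADDS points
    "reaction reappeared on placebo":     (-1,  1, 0),
    "toxic drug level in blood":          ( 1,  0, 0),
    "worse at higher dose":               ( 1,  0, 0),
    "similar reaction to drug before":    ( 1,  0, 0),
    "confirmed by objective evidence":    ( 1,  0, 0),
}

def naranjo_verdict(answers):
    """answers: question -> 'yes' | 'no' | 'unknown'."""
    col = {"yes": 0, "no": 1, "unknown": 2}
    score = sum(NARANJO[q][col[a]] for q, a in answers.items())
    if score >= 9:  return score, "definite"
    if score >= 5:  return score, "probable"
    if score >= 1:  return score, "possible"
    return score, "doubtful"

# Example: reaction followed the drug (+2), improved on stopping (+1),
# and no alternative cause was found (+2); all else unknown (0).
answers = {q: "unknown" for q in NARANJO}
answers["event appeared after the drug"] = "yes"
answers["improved when drug was stopped"] = "yes"
answers["alternative causes exist"] = "no"
print(naranjo_verdict(answers))   # (5, 'probable')
```

Every step is a fixed lookup plus a sum, so the model only has to answer yes, no, or unknown for each question; no free-form storytelling is required.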
However, the AI struggled with the WHO Rulebook. This one is more like writing an essay or telling a story about the patient. The AI got confused, often giving answers that didn't match the human experts at all. It seems the AI is better at math than at writing a coherent medical story.
3. The "Hallucination" Problem
Even the best AI team made mistakes.
- The "Echo" Effect: Sometimes, the AI would just repeat the question back to you instead of answering it (like a parrot).
- The "Confident Wrong" Effect: If a patient report was missing information, the human expert would say, "I don't know." The AI, however, would confidently guess an answer. It's like a student taking a test who doesn't know the answer but writes a long, confident paragraph anyway.
- The "Missing Link": The AI was bad at checking if a side effect was officially listed in the drug's manual (because the AI didn't have access to that specific manual during the test) and was bad at spotting if another disease caused the problem.
The Big Takeaway
Can we replace human doctors with AI for this job yet? No.
The AI is a very fast, very knowledgeable intern, but it's not a Senior Detective yet.
- It's faster: It can read reports instantly.
- It's knowledgeable: It knows medical terms well.
- But it lacks judgment: It can't always tell the difference between a "maybe" and a "definitely," and it sometimes makes up reasons to support its answer.
The Future
The authors suggest that in the future, we shouldn't just ask the AI to "solve the case." Instead, we should use a hybrid approach:
- Let the AI do the heavy lifting (reading the reports, checking the facts).
- Let the human expert make the final call, treating the AI's output as a helpful first draft.
Think of it like GPS vs. a Human Driver. The GPS (AI) is great at knowing the roads and traffic rules, but the Human Driver (Expert) is still needed to make the final decision when the road gets muddy or the map is wrong.
In short: Biomedical AI is getting much better at being a medical detective, but it still needs a human partner to ensure the verdict is fair, accurate, and safe.