A Systematic Performance Evaluation of Three Large Language Models in Answering Questions on Moderate Hyperthermia

This study evaluates three large language models on moderate hyperthermia questions and finds that while their average performance is rated as "acceptable," a significant proportion of responses are of poor quality or potentially harmful, indicating they are not yet reliable for use without domain expertise.

Dennstaedt, F., Cihoric, N., Bachmann, N., Filchenko, I., Berclaz, L., Crezee, H., Curto, S., Ghadjar, P., Huebenthal, B., Hurwitz, M. D., Kok, P., Lindner, L. H., Marder, D., Molitoris, J., Notter, M., Rahman, S., Riesterer, O., Spalek, M., Trefna, H., Zilli, T., Rodrigues, D., Fuerstner, M., Stutz, E.

Published 2026-03-26

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a team of three incredibly smart, super-fast librarians. These librarians have read almost everything ever written on the internet. They are so good that if you ask them about general history, math, or even basic medicine, they can answer like a PhD professor.

Now, imagine you ask them a very specific, tricky question about moderate hyperthermia (a cancer treatment that uses controlled heat to help radiation work better). This is a very niche field, like asking a librarian to explain the specific rules of a game that only 500 people in the world play.

This paper is essentially a report card on how well these three "AI Librarians" (called DeepSeek, Llama, and GPT-4o) performed when asked 40 of these tricky questions. The "teachers" grading them were 19 real-world experts—doctors and physicists who actually use this heat therapy every day.

Here is the breakdown of what happened, using some simple analogies:

1. The Test: A "Pop Quiz" for AI

The researchers didn't just ask yes-or-no questions. They asked open-ended questions, like: "If a patient has a specific type of bone tumor and can't get chemotherapy, should we use heat therapy once a week or twice a week?"

They asked the three AI models to answer. Then, they hid the names of the AIs and showed the answers to the human experts. The experts gave them a grade from 1 (Very Bad) to 5 (Very Good) and also checked: "Is this answer dangerous if a real doctor followed it?"
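For readers who like to see the mechanics, here is a minimal Python sketch of what this kind of blinded evaluation pipeline could look like. Everything in it (model labels, questions, the rating structure) is an illustrative assumption, not code or data from the study.

```python
# Illustrative sketch of a blinded expert-rating setup (not the study's code).
import random

models = ["model_a", "model_b", "model_c"]  # stand-ins for the three LLMs
questions = ["Q1", "Q2", "Q3"]              # the study used 40 questions

# Gather one answer per model per question (answers stubbed out here).
answers = [
    {"question": q, "model": m, "text": f"answer from {m} to {q}"}
    for q in questions
    for m in models
]

# Blind the graders: shuffle the answers and strip the model identity,
# keeping a private key so scores can be re-attached to models later.
random.shuffle(answers)
key = {i: a["model"] for i, a in enumerate(answers)}
blinded = [
    {"id": i, "question": a["question"], "text": a["text"]}
    for i, a in enumerate(answers)
]

# Each expert then grades every blinded answer on a 1-5 scale and flags
# whether following it could be harmful, e.g.:
#   {"id": 0, "score": 3, "harmful": False}
```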

2. The Results: "Okay, but Don't Trust Them Yet"

The overall grade for all three AIs averaged about 3 out of 5.

  • The Translation: In school terms, this is a "C" or "Passable." It's not failing, but it's definitely not an "A."
  • The Problem: While the average was a "C," about 25% of the answers were actually "F" grades (rated as "bad" or "very bad").
  • The Danger: Even more worrying, about 1 in 6 answers (roughly 15–19%) were flagged by the experts as "potentially harmful."

The Analogy: Imagine you are building a house. You ask a robot to tell you how to pour the concrete. On average, the robot gives you a decent recipe. But 25% of the time, it tells you to mix in sand instead of cement. If you follow that bad advice, the house might collapse.
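To make those percentages concrete, here is a small Python sketch showing how the headline numbers (average score, share of bad answers, share of harmful answers) fall out of a pile of raw ratings. The ratings below are made up for illustration; the study's actual data came from 19 experts grading answers to 40 questions.

```python
# Illustrative summary statistics over made-up ratings (not the study's data).
from statistics import mean

ratings = [  # (score 1-5, flagged_harmful)
    (3, False), (4, False), (2, False), (5, False),
    (3, False), (3, False), (4, False), (1, True),
]

scores = [s for s, _ in ratings]
avg = mean(scores)                                          # the "C" average
share_bad = sum(s <= 2 for s in scores) / len(scores)       # "F" grades
share_harmful = sum(h for _, h in ratings) / len(ratings)   # harm flags

print(f"average score: {avg:.1f}/5")
print(f"rated bad or very bad: {share_bad:.0%}")
print(f"flagged potentially harmful: {share_harmful:.0%}")
```

The point of the exercise: an average of 3 can coexist with a sizeable tail of failing and dangerous answers, which is exactly the pattern the paper reports.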

3. The "Hallucination" Trap

The paper found that heat therapy is such a specialized field, with far less online data than everyday topics like treating a cold, that the AIs started to make things up.

  • The "Fake Reference" Metaphor: One AI gave a very confident answer, citing a famous study to back it up. The experts looked it up and realized the study didn't exist. The AI had "hallucinated" a fake source to sound smart.
  • The "Wrong Device" Metaphor: When asked to list the machines used for this treatment, the AIs couldn't give a complete list. They missed key machines that are actually used in hospitals. It's like asking a mechanic to list all the tools in a garage, and they forget the most important wrench.

4. Why Did They Struggle?

The experts explained that these AIs are like generalist students. They are great at general knowledge, but they haven't been trained enough on this specific, narrow subject.

  • The Data Gap: There is less high-quality, organized information about heat therapy on the internet compared to other cancer treatments. The AIs are trying to learn from a library that has missing pages.
  • The "Confident but Wrong" Issue: The AIs often sounded very sure of themselves, even when they were wrong. This is dangerous because a non-expert (like a patient or a new doctor) might think, "Wow, the AI sounds so smart, I'll trust it," and make a bad medical decision.

5. The Final Verdict

The authors conclude that you should not use these AI tools to make medical decisions about heat therapy right now.

  • Current Status: They are okay for getting a very rough idea of what the field is about (like reading a Wikipedia summary).
  • The Warning: If you use them for actual treatment planning without a human expert double-checking everything, you risk getting bad advice that could hurt a patient.

The Bottom Line:
Think of these AI models as interns who are very well-read but lack real-world experience in this specific department. They are eager to help, but they need a senior doctor (a human expert) to review every single thing they write before it goes to the patient. Until these models get more training on this specific topic, they are not ready to work alone.
