A Systematic Performance Evaluation of Three Large Language Models in Answering Questions on Moderate Hyperthermia

This study evaluates three large language models on moderate hyperthermia questions and finds that while their average performance is rated as "acceptable," a significant proportion of responses are of poor quality or potentially harmful, indicating they are not yet reliable for use without domain expertise.

Dennstaedt, F., Cihoric, N., Bachmann, N., Filchenko, I., Berclaz, L., Crezee, H., Curto, S., Ghadjar, P., Huebenthal, B., Hurwitz, M. D., Kok, P., Lindner, L. H., Marder, D., Molitoris, J., Notter, M., Rahman, S., Riesterer, O., Spalek, M., Trefna, H., Zilli, T., Rodrigues, D., Fuerstner, M., Stutz, E.

Published 2026-03-26

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a team of three incredibly smart, super-fast librarians. These librarians have read almost everything ever written on the internet. They are so good that if you ask them about general history, math, or even basic medicine, they can answer like a PhD professor.

Now, imagine you ask them a very specific, tricky question about moderate hyperthermia (a cancer treatment that uses controlled heat to help radiation work better). This is a very niche field, like asking a librarian to explain the specific rules of a game that only 500 people in the world play.

This paper is essentially a report card on how well these three "AI Librarians" (called DeepSeek, Llama, and GPT-4o) performed when asked 40 of these tricky questions. The "teachers" grading them were 19 real-world experts—doctors and physicists who actually use this heat therapy every day.

Here is the breakdown of what happened, using some simple analogies:

1. The Test: A "Pop Quiz" for AI

The researchers didn't just ask yes-or-no questions. They asked open-ended questions, like: "If a patient has a specific type of bone tumor and can't get chemotherapy, should we use heat therapy once a week or twice a week?"

They asked the three AI models to answer. Then, they hid the names of the AIs and showed the answers to the human experts. The experts gave them a grade from 1 (Very Bad) to 5 (Very Good) and also checked: "Is this answer dangerous if a real doctor followed it?"
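For readers who like to see the mechanics, here is a minimal Python sketch of what this kind of blinded evaluation pipeline could look like. Everything in it (model labels, questions, the rating structure) is an illustrative assumption, not code or data from the study.

```python
# Illustrative sketch of a blinded expert-rating setup (not the study's code).
import random

models = ["model_a", "model_b", "model_c"]  # stand-ins for the three LLMs
questions = ["Q1", "Q2", "Q3"]              # the study used 40 questions

# Gather one answer per model per question (answers stubbed out here).
answers = [
    {"question": q, "model": m, "text": f"answer from {m} to {q}"}
    for q in questions
    for m in models
]

# Blind the graders: shuffle the answers and strip the model identity,
# keeping a private key so scores can be re-attached to models later.
random.shuffle(answers)
key = {i: a["model"] for i, a in enumerate(answers)}
blinded = [
    {"id": i, "question": a["question"], "text": a["text"]}
    for i, a in enumerate(answers)
]

# Each expert then grades every blinded answer on a 1-5 scale and flags
# whether following it could be harmful, e.g.:
#   {"id": 0, "score": 3, "harmful": False}
```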

2. The Results: "Okay, but Don't Trust Them Yet"

The overall grade for all three AIs averaged about 3 out of 5.

  • The Translation: In school terms, this is a "C" or "Passable." It's not failing, but it's definitely not an "A."
  • The Problem: While the average was a "C," about 25% of the answers were actually "F" grades (rated as "bad" or "very bad").
  • The Danger: Even more worrying, about 1 in 6 answers (roughly 15–19%) were flagged by the experts as "potentially harmful."

The Analogy: Imagine you are building a house. You ask a robot to tell you how to pour the concrete. On average, the robot gives you a decent recipe. But 25% of the time, it tells you to mix in sand instead of cement. If you follow that bad advice, the house might collapse.
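To make those percentages concrete, here is a small Python sketch showing how the headline numbers (average score, share of bad answers, share of harmful answers) fall out of a pile of raw ratings. The ratings below are made up for illustration; the study's actual data came from 19 experts grading answers to 40 questions.

```python
# Illustrative summary statistics over made-up ratings (not the study's data).
from statistics import mean

ratings = [  # (score 1-5, flagged_harmful)
    (3, False), (4, False), (2, False), (5, False),
    (3, False), (3, False), (4, False), (1, True),
]

scores = [s for s, _ in ratings]
avg = mean(scores)                                          # the "C" average
share_bad = sum(s <= 2 for s in scores) / len(scores)       # "F" grades
share_harmful = sum(h for _, h in ratings) / len(ratings)   # harm flags

print(f"average score: {avg:.1f}/5")
print(f"rated bad or very bad: {share_bad:.0%}")
print(f"flagged potentially harmful: {share_harmful:.0%}")
```

The point of the exercise: an average of 3 can coexist with a sizeable tail of failing and dangerous answers, which is exactly the pattern the paper reports.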

3. The "Hallucination" Trap

The paper found that heat therapy is such a specialized field, with far less online data than everyday topics like treating a cold, that the AIs started to make things up.

  • The "Fake Reference" Metaphor: One AI gave a very confident answer, citing a famous study to back it up. The experts looked it up and realized the study didn't exist. The AI had "hallucinated" a fake source to sound smart.
  • The "Wrong Device" Metaphor: When asked to list the machines used for this treatment, the AIs couldn't give a complete list. They missed key machines that are actually used in hospitals. It's like asking a mechanic to list all the tools in a garage, and they forget the most important wrench.

4. Why Did They Struggle?

The experts explained that these AIs are like generalist students. They are great at general knowledge, but they haven't been trained enough on this specific, narrow subject.

  • The Data Gap: There is less high-quality, organized information about heat therapy on the internet compared to other cancer treatments. The AIs are trying to learn from a library that has missing pages.
  • The "Confident but Wrong" Issue: The AIs often sounded very sure of themselves, even when they were wrong. This is dangerous because a non-expert (like a patient or a new doctor) might think, "Wow, the AI sounds so smart, I'll trust it," and make a bad medical decision.

5. The Final Verdict

The authors conclude that you should not use these AI tools to make medical decisions about heat therapy right now.

  • Current Status: They are okay for getting a very rough idea of what the field is about (like reading a Wikipedia summary).
  • The Warning: If you use them for actual treatment planning without a human expert double-checking everything, you risk getting bad advice that could hurt a patient.

The Bottom Line:
Think of these AI models as interns who are very well-read but lack real-world experience in this specific department. They are eager to help, but they need a senior doctor (a human expert) to review every single thing they write before it goes to the patient. Until these models get more training on this specific topic, they are not ready to work alone.
