Imagine you are a detective trying to solve a mystery. The "crime" is a patient having a seizure, and the "culprit" is a specific, tiny spot in their brain where the seizure started. Usually, doctors have to look at brain scans and EEGs to find this spot. But before they do that, they listen to the patient (or a witness) describe what happened.
This description is often messy, full of everyday words, and unstructured. It's like reading a frantic text message: "I felt weird, my arm started shaking, I tasted something metallic, and then I passed out."
SemioLLM is a new study that asks a big question: Can Artificial Intelligence (AI) read these messy, frantic descriptions and figure out exactly where the seizure started in the brain, just like a human doctor?
Here is the story of what they found, broken down into simple concepts:
1. The Cast of Characters (The AI Models)
The researchers didn't just test one AI. They gathered a team of eight different "digital detectives."
- The Generalists: Big, famous AIs like GPT-4 and GPT-3.5 (the smart, all-knowing librarians).
- The Specialists: AIs trained specifically on medical books and papers (like OpenBioLLM and Med42).
- The Open Source Crew: Powerful models anyone can download and tweak (like Llama and Mixtral).
2. The Test: "The Seizure Detective Game"
The researchers gave these AIs over 1,200 real-life seizure descriptions. They asked the AIs to guess which of seven brain regions was the starting point.
- The Goal: Match the AI's guess against the "Gold Standard" (the region human doctors pinpointed, confirmed because surgery on that spot stopped the patient's seizures).
- The Challenge: The descriptions were unstructured. No multiple-choice options, just raw text. (A minimal sketch of how such a test might be scored follows this list.)
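To make the game concrete, here is a minimal sketch of how you might run and score it with a generic chat-completion API. The model name, the region labels, the prompt wording, and the example case are all illustrative assumptions, not the study's exact setup.

```python
# A minimal sketch of the "Seizure Detective Game," assuming a generic
# chat-completion client. Region labels and prompt wording are placeholders,
# not the study's exact protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REGIONS = ["frontal", "temporal", "parietal", "occipital",
           "insula", "cingulate", "hypothalamus"]  # illustrative labels

def locate_onset(description: str) -> str:
    """Ask the model to pick one region for a free-text seizure description."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                "A patient describes their seizure as follows:\n"
                f'"{description}"\n'
                "Which brain region did it most likely start in? "
                f"Answer with exactly one of: {', '.join(REGIONS)}."
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()

# Score predictions against the surgically confirmed "gold standard."
cases = [("I tasted something metallic, then my arm shook.", "temporal")]
correct = sum(locate_onset(text) == gold for text, gold in cases)
print(f"Accuracy: {correct / len(cases):.0%}")
```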
3. The Big Discovery: "The Power of the Prompt"
At first, when the AIs were asked to guess without any help (Zero-Shot), they were okay, but not great. They beat random chance (with seven regions, a blind guess is right only about one time in seven), but they fell well short of doctor level.
Then, the researchers gave them a "cheat sheet" (Prompt Engineering).
- The "Chain of Thought" Trick: Instead of just asking for an answer, they told the AI: "Stop and think step-by-step. Explain your reasoning like a doctor would."
- The "Impersonation" Trick: They told the AI: "You are now a world-famous epilepsy expert."
The Result? The AIs got much smarter. With these tricks, the best AIs (like GPT-4) performed almost as well as the human doctors. It's like giving a student a study guide and telling them to "act like a professor" before a test—they suddenly ace it.
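To see what these two tricks might look like in practice, here is an illustrative sketch. The exact wording the researchers used is not reproduced here; these are representative examples of an impersonation system prompt and a chain-of-thought user prompt.

```python
# Illustrative versions of the two prompt "tricks." Wording is hypothetical.

impersonation = (
    "You are a world-renowned epilepsy expert with decades of experience "
    "localizing seizure onset zones from patient histories."
)

chain_of_thought = (
    'A patient reports: "I felt weird, my arm started shaking, I tasted '
    'something metallic, and then I passed out."\n'
    "Think step by step: identify each symptom, explain which brain region "
    "it points to and why, then give your final answer as a single region."
)

messages = [
    {"role": "system", "content": impersonation},   # the "Impersonation" trick
    {"role": "user", "content": chain_of_thought},  # the "Chain of Thought" trick
]
```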
4. The Catch: Confidence vs. Reality
Here is where it gets tricky.
- Confidence: The AIs were very confident in their answers. They said, "I am 90% sure!"
- The Problem: Sometimes, they were confident but wrong.
- The Hallucination: The researchers found that some AIs would make up fake medical facts or cite papers that didn't exist to support their wrong answers. It's like a student who confidently writes a history essay citing a book that was never written.
The Winner: GPT-4 was the star. It not only got the answer right more often but also cited real, existing medical papers to back up its reasoning. Mixtral was fast and good at reading, but it made more mistakes in its logic and citations.
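One simple way to expose this gap, sketched below under the assumption that each answer carries a self-reported confidence score, is to compare the model's average stated confidence with its actual accuracy. The example data is made up for illustration.

```python
# A minimal calibration check, assuming each model answer comes with a
# self-reported confidence (e.g., "temporal, 90%"). Data is hypothetical.
from statistics import mean

answers = [
    # (model's region guess, stated confidence, gold-standard region)
    ("temporal", 0.90, "temporal"),
    ("frontal",  0.90, "temporal"),   # confident but wrong
    ("parietal", 0.60, "parietal"),
]

stated = mean(conf for _, conf, _ in answers)
actual = mean(guess == gold for guess, _, gold in answers)
print(f"Average stated confidence: {stated:.0%}")   # 80%
print(f"Actual accuracy:           {actual:.0%}")   # 67% -> overconfident
```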
5. The Weird Quirks (What made the AI stumble?)
The study found some funny and surprising patterns:
- The "Goldilocks" Length: The AI did best with very short descriptions (just the key facts) or very long, detailed descriptions. It got confused by descriptions that were "medium" length. It's like how you might understand a very short summary or a very detailed story, but a half-baked story is confusing.
- The Language Barrier: The AIs were great at English. If you gave them a French seizure description but asked the question in English, they were still okay. But if you asked the question in French, they got significantly worse. They are still mostly "English speakers" at heart.
6. Why This Matters
This study is a huge step forward. It shows that AI isn't just a trivia bot that can answer medical exam questions; it can actually read a messy patient story and help diagnose a complex brain condition.
However, there is a warning label:
We can't just trust the AI blindly. Because it can "hallucinate" (make things up) and be confidently wrong, doctors need to check its work. The AI is a powerful assistant, not a replacement for the human doctor.
The Bottom Line
SemioLLM proved that with the right instructions, AI can act like a junior doctor, reading unstructured patient stories to find the source of epilepsy. It's a tool that could help doctors diagnose patients faster and more accurately, but it still needs a human supervisor to keep it honest.