A Blinded Comparative Evaluation of Clinical and… — Plain-Language Explanation

Original authors: Akinniyi, S., Jain-Poster, K., Evangelista, E., Yoshikawa, N., Rivero, A.

Published 2026-04-15

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have a question about your ear—maybe it's ringing, hurting, or you're feeling dizzy. You turn to the internet for answers. In the past, you might have asked a doctor on a public forum like Reddit. Today, you might ask a super-smart computer program (an AI) like ChatGPT.

This study is like a blind taste test to see who gives better answers: the real doctors or the AI robots.

Here is the story of what they found, broken down simply:

The Setup: The "Blind Taste Test"

The researchers took 49 real questions people asked about their ears (like "Why does my ear hurt?" or "I can't hear well").

Team Doctor: They looked at the answers real, verified doctors gave on Reddit.
Team AI: They asked three different AI chatbots (ChatGPT, Claude, and Google Gemini) to answer the exact same questions.
The Judges: Five experts read all the answers without knowing who wrote them. They scored them on three things: Quality (is it right?), Empathy (does it sound kind?), and Readability (is it easy to understand?).

The Results: The Robot Wins the Popularity Contest

Surprisingly, the AI robots scored higher than the real doctors in every category the judges rated!

The "Kindness" Factor (Empathy):
- The Doctor: Imagine a busy doctor who has seen 20 patients today. They give you a quick, practical answer: "It sounds like an infection. Go see a specialist." It's accurate, but a bit short and to the point.
- The AI: Imagine a patient advocate who has all day to talk. The AI might say, "I'm so sorry your ear is hurting. That sounds really frustrating. Here is what might be happening, and here is exactly what you should do next."
- The Verdict: The judges felt the AI was much warmer, more caring, and sounded more like a friend who really listened.
The "Clarity" Factor (Readability):
- The Doctor: Doctors often use medical jargon or write in a way that assumes you know a little bit about medicine. It's like reading a textbook.
- The AI: The AI was told to write at a "6th-grade reading level." It explained things simply, like a teacher explaining a concept to a middle schooler.
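- The Verdict: The judges rated the AI's answers as easier to understand, although the gap was modest (4.00 versus 3.73), and one model, Claude, actually wrote at a higher grade level than the doctors did.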
The "Length" Factor:
- The doctors were concise (short). The AI was chatty (long).
- Analogy: The doctor gave you a map. The AI gave you a guided tour with the map, the history of the place, and tips on where to eat. Even though the AI talked more, people actually liked the extra detail.

The Catch: The "Uncanny Valley"

Even though the AI won the scores, there was a twist.

Can you tell who is who? The judges were able to guess correctly 89% of the time whether an answer was from a human or a robot.
Why? The AI sometimes sounded too perfect or used a specific "robotic" style of empathy that felt a little fake. It was like an actor playing a doctor; they did a great job, but you could still tell they were acting.
The Safety Issue: The AI often got a little too scared. If you mentioned a mild earache, the AI might say, "Go to the ER immediately!" This is called "up-triaging." Real doctors are better at knowing when to say, "It's probably fine, just keep an eye on it."

The Big Picture: What Does This Mean for You?

Think of the AI not as a replacement for your doctor, but as a super-powered assistant.

The Problem: Doctors are often overwhelmed with messages from patients. They are tired and short on time.
The Solution: Imagine a future where the AI writes the first draft of the answer. It explains the condition clearly, sounds kind, and suggests next steps. Then, the real doctor quickly reads it, fixes any medical errors, and hits "send."

In short: The AI is great at being a friendly, clear, and patient teacher. But it still needs a real doctor to be the final boss who makes sure the advice is safe and accurate. The future of healthcare might be a team-up: the robot's brain for clarity and kindness, and the human doctor's heart and judgment for safety.

1. Problem Statement

As patient reliance on digital health resources grows, there is an increasing need to evaluate how well Large Language Models (LLMs) communicate with patients. While LLMs show promise in general medical triage and education, their performance in otology (the study and treatment of ear disorders) remains under-explored. Existing studies often focus on single conditions, rely on a single AI model, or lack direct, blinded comparisons with verified physician responses. Furthermore, concerns persist regarding the accuracy, clinical appropriateness, empathy, and readability of AI-generated medical advice compared to that of human experts. This study addresses the gap by systematically comparing LLM responses against verified physician responses across a broad spectrum of otologic symptoms.

2. Methodology

The study employed a comparative, blinded design using data from a public online forum.

Data Source: 49 otology-related questions were extracted from the Reddit community r/AskDocs between January 2020 and June 2025.
Search Criteria: Questions were selected using specific keywords: "hearing loss," "ear infection," "tinnitus," "ear pain," and "vertigo."
Response Generation:
- Physician Group: Original responses from verified physician moderators on Reddit (verified via licensure documentation).
- AI Group: Three leading LLMs were prompted to answer the same questions: ChatGPT-4o, ClaudeAI Sonnet 4, and Google Gemini.
- Prompt Engineering: AI models were instructed to act as board-certified otolaryngologists, write at a 6th-grade reading level, use medically accurate language, and keep responses under 100 words.
Evaluation Process:
- Blinding: All responses were anonymized and randomly ordered.
- Evaluators: Five independent physician evaluators rated each response.
- Metrics: Responses were scored on a 5-point Likert scale for:
  - Quality: Medical accuracy, completeness, and focus.
  - Empathy: Emotional and cognitive empathy.
  - Readability: Ease of understanding.
- Automated Metrics: Readability was further quantified using Flesch-Kincaid Grade Level (FKGL), Automated Readability Index (ARI), Gunning-Fog Index (GFI), Mean Dependency Distance (MDD), and Measure of Textual Lexical Diversity (MTLD).
Statistical Analysis: Two-tailed Welch's t-tests were used for comparisons, with one-way ANOVA and leave-one-out sensitivity analyses to validate robustness. Significance was set at $p < 0.05$.
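
As a rough illustration of the Welch's t-test named above, the sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom for two groups with possibly unequal variances. The function name `welch_t` and the ratings are hypothetical and invented for illustration; they are not the study's data, and this is not the authors' analysis code.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean, using sample variance (n - 1).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    # Difference in means divided by the unpooled standard error.
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical 5-point Likert ratings, made up for this example only.
llm_scores = [4, 5, 4, 4, 5, 3, 4, 5]
physician_scores = [3, 4, 3, 4, 2, 4, 3, 3]

t_stat, dof = welch_t(llm_scores, physician_scores)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Turning the statistic into a two-tailed p-value requires the t distribution, which the Python standard library does not provide. One common option, if SciPy is available, is `scipy.stats.ttest_ind(a, b, equal_var=False)`.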

3. Key Contributions

Broad Symptom Scope: Unlike prior studies focusing on single conditions (e.g., only vertigo or tinnitus), this study evaluated a diverse range of otologic complaints (otalgia, vertigo, hearing loss, etc.).
Multi-Model Comparison: It is one of the first studies to simultaneously compare three distinct, state-of-the-art LLMs (ChatGPT, Claude, Gemini) against human physicians.
Blinded Human Evaluation: The study utilized a rigorous blinded evaluation by multiple physicians to assess subjective qualities like empathy and tone, rather than relying solely on automated fact-checking.
Linguistic Analysis: The inclusion of advanced syntactic and lexical diversity metrics (MDD, MTLD) provides a granular view of the structural differences between AI and human writing styles.

4. Key Results

Quality and Empathy: LLM responses significantly outperformed physician responses in all evaluated categories:
- Overall Quality: LLMs (Mean 10.95) vs. Physicians (9.58).
- Empathy: LLMs (Mean 7.26) vs. Physicians (5.18).
- Readability: LLMs (Mean 4.00) vs. Physicians (3.73).
- Note: All differences were statistically significant ($p < 0.05$).
Readability Levels:
- ChatGPT-4o produced the most readable content (Mean FKGL: 7.25, approx. 7th-grade level).
- ClaudeAI was significantly more complex (Mean FKGL: 11.86, approx. 11th-12th grade level).
- Google Gemini and Physicians fell in the intermediate range (8th–9th grade).
Word Count: LLM responses were significantly longer (Mean 145 words) than physician responses (Mean 67 words), despite the 100-word prompt constraint for AI.
Blind Identification: Evaluators correctly identified the author type (AI vs. Human) in 89.4% of cases. Sensitivity for detecting physician responses was particularly high (93.5%), suggesting AI responses still possess distinct stylistic markers.
Sensitivity Analysis: The superiority of LLMs held true even when any single AI model was excluded from the analysis, indicating the finding is robust across models.
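
For readers curious how grade-level scores such as the FKGL values above are derived, here is a minimal sketch of the standard Flesch-Kincaid Grade Level and Automated Readability Index formulas. The syllable counter is a crude vowel-group heuristic and the function names are assumptions made for this example, so its output will not exactly match dedicated readability tools or the values reported in the study.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, minus one for a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text):
    """Return FKGL and ARI for a non-empty English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    chars = sum(ch.isalnum() for w in words for ch in w)
    words_per_sentence = len(words) / len(sentences)
    # Flesch-Kincaid Grade Level: sentence length plus syllable density.
    fkgl = 0.39 * words_per_sentence + 11.8 * (syllables / len(words)) - 15.59
    # Automated Readability Index: characters per word plus sentence length.
    ari = 4.71 * (chars / len(words)) + 0.5 * words_per_sentence - 21.43
    return {"fkgl": fkgl, "ari": ari}


# Illustrative sentence, not taken from the study.
sample = "Your ear may be infected. Please see a doctor soon."
scores = readability(sample)
print(f"FKGL = {scores['fkgl']:.2f}, ARI = {scores['ari']:.2f}")
```

Both formulas reward short sentences and short words, which is why a prompt asking for a 6th-grade reading level tends to push a model toward simpler vocabulary and shorter sentences.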

5. Significance and Implications

Clinical Utility: The findings suggest that LLMs can generate responses that are more empathetic, comprehensive, and readable than typical forum-based physician replies. This supports the potential use of LLMs as drafting tools to augment clinician communication, potentially reducing burnout associated with high volumes of patient portal messages.
Patient Education: The ability of AI (specifically ChatGPT) to tailor complex medical information to a lower reading level (6th–7th grade) could significantly improve health literacy and patient understanding.
Limitations and Risks:
- Lack of Physical Exam: Neither AI nor forum physicians can perform physical exams, limiting the clinical validity of both.
- Over-triage: The study noted that LLMs frequently "up-triaged" concerns, recommending high-acuity evaluation even for benign conditions, which could increase healthcare utilization.
- Context: Physician responses were from an informal public forum, not secure clinical messaging systems, which may skew the baseline for human performance.
Future Directions: The authors conclude that while LLMs show high potential for enhancing patient engagement, they must be implemented with human oversight to verify medical accuracy and prevent unnecessary anxiety or over-utilization of services.

Conclusion: The study demonstrates that current LLMs can outperform verified physicians in the style and emotional tone of responses to otologic queries, offering a promising adjunct for patient education and communication, provided they are used as supportive tools rather than autonomous diagnostic agents.

A Blinded Comparative Evaluation of Clinical and AI-Generated Responses to Otologic Patient Queries