Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries

This study demonstrates that few-shot prompted large language models, particularly Claude Haiku 4.5, can outperform supervised baselines like BioBERT in routing online patient inquiries to appropriate clinical follow-up levels under low-resource conditions, though their performance variability suggests they are best suited for supporting selective human review rather than autonomous deployment.

Original authors: Liqi Zhou, Jiafu Li

Published 2026-05-18. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/).

⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine a busy hospital emergency room, but instead of people walking through the door, thousands of people are typing questions into a computer screen. Some are asking about a mild cold, some need to make a routine doctor's appointment, some have symptoms that need a doctor's attention within a day, and a few have life-threatening emergencies.

The challenge for the hospital is: How do you sort these thousands of messages quickly and safely without a human reading every single one?

This paper is like a test drive for a new kind of "digital sorter" using Artificial Intelligence (AI). Here is a breakdown of what the researchers did and what they found, using simple analogies.

The Problem: The "Noisy" Inbox

Online patient messages are messy. People don't speak like doctors; they write like friends. They might forget to mention how long they've been sick, how bad the pain is, or if they have other health issues.

  • The Goal: Sort these messages into four buckets:
    1. Self-Care: "Stay home, drink tea, you'll be fine."
    2. Schedule a Visit: "Make an appointment for next week."
    3. Urgent Review: "Call a doctor today or tomorrow."
    4. Emergency: "Call 911 or go to the ER right now."

The Experiment: The "Teacher" vs. The "Smart Student"

The researchers wanted to see if new, powerful AI models (called Large Language Models or LLMs) could do this sorting better than older, simpler computer programs, especially when they didn't have a huge pile of pre-labeled examples to study from.

  • The Old Way (Supervised Models): Imagine a student who has to memorize 700 specific examples of patient messages and their answers to learn the rules. These models are trained on "silver labels" (answers generated by an AI, not a human doctor).
  • The New Way (Prompted LLMs): Imagine a very smart student who has read millions of books. Instead of memorizing 700 examples, you just give them a few rules and a couple of examples (called "few-shot prompting") and ask, "Here is a new message; where does it go?"
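To make "few-shot prompting" concrete, here is a minimal sketch of how such a prompt might be assembled. The function name, the prompt wording, and the example messages are all hypothetical illustrations, not the prompt used in the paper; only the four category names come from the text above.

```python
# Category names taken from the four buckets described above.
LABELS = ["Self-Care", "Schedule a Visit", "Urgent Review", "Emergency"]


def build_few_shot_prompt(examples, new_message):
    """Assemble a plain-text prompt: rules, labeled examples, then the new message.

    `examples` is a list of (message, label) pairs. The wording is an
    assumption for illustration, not the paper's actual prompt.
    """
    lines = [
        "Sort the patient message into exactly one category:",
        ", ".join(LABELS) + ".",
        "",
    ]
    for text, label in examples:
        lines.append(f"Message: {text}")
        lines.append(f"Category: {label}")
        lines.append("")
    # The model is expected to complete the final "Category:" line.
    lines.append(f"Message: {new_message}")
    lines.append("Category:")
    return "\n".join(lines)


# Made-up example messages, for illustration only.
examples = [
    ("I have a runny nose and a mild sore throat since yesterday.", "Self-Care"),
    ("Crushing chest pain and I can't breathe.", "Emergency"),
]
prompt = build_few_shot_prompt(examples, "My knee has been aching for a month.")
print(prompt)
```

The resulting text would then be sent to whichever model is being tested. Changing the number of pairs in `examples` is what separates, say, a 2-shot prompt from a 12-shot prompt.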

The Results: Who Won the Race?

1. The "Smart Student" (LLMs) did better, but not by a landslide.
The best AI model (Claude Haiku 4.5) scored about 47.5% when given 12 examples in its prompt. The best "Old Way" model (BioBERT) scored about 37.8%.

  • The Catch: The gap wasn't large enough to say the new AI is definitely "better" in a statistical sense; the uncertainty ranges around the two scores overlapped. It's like two runners finishing a race where one is slightly ahead, but the gap is so small you can't be sure who is faster without running it again.

2. The "Safety Score" is more important than the "Grade."
In a sorting task like this, missing a real fire (under-triaging an Emergency) is worse than sending the fire department to a false alarm (over-triage).

  • The researchers found that the AI models were only modestly better on the general "grade" (Macro-F1), but much better on safety.
  • In the test, the AI models did not miss a true emergency (Severe Under-triage was 0%), whereas the older models missed dangerous cases about 30% of the time.
  • Analogy: The AI is like a security guard who is slightly slower at checking IDs but is much better at spotting a real threat.
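The two kinds of score mentioned above can be sketched in a few lines. Macro-F1 is the standard average of per-category F1 scores. The severe under-triage rule below (a prediction two or more levels lower than the true level) is an assumption made for illustration; the paper's exact definition is not given here. The labels and predictions are made-up toy data.

```python
# The four levels, ordered from least to most urgent.
LEVELS = ["Self-Care", "Schedule a Visit", "Urgent Review", "Emergency"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def macro_f1(y_true, y_pred):
    """Average the F1 score of each category, weighting categories equally."""
    scores = []
    for label in LEVELS:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A category that is never present or predicted scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def severe_under_triage_rate(y_true, y_pred, gap=2):
    """Share of cases predicted at least `gap` levels below the true level.

    The `gap=2` threshold is a hypothetical choice, not the paper's definition.
    """
    severe = sum(RANK[t] - RANK[p] >= gap for t, p in zip(y_true, y_pred))
    return severe / len(y_true)


# Toy data: one emergency is wrongly sorted into Self-Care.
y_true = ["Self-Care", "Schedule a Visit", "Urgent Review", "Emergency"]
y_pred = ["Self-Care", "Urgent Review", "Urgent Review", "Self-Care"]

print(round(macro_f1(y_true, y_pred), 3))        # 0.333
print(severe_under_triage_rate(y_true, y_pred))  # 0.25
```

The toy data shows why the two numbers are reported separately: sending "Schedule a Visit" to "Urgent Review" lowers the grade but is not dangerous, while sending an Emergency to Self-Care is the error the safety score is designed to catch.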

3. The "Confusing Middle" is still hard.
The AI was great at spotting "Self-Care" (easy) and "Emergency" (obvious). But it struggled with the middle ground: "Urgent Clinician Review."

  • Analogy: It's easy to tell the difference between a paper cut and a heart attack. It's very hard to tell the difference between a bad stomach ache that needs a doctor tomorrow versus one that can wait a week. Even the smartest AI got confused here.

4. The "Two-Headed" Strategy (Consensus)
The researchers tried a clever trick: What if they used two different AI models to sort the messages?

  • If both AIs agree: "Okay, we both think this is 'Self-Care.' Let's accept it." (This worked very well).
  • If the AIs disagree: "We can't agree. Let's send this to a human doctor to look at."
  • The Result: This "Two-Headed" approach created a safety net. It didn't mean the AI could work alone; it meant the AI could act as a filter to help humans focus on the tricky cases.
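The agree-or-escalate rule above is simple enough to sketch directly. The function names and the route labels ("auto_accept", "human_review") are hypothetical; the paper may apply extra conditions that are not described here.

```python
def route_by_consensus(label_a, label_b):
    """Accept a category only when both models agree; otherwise escalate.

    Returns a (route, label) pair. Route names are illustrative assumptions.
    """
    if label_a == label_b:
        return ("auto_accept", label_a)
    return ("human_review", None)


def summarize(pairs):
    """Count how many messages are accepted vs. sent to a human."""
    counts = {"auto_accept": 0, "human_review": 0}
    for label_a, label_b in pairs:
        route, _ = route_by_consensus(label_a, label_b)
        counts[route] += 1
    return counts


# Made-up predictions from two models for three messages.
pairs = [
    ("Self-Care", "Self-Care"),
    ("Urgent Review", "Schedule a Visit"),
    ("Emergency", "Emergency"),
]
print(summarize(pairs))  # {'auto_accept': 2, 'human_review': 1}
```

In this sketch the disagreement lands on exactly the kind of middle-ground case described earlier, which is the point of the design: human attention is spent where the models are least reliable.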

The Bottom Line: A Helpful Assistant, Not a Replacement

The paper concludes that these AI models are not ready to work alone. They are not "autonomous" doctors.

Instead, think of them as a high-tech triage nurse assistant:

  • They can quickly sort out the easy "self-care" questions.
  • They can flag the obvious emergencies so no one misses them.
  • But for the confusing, middle-ground cases, they must always pass the message to a human doctor.

In short: The AI is a great tool to help humans prioritize their workload, but it should never be the final decision-maker for patient safety.
