Imagine a busy hospital emergency room, but instead of people walking through the door, thousands of people are typing questions into a computer screen. Some are asking about a mild cold, some need to make a routine doctor's appointment, some have symptoms that need a doctor's attention within a day, and a few have life-threatening emergencies.

The challenge for the hospital is: How do you sort these thousands of messages quickly and safely without a human reading every single one?

This paper is a test drive of a new kind of "digital sorter" built on Artificial Intelligence (AI). Here is a breakdown of what the researchers did and what they found, using simple analogies.

The Problem: The "Noisy" Inbox

Online patient messages are messy. People don't speak like doctors; they write like friends. They might forget to mention how long they've been sick, how bad the pain is, or if they have other health issues.

The Goal: Sort these messages into four buckets:
1. Self-Care: "Stay home, drink tea, you'll be fine."
2. Schedule a Visit: "Make an appointment for next week."
3. Urgent Review: "Call a doctor today or tomorrow."
4. Emergency: "Call 911 or go to the ER right now."

The Experiment: The "Teacher" vs. The "Smart Student"

The researchers wanted to see if new, powerful AI models (called Large Language Models or LLMs) could do this sorting better than older, simpler computer programs, especially when they didn't have a huge pile of pre-labeled examples to study from.

The Old Way (Supervised Models): Imagine a student who has to memorize 700 specific examples of patient messages and their answers to learn the rules. The answers in that pile are "silver labels" (generated by an AI, not checked by a human).
The New Way (Prompted LLMs): Imagine a very smart student who has read millions of books. Instead of memorizing 700 examples, you just give them a few rules and a couple of examples (called "few-shot prompting") and ask, "Here is a new message; where does it go?"

The Results: Who Won the Race?

1. The "Smart Student" (LLMs) did better, but not by a landslide.
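The best AI model (Claude Haiku 4.5) scored about 0.475 on the paper's main grade, macro-F1, when given 12 examples to learn from. Macro-F1 is an average of per-category scores on a 0-to-1 scale, not a simple percentage of correct answers. The best "Old Way" model (BioBERT) scored about 0.378.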

The Catch: The gap wasn't large enough to say the best AI is definitely "better" in a statistical sense. The confidence intervals overlapped, and in paired tests only one of the AI models (Llama3.1-8B) was significantly better than BioBERT. It's like two runners finishing a race where one is slightly ahead, but the gap is so small you can't be sure who is faster without running it again.

2. The "Safety Score" is more important than the "Grade."
In a sorting task, it's worse to miss a real fire (under-triage) than to send the fire department to a false alarm (over-triage).

The researchers found that the AI models were somewhat better on the general "grade" (macro-F1) but much better on safety.
The top AI models never rated a case two or more levels too low (severe under-triage was 0% on the test set), whereas the older models made that kind of dangerous mistake at rates of roughly 27% to 31%.
Analogy: The AI is like a security guard who is only a little better at checking IDs but much better at spotting a real threat.

3. The "Confusing Middle" is still hard.
The AI was great at spotting "Self-Care" (easy) and "Emergency" (obvious). But it struggled with the middle ground: "Urgent Clinician Review."

Analogy: It's easy to tell the difference between a paper cut and a heart attack. It's very hard to tell the difference between a bad stomach ache that needs a doctor tomorrow versus one that can wait a week. Even the smartest AI got confused here.

4. The "Two-Headed" Strategy (Consensus)
The researchers tried a clever trick: What if they used two different AI models to sort the messages?

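If both AIs agree: "Okay, we both think this is 'Self-Care.' Let's accept it." (Agreement was reliable for "Self-Care," where over 90% of agreed answers were correct, but not for "Urgent Review," where only about 25% were.)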
If the AIs disagree: "We can't agree. Let's send this to a human doctor to look at."
The Result: This "Two-Headed" approach created a safety net. It didn't mean the AI could work alone; it meant the AI could act as a filter to help humans focus on the tricky cases.

The Bottom Line: A Helpful Assistant, Not a Replacement

The paper concludes that these AI models are not ready to work alone. They are not "autonomous" doctors.

Instead, think of them as a high-tech triage nurse assistant:

They can quickly sort out the easy "self-care" questions.
They can flag the obvious emergencies so no one misses them.
But for the confusing, middle-ground cases, they must always pass the message to a human doctor.

In short: The AI is a great tool to help humans prioritize their workload, but it should never be the final decision-maker for patient safety.

Technical Summary: Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries

Problem Statement

Online patient inquiries on health platforms are typically informal, incomplete, and written prior to professional assessment. Despite these limitations, health systems require scalable methods to route these messages to an appropriate level of clinical follow-up. This study frames the problem as a four-class actionable triage task, distinct from diagnosis generation or general medical text classification. The objective is to assign exactly one of four routing labels to a patient inquiry:

Self-care: Manageable at home without clinical contact.
Schedule-visit: Requires non-urgent clinician assessment (days to weeks).
Urgent-clinician-review: Requires timely review within 24–48 hours.
Emergency-referral: Requires immediate emergency evaluation.

The task is challenging for three reasons: patient-authored text often lacks key clinical details (duration, severity, vitals); high-acuity cases are rare; and errors are clinically asymmetric, since under-triage (missing an urgent case) is more dangerous than over-triage.

Methodology

Data Construction

The study utilizes the HealthCareMagic-100K corpus, a public dataset of anonymized patient-physician exchanges.

Preprocessing: Records were filtered to remove messages with fewer than 20 tokens or more than 500 tokens, leaving 110,163 usable messages.
Stratified Sampling: To address class imbalance (specifically the scarcity of emergency cases), a keyword-stratified sampling strategy was employed. Records were scored based on emergency keywords and physician escalation phrases, then assigned to buckets (self-care, schedule-visit, urgent, emergency) to enrich the working pool with higher-acuity inquiries.
Data Splits: From a 1,040-record working pool, three disjoint sets were created:
- Silver Training Set (N=700): Auto-labeled by Claude Sonnet 4.5. Used for training supervised baselines.
- Gold Evaluation Set (N=300): Human-calibrated by two researchers using a refined annotation guideline. Used for final evaluation.
- Few-Shot Pool (N=40): High-confidence, human-verified examples used for in-context learning demonstrations.
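
The keyword-stratified sampling step can be illustrated with a minimal Python sketch. The keyword lists, scoring weights, thresholds, and the `quota_per_bucket` parameter are all hypothetical; this summary does not give the study's actual lexicon or cut-offs. The buckets here are provisional sampling strata, not triage labels.

```python
import random
from collections import defaultdict

# Hypothetical keyword lists; the study's real lexicon is not reproduced here.
EMERGENCY_TERMS = ("chest pain", "can't breathe", "unconscious", "severe bleeding")
ESCALATION_TERMS = ("go to the er", "emergency room", "see a doctor immediately")
URGENT_TERMS = ("high fever", "getting worse", "vomiting")
SELF_CARE_TERMS = ("mild", "home remedy", "over the counter")


def count_hits(text, terms):
    """Count how many of the given terms occur in the text (case-insensitive)."""
    text = text.lower()
    return sum(term in text for term in terms)


def assign_bucket(patient_text, physician_text):
    """Assign a provisional sampling bucket from keyword scores (assumed weights)."""
    score = (2 * count_hits(patient_text, EMERGENCY_TERMS)
             + count_hits(physician_text, ESCALATION_TERMS))
    if score >= 2:
        return "emergency"
    if score == 1 or count_hits(patient_text, URGENT_TERMS) > 0:
        return "urgent"
    if count_hits(patient_text, SELF_CARE_TERMS) > 0:
        return "self-care"
    return "schedule-visit"


def stratified_sample(records, quota_per_bucket, seed=0):
    """Draw up to `quota_per_bucket` records from each bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        buckets[assign_bucket(rec["patient"], rec["physician"])].append(rec)
    sample = []
    for items in buckets.values():
        rng.shuffle(items)
        sample.extend(items[:quota_per_bucket])
    return sample
```

Capping each bucket at the same quota is one simple way to enrich rare high-acuity messages relative to their natural frequency in the corpus.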

Annotation and Labeling

A structured annotation guideline was developed through a two-person pilot and six rounds of refinement. It emphasizes "triage from text alone," distinguishing active symptoms from informational queries, and applying lower thresholds for vulnerable populations.

Silver Labels: Generated by Claude Sonnet 4.5.
Gold Calibration: Human reviewers compared their independent labels against the initial Sonnet labels. For the gold set, 38% of the initial labels were revised, and Sonnet-human agreement was Cohen's $\kappa$ = 0.35, which highlights the need for human calibration.
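
For reference, Cohen's $\kappa$ compares observed agreement $p_o$ with the agreement expected by chance $p_e$: $\kappa = (p_o - p_e) / (1 - p_e)$. The following is a minimal sketch of that standard formula. The function name and inputs are illustrative and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)
```

Because $p_e$ depends on how often each rater uses each label, the same raw agreement rate can correspond to different $\kappa$ values.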

Experimental Setup

The study compares supervised baselines against prompted Large Language Models (LLMs) under low-resource conditions.

Supervised Baselines:
- TF-IDF: Logistic Regression, Random Forest, and XGBoost trained on the 700-record silver set.
- BioBERT: BioBERT-v1.1 fine-tuned on the silver set.
- Note: Both "default" (full 700 examples) and "balanced" (downsampled to 91 examples per class) training conditions were evaluated.
Prompted LLMs: Six models (Llama3.1-8B, Qwen3-8B, Mistral-7B, DeepSeek-R1-7B, GPT-4o-mini, Claude Haiku 4.5) evaluated without parameter updates.
Prompting Conditions: Models were tested under 0-shot, 4-shot (one example per class), and 12-shot (three examples per class) settings.
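
The three prompting conditions can be sketched as follows, assuming a pool of labeled demonstrations and a plain-text prompt layout. The prompt wording, the `select_demonstrations` and `build_prompt` helpers, and the first-come selection rule are illustrative assumptions. The paper's actual template and selection procedure are not given in this summary.

```python
LABELS = (
    "self-care",
    "schedule-visit",
    "urgent-clinician-review",
    "emergency-referral",
)


def select_demonstrations(pool, shots_per_class):
    """Take the first `shots_per_class` pool examples for each label.

    shots_per_class = 0, 1, 3 corresponds to the 0-, 4-, and 12-shot settings.
    """
    demos = []
    for label in LABELS:
        matching = [ex for ex in pool if ex["label"] == label]
        if len(matching) < shots_per_class:
            raise ValueError(f"not enough pool examples for {label}")
        demos.extend(matching[:shots_per_class])
    return demos


def build_prompt(guideline, demos, inquiry):
    """Assemble guideline, demonstrations, and the new inquiry into one prompt."""
    parts = [guideline.strip(), "Allowed labels: " + ", ".join(LABELS)]
    for ex in demos:
        parts.append(f"Message: {ex['text']}\nLabel: {ex['label']}")
    parts.append(f"Message: {inquiry}\nLabel:")
    return "\n\n".join(parts)
```

No model parameters change between conditions. Only the number of in-context demonstrations in the prompt differs.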

Evaluation Metrics

Primary Metric: Macro-F1 (to account for class imbalance).
Safety-Aware Metrics: Emergency recall, urgent-or-higher recall, under-triage rate (predicting a lower severity than true), and severe under-triage rate (gap of $\ge$ 2 levels).
Consensus Analysis: An oracle Human-in-the-Loop (HITL) simulation where predictions are auto-accepted only if two models agree; otherwise, cases are escalated to human review.
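
The metrics above can be made concrete with a short sketch that treats the four labels as ordinal levels. The denominators are an assumption: the under-triage rates here are computed over all evaluated cases, and the paper may normalize differently (for example, over higher-acuity cases only).

```python
# Ordinal acuity levels for the four routing labels (lowest to highest).
LEVELS = {
    "self-care": 0,
    "schedule-visit": 1,
    "urgent-clinician-review": 2,
    "emergency-referral": 3,
}


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    scores = []
    for label in LEVELS:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(_ratio(2 * tp, 2 * tp + fp + fn))
    return sum(scores) / len(scores)


def safety_metrics(y_true, y_pred):
    """Safety-aware metrics; rates are taken over all cases (an assumption)."""
    pairs = [(LEVELS[t], LEVELS[p]) for t, p in zip(y_true, y_pred)]
    emergencies = [(t, p) for t, p in pairs if t == 3]
    urgent_plus = [(t, p) for t, p in pairs if t >= 2]
    return {
        "emergency_recall": _ratio(sum(p == 3 for _, p in emergencies), len(emergencies)),
        "urgent_or_higher_recall": _ratio(sum(p >= 2 for _, p in urgent_plus), len(urgent_plus)),
        # Under-triage: predicted level is below the true level.
        "under_triage_rate": _ratio(sum(p < t for t, p in pairs), len(pairs)),
        # Severe under-triage: predicted level is at least two levels too low.
        "severe_under_triage_rate": _ratio(sum(t - p >= 2 for t, p in pairs), len(pairs)),
    }
```

Two models with similar macro-F1 can differ sharply on these rates, because macro-F1 does not distinguish errors that push a case up from errors that push it down.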

Key Results

Classification Performance

Supervised Baselines: The strongest supervised baseline was BioBERT-v1.1 (default) with a macro-F1 of 0.378. Performance was notably weak on the emergency-referral class (F1 $\approx$ 0.26).
LLM Performance: Few-shot prompting improved performance. The strongest model, Claude Haiku 4.5 (12-shot), achieved a macro-F1 of 0.475. Other top performers included Llama3.1-8B (0.464) and Qwen3-8B (0.444).
Statistical Significance: While LLMs outperformed baselines in point estimates, confidence intervals overlapped. McNemar tests indicated that only Llama3.1-8B was significantly better than BioBERT-v1.1; the top LLMs were not significantly different from each other.
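
A McNemar test compares two classifiers on the same items by looking only at the cases where exactly one of them is correct. The sketch below implements the exact (binomial) two-sided variant. This summary does not say which variant the authors used, so the choice is an assumption, as is the function name.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only model A gets right
    and c counts items only model B gets right.
    """
    b = sum(a == t and o != t for t, a, o in zip(y_true, pred_a, pred_b))
    c = sum(a != t and o == t for t, a, o in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree in correctness
    # Under the null hypothesis, discordant pairs split 50/50 between b and c.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Because the test uses only discordant pairs, two models can have visibly different macro-F1 point estimates and still fail to reach significance on a 300-item evaluation set.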

Class-Specific and Safety Performance

Class Difficulty: "Self-care" was the easiest class for LLMs (F1 > 0.65). "Urgent-clinician-review" remained the most difficult class across all models (F1 < 0.35), reflecting the ambiguity of intermediate-acuity cases.
Safety Metrics: LLMs demonstrated superior safety profiles compared to supervised baselines.
- Under-triage: All top LLM configurations achieved a 0.000 severe under-triage rate on the gold set, whereas supervised baselines ranged from 0.269 to 0.308.
- Recall: GPT-4o-mini (12-shot) achieved the highest urgent-or-higher recall (0.984) and lowest under-triage rate (0.053), despite having a lower macro-F1 than Claude Haiku 4.5.

Prompt Sensitivity and Consensus

Prompt Sensitivity: Performance gains from few-shot prompting were neither uniform across models nor always monotonic. Claude Haiku 4.5 improved monotonically with more shots, whereas Qwen3-8B peaked at 4-shot and Llama3.1-8B performed worse at 4-shot than at 0-shot.
Two-Model Consensus: Agreement between models was highly label-dependent.
- Self-care: High agreement reliability (consensus accuracy > 90%).
- Urgent-clinician-review: Low agreement reliability (consensus accuracy $\approx$ 25%).
- Oracle-HITL: Simulating a workflow where disagreements are escalated to humans yielded a theoretical macro-F1 of up to 0.708 (GPT-4o-mini + Llama3.1-8B), suggesting significant potential for decision support.
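
The two-model consensus workflow can be sketched as below. Two points are assumptions: "oracle" is read as meaning that every escalated case receives the gold label from a perfect human reviewer, and per-label consensus accuracy is keyed by the label the two models agreed on. Function names are illustrative.

```python
from collections import defaultdict


def oracle_hitl(y_true, pred_a, pred_b):
    """Auto-accept agreed predictions; escalate disagreements to an oracle reviewer.

    Returns (final_labels, escalation_rate).
    """
    final = []
    escalated = 0
    for t, a, b in zip(y_true, pred_a, pred_b):
        if a == b:
            final.append(a)  # consensus: accepted without review
        else:
            final.append(t)  # oracle assumption: the reviewer is always right
            escalated += 1
    return final, escalated / len(y_true)


def consensus_accuracy_by_label(y_true, pred_a, pred_b):
    """Accuracy of agreed predictions, grouped by the agreed label."""
    agreed = defaultdict(int)
    correct = defaultdict(int)
    for t, a, b in zip(y_true, pred_a, pred_b):
        if a == b:
            agreed[a] += 1
            correct[a] += int(a == t)
    return {label: correct[label] / agreed[label] for label in agreed}
```

Under this reading, the oracle macro-F1 is an upper bound: it credits the workflow with perfect human review, so real-world gains depend on reviewer accuracy and on how many cases are escalated.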

Significance and Claims

The paper concludes that prompted LLMs can support triage prioritization and selective human review but are not ready for autonomous deployment.

Decision Support, Not Replacement: The authors argue that the value of LLMs lies in their ability to interpret free-text symptoms and follow complex guidelines without task-specific fine-tuning. However, the persistent difficulty in classifying "urgent-clinician-review" cases and the risk of under-triage in high-stakes scenarios preclude autonomous routing.
Workflow Integration: The study proposes a selective prediction strategy where LLMs handle low-risk "self-care" agreements (which are reliable) and flag high-risk or uncertain cases for human review.
Safety-Aware Evaluation: The paper emphasizes that aggregate metrics like macro-F1 obscure critical safety trade-offs. Models with lower F1 scores may be preferable if they minimize under-triage, a finding that necessitates safety-aware evaluation frameworks in clinical NLP.
Limitations: The authors acknowledge limitations including the use of a single public corpus, the modest size of the gold set (particularly for emergency cases), the reliance on silver labels for supervised training, and the offline nature of the evaluation. They state that prospective validation with clinician reviewers is required before claims about workload reduction or safety can be made.

In summary, this work benchmarks LLMs for online patient triage. It shows that few-shot LLMs outperform traditional supervised baselines in point estimates and on safety metrics in low-resource settings, although most macro-F1 differences were not statistically significant, and that their deployment must be strictly bounded by human oversight and label-dependent confidence signals.
