Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment

This study evaluates small open-source language models for clinical question answering on consumer hardware. It finds that Llama 3.2 offers the best balance of accuracy and reliability, that high prompt consistency does not guarantee correctness, and that certain prompt styles, as well as domain-specific pretraining without instruction tuning, can severely degrade performance.

Shravani Hariprasad

Published 2026-03-05

Imagine you are in a rural village clinic with no internet and no fancy supercomputers. You have a small, local AI assistant on a regular laptop, and you need it to help answer tricky medical questions. You want to know: Can we trust this little AI?

This paper is like a rigorous "stress test" for five different small AI models to see how they handle medical questions when you ask them in slightly different ways.

Here is the breakdown of what they found, using some simple analogies:

1. The "Steady Hand" vs. The "Smart Brain"

The biggest surprise in this study is that being consistent doesn't mean being right.

  • The Analogy: Imagine two students taking a math test.
    • Student A (The "Gemma 2" model): If you ask them the same question in five different ways, they give you the exact same wrong answer every single time. They are incredibly consistent, but unlike a broken clock that is at least right twice a day, they are always wrong. They are confidently incorrect.
    • Student B (The "Llama 3.2" model): If you ask them the same question in different ways, they might give you slightly different answers, but most of the time, they get the right answer. They are less "steady" in their wording, but they are much smarter.

The Takeaway: In healthcare, a model that is "reliably wrong" is actually more dangerous than one that varies its wording but is usually right, because a doctor might trust the steady-but-wrong model too much.
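The consistency-versus-correctness distinction can be sketched as two tiny metrics. This is an illustrative sketch only: the model names, answers, and the gold label below are made up, and the paper may define its metrics differently.

```python
from collections import Counter

def consistency(answers):
    """Fraction of responses matching the most common answer
    across paraphrases of the same question."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def accuracy(answers, gold):
    """Fraction of responses that are actually correct."""
    return sum(a == gold for a in answers) / len(answers)

# Hypothetical responses to five paraphrases of one question (gold = "yes"):
gemma_like = ["no", "no", "no", "no", "no"]      # perfectly consistent, always wrong
llama_like = ["yes", "yes", "no", "yes", "yes"]  # less consistent, mostly right

print(consistency(gemma_like), accuracy(gemma_like, "yes"))  # 1.0 0.0
print(consistency(llama_like), accuracy(llama_like, "yes"))  # 0.8 0.8
```

The point of the sketch: the "Gemma-like" pattern scores a perfect 1.0 on consistency and 0.0 on accuracy, so reporting consistency alone would hide the danger entirely.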

2. The "Roleplay" Trap

The researchers tried asking the AI questions in different "personas." For example, they tried saying, "Pretend you are a senior doctor taking a board exam," versus just asking the question directly.

  • The Analogy: It's like asking a chef to cook a meal.
    • Direct: "Make me a burger." (The chef makes a great burger).
    • Roleplay: "You are a world-famous chef who loves to cook burgers. Show me your skills!" (The chef gets distracted by the drama of the roleplay, overthinks it, and burns the burger).

The Takeaway: For these small AI models, pretending to be a character actually made them worse at answering medical questions. The "roleplay" prompts confused them. If you want the best results, just ask the question plainly.
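To make the comparison concrete, here is roughly what the two prompt styles look like side by side. The question text and persona wording are invented for illustration; they are not the study's actual prompts.

```python
# A hypothetical clinical yes/no question, phrased two ways.
QUESTION = "Is aspirin recommended after a heart attack? Answer Yes or No."

# Direct style: just the question.
direct_prompt = QUESTION

# Roleplay style: same question wrapped in a persona, which the
# study found tends to hurt small models rather than help them.
persona_prompt = (
    "You are a senior doctor taking a board exam. "
    "Show your expertise.\n" + QUESTION
)

print(direct_prompt)
print(persona_prompt)
```

Note that both prompts ask the identical question; only the framing differs, which is exactly what makes the accuracy drop under roleplay so striking.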

3. Bigger Isn't Always Better (and "Medical Knowledge" isn't Enough)

They tested models of different sizes (from 2 billion to 7 billion "brain cells" or parameters) and even one that was specifically trained on medical books but never taught how to follow instructions.

  • The Analogy:
    • The Big Model: Imagine a giant library (7B parameters). You'd think it knows everything. But sometimes, it gets so overwhelmed it forgets to give you the answer in the format you asked for (like giving you a paragraph when you asked for a "Yes/No").
    • The Medical Expert: Imagine a doctor who has read every medical textbook in the world but has never been taught how to fill out a form. If you ask them a question, they might know the answer, but they can't give it to you in the way you need it. They just stare at you blankly.

The Takeaway:

  1. Size doesn't guarantee safety: A bigger model didn't necessarily follow instructions better than a smaller one.
  2. Knowledge needs instructions: Just having medical knowledge isn't enough. The AI needs to be "taught" how to listen and answer in a structured way. The model that knew the most medicine but couldn't follow instructions failed almost 100% of the time.
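One way to see the instruction-following failure is a strict format check on the model's reply: if a Yes/No answer cannot be extracted, the response counts as a failure no matter how knowledgeable it sounds. The regex and example outputs below are assumptions for illustration, not the paper's actual parser.

```python
import re

def parse_yes_no(output: str):
    """Return 'yes' or 'no' if the reply begins with a clean
    yes/no token; otherwise None (an instruction-following failure)."""
    m = re.match(r"(yes|no)\b", output.strip(), re.IGNORECASE)
    return m.group(1).lower() if m else None

# An instruction-tuned model tends to comply with the format...
print(parse_yes_no("Yes, that is the first-line treatment."))
# ...while a base model with medical knowledge but no instruction
# tuning may ramble and never produce a parseable answer:
print(parse_yes_no("The question concerns first-line therapy, which..."))
```

Under a check like this, a model that "knows the answer" but never emits it in the requested format scores near zero, which matches the near-100% failure rate the article describes.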

4. The Winner for Low-Resource Clinics

Since many clinics in developing areas can't afford expensive cloud servers, they need models that run on regular computers.

  • The Winner: Llama 3.2 (a 3-billion parameter model).
  • Why? It struck the best balance. It wasn't the most "steady" (it changed its answer slightly depending on how you asked), but it was the most accurate. It also rarely failed to give an answer at all.

Summary: What Should We Do?

If you are building an AI for a doctor in a low-resource clinic:

  1. Don't judge by consistency alone: Check whether the AI gives the same answer every time, but also check whether that answer is actually correct.
  2. Don't use "Roleplay": Don't tell the AI to "act like a doctor." Just ask the question directly.
  3. Teach it to follow rules: Make sure the AI knows how to follow instructions, not just that it knows medical facts.
  4. Pick the balanced model: Sometimes a slightly smaller, smarter model is better than a huge, confused one.

The Bottom Line: In medicine, a "confidently wrong" AI is a ticking time bomb. We need to test these models not just on how smart they are, but on how stable and obedient they are, too.