Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment

This study evaluates small open-source language models for clinical question answering on consumer hardware. It finds that Llama 3.2 offers the best balance of accuracy and reliability, that high prompt consistency does not guarantee correctness, and that certain prompt styles, as well as domain-specific pretraining without instruction tuning, can severely degrade performance.

Shravani Hariprasad

Published 2026-03-05

Imagine you are in a rural village clinic with no internet and no fancy supercomputers. You have a small, local AI assistant on a regular laptop, and you need it to help answer tricky medical questions. You want to know: Can we trust this little AI?

This paper is like a rigorous "stress test" for five different small AI models to see how they handle medical questions when you ask them in slightly different ways.

Here is the breakdown of what they found, using some simple analogies:

1. The "Steady Hand" vs. The "Smart Brain"

The biggest surprise in this study is that being consistent doesn't mean being right.

  • The Analogy: Imagine two students taking a math test.
    • Student A (The "Gemma 2" model): If you ask them the same question in five different ways, they give you the exact same wrong answer every single time. They are incredibly consistent, but unlike a broken clock that is at least right twice a day, they are always wrong. They are confidently incorrect.
    • Student B (The "Llama 3.2" model): If you ask them the same question in different ways, they might give you slightly different answers, but most of the time, they get the right answer. They are less "steady" in their wording, but they are much smarter.

The Takeaway: In healthcare, a model that is "reliably wrong" is actually more dangerous than one that varies its wording but is usually right, because a doctor might trust the steady-but-wrong model too much.
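The consistency-versus-correctness distinction can be sketched as two tiny metrics. This is an illustrative sketch only: the model names, answers, and the gold label below are made up, and the paper may define its metrics differently.

```python
from collections import Counter

def consistency(answers):
    """Fraction of responses matching the most common answer
    across paraphrases of the same question."""
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def accuracy(answers, gold):
    """Fraction of responses that are actually correct."""
    return sum(a == gold for a in answers) / len(answers)

# Hypothetical responses to five paraphrases of one question (gold = "yes"):
gemma_like = ["no", "no", "no", "no", "no"]      # perfectly consistent, always wrong
llama_like = ["yes", "yes", "no", "yes", "yes"]  # less consistent, mostly right

print(consistency(gemma_like), accuracy(gemma_like, "yes"))  # 1.0 0.0
print(consistency(llama_like), accuracy(llama_like, "yes"))  # 0.8 0.8
```

The point of the sketch: the "Gemma-like" pattern scores a perfect 1.0 on consistency and 0.0 on accuracy, so reporting consistency alone would hide the danger entirely.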

2. The "Roleplay" Trap

The researchers tried asking the AI questions in different "personas." For example, they tried saying, "Pretend you are a senior doctor taking a board exam," versus just asking the question directly.

  • The Analogy: It's like asking a chef to cook a meal.
    • Direct: "Make me a burger." (The chef makes a great burger).
    • Roleplay: "You are a world-famous chef who loves to cook burgers. Show me your skills!" (The chef gets distracted by the drama of the roleplay, overthinks it, and burns the burger).

The Takeaway: For these small AI models, pretending to be a character actually made them worse at answering medical questions. The "roleplay" prompts confused them. If you want the best results, just ask the question plainly.
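To make the comparison concrete, here is roughly what the two prompt styles look like side by side. The question text and persona wording are invented for illustration; they are not the study's actual prompts.

```python
# A hypothetical clinical yes/no question, phrased two ways.
QUESTION = "Is aspirin recommended after a heart attack? Answer Yes or No."

# Direct style: just the question.
direct_prompt = QUESTION

# Roleplay style: same question wrapped in a persona, which the
# study found tends to hurt small models rather than help them.
persona_prompt = (
    "You are a senior doctor taking a board exam. "
    "Show your expertise.\n" + QUESTION
)

print(direct_prompt)
print(persona_prompt)
```

Note that both prompts ask the identical question; only the framing differs, which is exactly what makes the accuracy drop under roleplay so striking.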

3. Bigger Isn't Always Better (and "Medical Knowledge" isn't Enough)

They tested models of different sizes (from 2 billion to 7 billion "brain cells" or parameters) and even one that was specifically trained on medical books but never taught how to follow instructions.

  • The Analogy:
    • The Big Model: Imagine a giant library (7B parameters). You'd think it knows everything. But sometimes, it gets so overwhelmed it forgets to give you the answer in the format you asked for (like giving you a paragraph when you asked for a "Yes/No").
    • The Medical Expert: Imagine a doctor who has read every medical textbook in the world but has never been taught how to fill out a form. If you ask them a question, they might know the answer, but they can't give it to you in the way you need it. They just stare at you blankly.

The Takeaway:

  1. Size doesn't guarantee safety: A bigger model didn't necessarily follow instructions better than a smaller one.
  2. Knowledge needs instructions: Just having medical knowledge isn't enough. The AI needs to be "taught" how to listen and answer in a structured way. The model that knew the most medicine but couldn't follow instructions failed almost 100% of the time.
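One way to see the instruction-following failure is a strict format check on the model's reply: if a Yes/No answer cannot be extracted, the response counts as a failure no matter how knowledgeable it sounds. The regex and example outputs below are assumptions for illustration, not the paper's actual parser.

```python
import re

def parse_yes_no(output: str):
    """Return 'yes' or 'no' if the reply begins with a clean
    yes/no token; otherwise None (an instruction-following failure)."""
    m = re.match(r"(yes|no)\b", output.strip(), re.IGNORECASE)
    return m.group(1).lower() if m else None

# An instruction-tuned model tends to comply with the format...
print(parse_yes_no("Yes, that is the first-line treatment."))
# ...while a base model with medical knowledge but no instruction
# tuning may ramble and never produce a parseable answer:
print(parse_yes_no("The question concerns first-line therapy, which..."))
```

Under a check like this, a model that "knows the answer" but never emits it in the requested format scores near zero, which matches the near-100% failure rate the article describes.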

4. The Winner for Low-Resource Clinics

Since many clinics in developing areas can't afford expensive cloud servers, they need models that run on regular computers.

  • The Winner: Llama 3.2 (a 3-billion parameter model).
  • Why? It struck the best balance. It wasn't the most "steady" (it changed its answer slightly depending on how you asked), but it was the most accurate. It also rarely failed to give an answer at all.

Summary: What Should We Do?

If you are building an AI for a doctor in a low-resource clinic:

  1. Don't judge by consistency alone: Check whether the AI gives the same answer every time, but also check whether that answer is actually correct.
  2. Don't use "Roleplay": Don't tell the AI to "act like a doctor." Just ask the question directly.
  3. Teach it to follow rules: Make sure the AI knows how to follow instructions, not just that it knows medical facts.
  4. Pick the balanced model: Sometimes a slightly smaller, smarter model is better than a huge, confused one.

The Bottom Line: In medicine, a "confidently wrong" AI is a ticking time bomb. We need to test these models not just on how smart they are, but on how stable and obedient they are, too.