Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
This study evaluates small open-source language models for clinical question answering on consumer hardware. Llama 3.2 offers the best balance of accuracy and reliability, but high consistency across prompts does not guarantee correctness: a model can return the same wrong answer under every phrasing. Certain prompt styles can severely degrade performance, as can domain-specific pretraining that is not followed by instruction tuning.
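The gap between consistency and correctness can be illustrated with a minimal sketch. It assumes consistency is measured as the share of prompt variants that agree with the most common answer, and accuracy as the share that match the reference answer; these definitions, the function names, and the sample answers are illustrative assumptions, not necessarily the metrics or data used in the study.

```python
from collections import Counter


def consistency(answers):
    """Share of prompt variants that agree with the most common answer.

    Assumed definition for illustration only.
    """
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / len(answers)


def accuracy(answers, gold):
    """Share of prompt variants whose answer matches the reference answer."""
    return sum(answer == gold for answer in answers) / len(answers)


# Hypothetical answers from one model to one multiple-choice question,
# asked under five differently phrased prompts.
answers = ["B", "B", "B", "B", "B"]
gold = "C"  # hypothetical reference answer

print(f"consistency = {consistency(answers):.2f}")
print(f"accuracy    = {accuracy(answers, gold):.2f}")
```

In this made-up case the model is perfectly consistent and never correct, which is why a consistency score is best read alongside accuracy rather than as a proxy for it.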