Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment

This study evaluates small open-source language models for clinical question answering on consumer hardware. It finds that while Llama 3.2 offers the best balance of accuracy and reliability, high prompt consistency does not guarantee correctness, and certain prompt styles, as well as domain-specific pretraining without instruction tuning, can severely degrade performance.

Shravani Hariprasad · 2026-03-05 · 🤖 cs.AI

How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations

This study presents a large-scale audit of 10 commercial LLMs, revealing significant variation in citation hallucination rates across models and domains. It demonstrates that prompt-induced fabrication can be effectively mitigated through multi-model consensus, within-prompt repetition, and a lightweight bibliographic classifier that detects phantom citations without external database queries.

MZ Naser · 2026-03-05 · 💬 cs.CL