Do Deployment Constraints Make LLMs Hallucinate Citations? An Empirical Study across Four Models and Five Prompting Regimes

This empirical study demonstrates that deployment-motivated prompting constraints significantly exacerbate citation hallucination across four large language models: no model achieved a citation existence rate above 47.5%, and a substantial share of unverifiable outputs were outright fabrications. The findings underscore the critical need for post-hoc verification in academic and software engineering contexts.

Chen Zhao, Yuan Tang, Yitian Qian

Published Tue, 10 Ma

Imagine you are asking a very smart, well-read assistant to write a short essay for you. You ask them to include a list of "real books" they used to write it. The assistant hands you a beautiful, perfectly formatted list of references. They look professional: they have authors, titles, publication years, and even ISBN numbers.

But here's the catch: most of those books don't actually exist.

This is exactly what researchers Chen Zhao, Yuan Tang, and Yitian Qian investigated in their paper, "Do Deployment Constraints Make LLMs Hallucinate Citations?" They wanted to see if putting "rules" on these AI assistants makes them lie more about their sources.

Here is the breakdown of their study using simple analogies.

The Setup: The "Closed-Book" Exam

The researchers treated the AI models like students taking a closed-book exam. They couldn't look up answers on the internet; they had to rely entirely on what was in their memory (training data).

They gave four different "students" (two famous, expensive AI models and two open-source ones) 144 different writing prompts. They asked them to write academic paragraphs and include a list of references.

To make things tricky, they gave the students five different sets of rules (constraints):

  1. Baseline: Just write normally.
  2. Temporal: "Only use books published between 2020 and 2025." (Like asking for only the latest news).
  3. Survey: "Write a broad overview covering many different topics." (Like asking for a 10-page summary of a whole library).
  4. Non-Disclosure: "Don't say you memorized these books from your training data." (A rule often used in corporate settings).
  5. Combo: All the above rules combined.
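To make the five regimes concrete, here is a small sketch of how such constraints could be layered onto a base request as prompt suffixes. The exact wording below is invented for illustration; the paper's actual prompts may differ.

```python
# Hypothetical prompt templates for the five constraint regimes.
# The wording is illustrative, not the authors' actual prompts.
BASE = "Write an academic paragraph on {topic} and include a list of references."

CONSTRAINTS = {
    "baseline": [],
    "temporal": ["Only cite works published between 2020 and 2025."],
    "survey": ["Write a broad survey covering many different subtopics."],
    "non_disclosure": [
        "Do not mention that these references come from your training data."
    ],
}
# The combo regime stacks all three constraints on top of the base prompt.
CONSTRAINTS["combo"] = (
    CONSTRAINTS["temporal"]
    + CONSTRAINTS["survey"]
    + CONSTRAINTS["non_disclosure"]
)

def build_prompt(topic: str, regime: str) -> str:
    """Assemble the base request plus any constraint sentences."""
    parts = [BASE.format(topic=topic)] + CONSTRAINTS[regime]
    return " ".join(parts)
```

The point of stacking constraints this way is that each rule narrows the space of citations the model can truthfully produce, which is exactly where the study saw fabrication rates climb.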

The Investigation: The "Fact-Checker"

After the AI generated thousands of citations, the researchers didn't just take their word for it. They built an automated fact-checking machine. This machine checked every single citation against two massive databases of real academic papers (Crossref and Semantic Scholar).

They categorized the results into three buckets:

  • Existing: The book/paper is real and matches the citation.
  • Fabricated: The book/paper is a complete lie (fake title, fake author, fake year).
  • Unresolved: The machine couldn't tell. It looked plausible, but the machine couldn't find it in the database. (The researchers found that about half of these "unsure" ones were actually lies too).

The Big Findings

1. The "Perfectly Fake" Problem

The most shocking discovery was that no AI model got a majority of citations right. Even the best model only had about 47.5% of its citations verified as real.

When the researchers added the "Temporal" rule (only recent papers), the AI's performance tanked.

  • The Analogy: Imagine asking a student, "Tell me about the history of the internet, but only use books published last week." The student, desperate to follow the rule, invents fake books that look like they were published last week.
  • The Result: The AI followed the rule perfectly (the dates were correct), but the books didn't exist. The "format" was perfect, but the "substance" was zero.

2. The "Rich vs. Poor" Student Gap

The researchers compared the expensive, proprietary models (like GPT-4o and Claude) against the free, open-source ones (like LLaMA and Qwen).

  • The Analogy: The expensive models are like students who went to a massive, private library with millions of books. The open-source models are like students who only had access to a small public library.
  • The Result: The "Rich" students did better, but they still lied a lot. The "Poor" students lied even more. When the rules got harder (like the "Survey" rule asking for a broad overview), the gap between them got huge. The expensive models could still find some real books; the open-source models mostly made them up.

3. The "Unresolved" Trap

About 36% to 61% of the citations fell into the "Unresolved" bucket.

  • The Analogy: Imagine a student hands you a reference that looks like a real book, but the library catalog is missing that specific entry. Is the book real but lost? Or did the student invent it?
  • The Danger: The researchers found that nearly half of these "mystery" citations were actually fake. This is dangerous because if you just look at the list, it seems trustworthy, but it's actually a trap.

4. The "Combo" Disaster

When they combined all the rules (Recent + Broad + Secretive), the results were terrible.

  • The Result: The open-source models collapsed almost entirely (near 0% real citations). Even the expensive models struggled, though they managed to keep a tiny sliver of real citations.
  • The Twist: Even though the quality of the citations was terrible, the AI kept generating more of them to fill the quota. It was like a factory churning out more and more fake products just to meet the production target.

Why Should You Care?

This paper is a warning label for anyone using AI to write academic papers, technical reports, or software documentation.

  • Don't trust the list: Just because an AI gives you a list of citations that looks perfect (with DOIs and years), it doesn't mean they are real.
  • Rules make it worse: If you ask an AI to follow strict rules (like "only use 2024 papers"), it is more likely to hallucinate (lie) to satisfy you.
  • The "Unresolved" is a red flag: If you can't easily verify a source, assume it might be fake.

The Bottom Line

The researchers conclude that prompt engineering alone cannot fix this. You can't just tell the AI "don't lie."

If you want reliable citations, you need to treat AI-generated lists as drafts only. You must manually check every single reference against a real database before you use it in any serious work. The AI is a great writer, but it is a terrible librarian.
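If you want to automate the first pass of that manual check, one option is Crossref's public REST API: fetch the metadata registered for a claimed DOI and compare it with what the model asserted. The comparison helper below is a hedged sketch of that idea; the strict equality check is deliberately simple, and real pipelines would match more leniently.

```python
import json
import urllib.request

def fetch_crossref(doi: str) -> dict:
    """Look up a DOI via Crossref's public works endpoint."""
    url = f"https://api.crossref.org/works/{doi}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["message"]

def matches_claim(record: dict, claimed_title: str, claimed_year: int) -> bool:
    """Check registered Crossref metadata against a model's claimed citation.

    Crossref returns the title as a list of strings and the publication
    date under issued["date-parts"] as [[year, month, day]].
    """
    real_title = (record.get("title") or [""])[0]
    real_year = record.get("issued", {}).get("date-parts", [[None]])[0][0]
    same_title = real_title.strip().lower() == claimed_title.strip().lower()
    return same_title and real_year == claimed_year
```

A citation that fails this check is not automatically fabricated, but it has earned a manual look before it goes anywhere near a bibliography.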