How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

This study draws on a massive 172-billion-token evaluation across diverse models, context lengths, and hardware platforms. It finds that while model selection is the primary determinant of accuracy, hallucination rates in document Q&A rise significantly with context length and vary non-linearly with temperature, showing that grounding ability and fabrication resistance are distinct capabilities.

JV Roig

Published Tue, 10 Ma

Imagine you hire a team of super-smart librarians (the AI models) to answer questions based only on a specific set of books you give them. You want them to be perfect: if the answer isn't in the books, they should say, "I don't know." If the answer is there, they should find it instantly.

But here's the scary truth this paper reveals: Even the best librarians sometimes make up stories. They might confidently tell you a fact that isn't in the books, just because it sounds plausible. This is called "hallucination."

The authors of this paper, the team at Kamiwaza AI, didn't just ask a few questions and guess. They ran a massive experiment, 172 billion tokens' worth of testing (roughly a million full-length novels' worth of text), to see exactly how often these AI librarians lie, and why.

Here is the breakdown of their findings, translated into everyday language:

1. The "Library Size" Problem (Context Length)

Imagine you give a librarian a single pamphlet. They can probably answer your questions perfectly. Now, imagine you give them a library with 200,000 books.

  • The Finding: As the library gets bigger, the librarians get confused and start making things up.
  • The Analogy: It's like trying to find a specific needle in a haystack. If the haystack is small, you find it. If the haystack is the size of a mountain, you might just grab a random piece of hay and pretend it's the needle because you're overwhelmed.
  • The Data: At a small library size (32K), the best AI lies about 1% of the time. But at a massive library size (200K), every AI starts lying more than 10% of the time. Some models, which were great at small libraries, completely collapse and lie 70% of the time in huge libraries.

2. The "Smartness" vs. "Honesty" Trap

You might think, "If a librarian is really smart, they won't lie." The paper proves this is false.

  • The Finding: Being good at finding facts (Grounding) is totally different from being good at refusing to make up facts (Fabrication Resistance).
  • The Analogy: Think of a model like Llama 3.1 70B. It's like a genius who can find any book in the library instantly (90%+ accuracy at finding facts). But it's also a compulsive storyteller who will invent a whole new chapter if you ask about something that isn't there. It's "smart but untrustworthy."
  • The Contrast: Other models, like the GLM 4.5, are slightly less "flashy" at finding things but are incredibly disciplined. They rarely invent things. The paper shows that honesty is a training habit, not just a result of being big and smart.
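The key insight here is that "finding facts" and "refusing to invent facts" are scored on different question sets. A minimal sketch of how such a split could be computed (the field names and structure are illustrative assumptions, not the paper's actual evaluation harness):

```python
def score_eval(results):
    """Split Q&A eval results into two separate metrics.

    Each result is a dict with (illustrative field names, not the paper's):
      answer_in_docs : bool -- is the answer actually in the documents?
      model_answered : bool -- did the model give a substantive answer?
      answer_correct : bool -- if it answered, was the answer right?
    """
    answerable = [r for r in results if r["answer_in_docs"]]
    unanswerable = [r for r in results if not r["answer_in_docs"]]

    # Grounding: of questions whose answers ARE in the docs,
    # how often does the model find the right one?
    grounding = sum(r["answer_correct"] for r in answerable) / len(answerable)

    # Fabrication rate: of questions whose answers are NOT in the docs,
    # how often does the model answer anyway instead of saying "I don't know"?
    fabrication = sum(r["model_answered"] for r in unanswerable) / len(unanswerable)

    return {"grounding": grounding, "fabrication_rate": fabrication}
```

Because the two numbers come from disjoint question sets, a model can score high on both at once: excellent grounding and a high fabrication rate, which is exactly the "smart but untrustworthy" pattern described above.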

3. The "Temperature" Knob (The Chaos Dial)

AI models have a setting called "Temperature."

  • Low Temp (0.0): The model is a robot. It picks the most logical next word every time.
  • High Temp (1.0): The model is a jazz musician. It takes risks and tries different words.
  • The Old Rule: Everyone thought, "For facts, always set the dial to 0.0 (Robot Mode)."
  • The New Discovery: This is dangerous advice!
    • The "Robot" Glitch: When set to 0.0, especially with huge libraries, the AI sometimes gets stuck in a loop, repeating the same sentence forever (like a broken record). This happens 48 times more often at 0.0 than at higher settings!
    • The "Jazz" Benefit: Surprisingly, turning the dial up a little (to 0.4 or 0.7) actually helps the AI stop lying in some cases and prevents it from getting stuck in loops.
    • Takeaway: Don't just set it to zero. You need to tune it carefully.
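Under the hood, the "chaos dial" is just a divisor applied to the model's raw scores before they become probabilities. A small self-contained sketch of that mechanism (toy numbers, standard temperature-scaled softmax, not any particular model's implementation):

```python
import math

def sample_distribution(logits, temperature):
    """Convert raw model scores (logits) into next-word probabilities.

    Lower temperature sharpens the distribution toward the top choice
    ("robot mode"); higher temperature flattens it ("jazz mode").
    temperature must be > 0; in practice, temperature == 0 is handled
    separately as plain argmax (always pick the highest-scoring word).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate next words

cold = sample_distribution(logits, 0.1)  # near-deterministic: top word dominates
warm = sample_distribution(logits, 0.7)  # more spread: other words get real chances
```

This is why a purely deterministic setting can get stuck: if the single highest-probability continuation is a repeat of the previous sentence, argmax will pick it again every time, while even a modest temperature gives the model a way out of the loop.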

4. The Hardware Myth

People often worry: "If I run this AI on an NVIDIA chip, will it lie less than on an AMD or Intel chip?"

  • The Finding: No. It doesn't matter.
  • The Analogy: It's like asking whether a trip changes if the car has a Honda engine or a Ford engine. If the driver (the AI model) is the same, the trip goes the same way. The paper tested three different hardware giants and found no meaningful difference in how much the AI lied. You can pick your hardware based on price, not fear of hallucinations.

5. The "Best Librarian" Ranking

The paper tested 35 different AI models. Here is the hierarchy they found:

  • The Gold Standard: GLM 4.5 is the current champion. It lies only about 1.2% of the time at small library sizes.
  • The "Big but Risky" Models: The massive Llama models (like the 405B or 70B) are very popular, but they lie 25% to 50% of the time. They are great at finding facts but terrible at knowing when not to answer.
  • The "Small but Honest" Models: Some smaller models (like MiniMax) are surprisingly honest and reliable.

The Big Picture: What Should You Do?

If you are a business owner trying to use AI to answer questions from your documents:

  1. Don't trust the biggest model. Size doesn't equal honesty. Pick a model family known for being "honest" (like GLM or MiniMax) rather than just the one with the most parameters.
  2. Watch your library size. If you feed the AI too much text at once, it will start hallucinating. Break big documents into smaller chunks.
  3. Don't set the temperature to zero blindly. It might make the AI get stuck in loops or lie more. Try a middle setting (0.4 or 0.7).
  4. Accept that lies happen. Even the best AI will lie about 1% of the time. You need a safety net (human review or other checks) to catch those lies.
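The chunking advice in point 2 can be sketched in a few lines. This is a rough word-based splitter under the assumption that words approximate tokens; a real pipeline would count tokens with the model's own tokenizer, and the function name and parameters here are illustrative:

```python
def chunk_text(text, max_words=2000, overlap=200):
    """Split a long document into overlapping word-based chunks.

    max_words approximates a context budget (real systems count tokens
    with the model's tokenizer; words are a rough stand-in here).
    The overlap keeps facts that straddle a chunk boundary fully
    visible in at least one chunk.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the document
    return chunks
```

Feeding each chunk to the model separately keeps every individual request in the "small library" regime where the paper measured the lowest hallucination rates.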

In short: AI is getting smarter at finding information, but it hasn't learned to be perfectly honest yet. The more text you give it, the more likely it is to make things up. Choose your tools wisely, and don't let the "biggest" model fool you.