How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations

This study presents a large-scale audit of 10 commercial LLMs revealing significant variation in citation hallucination rates across models and domains, while demonstrating that prompt-induced fabrication can be effectively mitigated through multi-model consensus, within-prompt repetition, and a lightweight bibliographic classifier that detects phantom citations without external database queries.

MZ Naser

Published 2026-03-05
📖 5 min read · 🧠 Deep dive

Imagine you ask a very smart, well-read robot to write a research paper for you. You tell it, "Please list the most important books and articles that prove your point." The robot, eager to please, types out a long list of citations with authors, titles, and years. It looks perfect. It looks professional.

But here's the catch: Many of those books and articles don't actually exist.

This paper is a massive investigation into exactly how often this happens, why it happens, and how we can catch the robot lying before it ruins your homework (or your PhD).

Here is the breakdown of the study, translated into everyday language with some helpful analogies.

1. The Great "Fake Book" Audit

The researchers didn't just guess; they put 10 different popular AI models (like the brains behind ChatGPT, Claude, and others) through a grueling test.

  • The Test: They asked these AIs to write about four different topics (like building bridges, climate change, medicine, and computer science) and requested references.
  • The Scale: They generated nearly 70,000 citations.
  • The Check: They ran every single one of those citations through a "truth detector" (checking against three massive databases of real academic papers).
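At its core, that "truth detector" step is string matching against a trusted index. Here is a minimal sketch of the idea, assuming a local set of known real titles as a stand-in for the actual bibliographic databases (the normalization rules and the sample titles below are illustrative, not the paper's pipeline):

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so minor
    formatting differences don't cause false mismatches."""
    title = title.lower()
    title = re.sub(r"[^\w\s]", "", title)
    return re.sub(r"\s+", " ", title).strip()

def verify_citations(citations, known_titles):
    """Mark each cited title as verified (True) or phantom (False)
    against a set of normalized known-real titles."""
    known = {normalize_title(t) for t in known_titles}
    return {c: normalize_title(c) in known for c in citations}

# Toy example: one real-looking match, one phantom.
database = ["Attention Is All You Need",
            "Deep Residual Learning for Image Recognition"]
cited = ["Attention is all you need.",
         "A Survey of Phantom Methods (2024)"]
print(verify_citations(cited, database))
```

A production check would query services like Crossref or Scopus instead of a local set, but the logic is the same: if a normalized citation has no match anywhere, it is a candidate phantom.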

The Result: The AIs were lying a lot.

  • Some models were "honest" about 88% of the time.
  • Others were lying more than 50% of the time.
  • The Big Discovery: When the researchers asked the AIs questions without asking for references, zero fake citations appeared. This proves the AIs aren't naturally prone to lying; they only start fabricating sources when you specifically ask them to "cite your sources." It's like a student who knows the material but starts making up sources only when the teacher says, "Show me your work."

2. The "Time Travel" Trap

The researchers noticed something weird about when the AIs lied.

  • The "Old Classics" Test: When asked for "seminal" (famous, old) papers, the AIs did okay.
  • The "New News" Test: When asked for "recent" papers, the lying skyrocketed.

The Analogy: Imagine a librarian who has read every book in the library up to 2023. If you ask for a book from 1990, they can find it easily. But if you ask for a book published last week, they haven't read it yet. To avoid admitting they don't know, they invent a title that sounds like a real book from last week. The AIs are doing the same thing; they are guessing at recent events because their "memory" (training data) is outdated.

3. The "Crowd Wisdom" Solution

So, how do we stop the lying? The researchers found two clever, low-tech tricks that work like a "voting system."

  • Trick #1: The "Three-Headed Monster" Rule.
    If you ask three different AIs the same question and they all give you the exact same citation, it is almost certainly real (95% chance).
    • Why? It's hard for three different liars to accidentally invent the exact same fake book title. But it's easy for three honest librarians to find the same real book.
  • Trick #2: The "Ask Again" Rule.
    If you ask the same AI the same question three times, and it gives you the same citation twice, it's probably real.
    • Why? Fake citations are random guesses. Real citations are memories. If the AI remembers the same thing twice, it's likely a real memory.
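Both tricks boil down to counting agreement on a normalized citation string. A rough sketch of that voting logic (the normalization and the default thresholds here are my assumptions; the paper's version of Trick #1 used exact agreement across all three models):

```python
import re
from collections import Counter

def citation_key(citation: str) -> str:
    """Reduce a citation to a comparison key: lowercase, no punctuation."""
    return re.sub(r"[^\w\s]", "", citation.lower()).strip()

def consensus_filter(model_outputs, min_models=3):
    """Trick #1: keep citations that at least `min_models` different
    models produced independently for the same question."""
    counts = Counter()
    for citations in model_outputs:                # one list per model
        for k in {citation_key(c) for c in citations}:  # de-dup per model
            counts[k] += 1
    return {k for k, n in counts.items() if n >= min_models}

def repetition_filter(runs, min_repeats=2):
    """Trick #2: keep citations the same model repeated across re-runs
    of the same prompt -- structurally the same counting problem."""
    return consensus_filter(runs, min_models=min_repeats)
```

The key detail is de-duplicating within each model's (or run's) list first, so a model repeating itself inside one answer doesn't count as independent agreement.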

4. The "Fake ID" Detector (The AI Classifier)

The researchers also built a special tool—a "lie detector" that doesn't need to check the internet. It just looks at the shape of the citation string.

The Analogy: Think of a fake ID. Even if the photo looks good, the font might be slightly wrong, or the address might be too short.

  • Real Citations: Usually have longer author names, more authors listed, and older publication years.
  • Fake Citations: Often have very short author names, fewer authors, and strangely recent years (because the AI is trying to sound "up to date").

They trained a computer program to spot these "suspicious shapes." It can scan a list of citations and flag the likely fakes in a split second, saving you from having to check every single one manually.
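A toy version of such a detector, built only on the surface features the section above mentions (number of authors, author-name length, publication year). The feature thresholds and the assumed "Author1, Author2 (Year). Title." format are illustrative assumptions, not the paper's trained classifier:

```python
import re
from datetime import date

def citation_features(citation: str) -> dict:
    """Extract surface features from an 'Author1, Author2 (Year). Title.'
    style citation string."""
    year_match = re.search(r"\((\d{4})\)", citation)
    year = int(year_match.group(1)) if year_match else None
    author_part = citation.split("(")[0]
    authors = [a.strip() for a in author_part.split(",") if a.strip()]
    avg_len = sum(len(a) for a in authors) / len(authors) if authors else 0
    return {"n_authors": len(authors), "avg_author_len": avg_len, "year": year}

def suspicion_score(citation: str, current_year: int = date.today().year) -> int:
    """Count how many 'fake ID' red flags a citation shows."""
    f = citation_features(citation)
    flags = 0
    if f["n_authors"] <= 1:
        flags += 1      # fakes tend to list fewer authors
    if f["avg_author_len"] < 6:
        flags += 1      # fakes tend to have shorter author names
    if f["year"] is not None and f["year"] >= current_year - 1:
        flags += 1      # fakes skew suspiciously recent
    return flags
```

The real classifier would learn weights for these features from labeled examples rather than hard-coding thresholds, but the payoff is the same: a score computed from the citation string alone, with no database lookup.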

5. Bigger Isn't Always Better

You might think, "If I use the newest, most expensive AI model, it will lie less."

  • The Reality: Not necessarily.
  • One company (OpenAI) made a new model that lied much less than its older version.
  • Another company (Anthropic) made a new model that lied more than its older version.

It turns out that just making a model "smarter" or "bigger" doesn't automatically fix its ability to tell the truth about references. It depends on how the company trained it and what data they fed it.

The Bottom Line

AI is a powerful tool for writing, but it is a terrible librarian: ask it for references and it will happily make them up.

  1. Don't trust it blindly: If an AI gives you a citation, assume it might be fake until proven otherwise.
  2. Use the "Voting" method: If multiple AIs agree on a source, it's likely real.
  3. Check the "ID": If a citation looks too simple or too recent, be suspicious.

The study ends with a clear message: AI can help you write, but you must be the one to verify the facts. The robot is the writer; you are the editor.