Imagine you are asking a very smart, well-read librarian (the AI) a complex question. The librarian gives you a brilliant answer, but when you ask, "Where did you get that information?" they point to the wrong books, or they forget to point to any books at all.
This is the problem of Citation Failure.
The paper you provided, titled "Citation Failure in LLMs," is like a detective story where the authors investigate why this happens and how to fix it without hiring a whole new team of librarians.
Here is the breakdown in simple terms:
1. The Problem: The "Smart but Forgetful" Librarian
When AI models (LLMs) answer questions using a system called RAG (Retrieval-Augmented Generation), they are supposed to find facts in a pile of documents and tell you exactly which document they used.
- Response Failure: The AI gives you a wrong answer. (Easy to spot: "The capital of France is London.")
- Citation Failure: The AI gives you the right answer, but points to the wrong evidence or no evidence at all. (Harder to spot: "The capital of France is Paris," but the AI points to a book about London.)
The Big Mistake: Previous research treated these two problems as the same. The authors say, "Wait a minute! If the answer is right, the AI knows the truth. It just failed to show its homework." They realized that if you don't separate "getting the answer wrong" from "forgetting to cite," you can't fix the citation problem properly.
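The separation the authors argue for can be captured in a tiny decision table. Here is a minimal sketch of that taxonomy; the function name and labels are illustrative, not anything from the paper itself:

```python
# Classify one RAG output by crossing answer correctness with citation
# correctness -- the two dimensions the paper says must be kept apart.

def classify_output(answer_correct: bool, citations_correct: bool) -> str:
    """Label a single model output. All names here are illustrative."""
    if not answer_correct:
        return "response failure"   # wrong answer: easy to spot
    if not citations_correct:
        return "citation failure"   # right answer, wrong or missing evidence
    return "success"

# Example: "Paris" is right, but the cited passage is about London.
print(classify_output(answer_correct=True, citations_correct=False))
# prints "citation failure"
```

Keeping these labels separate is what lets you measure (and fix) citation failure on its own, instead of lumping it in with plain wrong answers.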
2. The Investigation: Building a Better Test (CITECONTROL)
To study this, the authors built a special test lab called CITECONTROL.
Think of this like a video game level designer for AI.
- They created questions where they knew the answer was correct.
- They varied the "difficulty" of the connection between the answer and the source document.
- Easy Level (Explicit): The answer is written word-for-word in the source document. (Like finding a quote in a book).
- Hard Level (Implicit): The answer requires connecting two different documents. (Document A says "Kinshasa is the capital." Document B says "A coup happened in Kinshasa." The AI must connect the dots to answer "When did the coup happen in the capital?").
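To make the two difficulty levels concrete, here is what an explicit and an implicit test item might look like as data. The field names and schema are hypothetical, invented for illustration; they are not CITECONTROL's actual format:

```python
# Illustrative data shapes for the two difficulty levels described above.
# Field names are hypothetical, not CITECONTROL's real schema.

explicit_example = {
    "question": "What is the capital of the DRC?",
    "answer": "Kinshasa",
    "docs": ["Kinshasa is the capital of the DRC."],
    "gold_citations": [0],            # the answer appears verbatim in doc 0
}

implicit_example = {
    "question": "When did the coup happen in the capital?",
    "answer": "1997",
    "docs": [
        "Kinshasa is the capital of the DRC.",     # bridge fact (doc 0)
        "A coup took place in Kinshasa in 1997.",  # final fact (doc 1)
    ],
    "gold_citations": [0, 1],         # both documents are needed
}
```

The key knob is `gold_citations`: on the easy level one document suffices, while on the hard level a correct citation must include the bridge document as well.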
What they found:
- Small AI models get lost even on easy levels.
- Even huge, powerful AI models get confused on "Hard Levels" (multi-hop reasoning). They often find the answer but forget to cite the first document that started the chain of logic.
- The AI tends to "under-cite," meaning it finds the answer but only points to the final piece of evidence, ignoring the steps it took to get there.
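"Under-citing" has a natural quantitative reading: high citation precision but low citation recall against the gold set of documents. A small sketch of that metric (the function name is illustrative, not the paper's):

```python
# Quantify under-citing: compare the model's cited documents against the
# gold citation set. Function name is illustrative, not from the paper.

def citation_precision_recall(predicted: set, gold: set) -> tuple:
    """Return (precision, recall) of predicted citations vs. the gold set."""
    if not predicted:
        return 0.0, 0.0               # cited nothing at all
    hits = len(predicted & gold)
    return hits / len(predicted), hits / len(gold)

# Multi-hop case: the model cites only the final document (doc 1),
# ignoring the bridge document (doc 0) that started the chain.
precision, recall = citation_precision_recall({1}, {0, 1})
print(precision, recall)   # prints "1.0 0.5"
```

A precision of 1.0 with a recall of 0.5 is exactly the "found the answer but only pointed to the final piece of evidence" pattern described above.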
3. The Solution: The "CITENTION" Toolkit
The authors wanted to fix this without retraining the AI (which is expensive and slow, like rebuilding a car engine). Instead, they created a toolkit called CITENTION.
Imagine the AI is a chef cooking a meal.
- Generative Citation: The chef writes the recipe while cooking. Sometimes they forget to list an ingredient.
- Retrieval-Based: A separate robot scans the pantry and says, "You used flour, so cite the flour bag." This is fast but sometimes misses the nuance.
- Attention-Based (The Secret Sauce): This looks at the chef's brain activity (specifically, the attention weights the model assigns to words in its input). Even if the chef doesn't write down the ingredient, their brain "glows" when they think about the flour. The authors realized they could read this "brain glow" to see which documents the AI actually looked at.
The Magic Combination:
They didn't just pick one method. They built a system that combines three things:
- What the AI says (Generative).
- What the AI looks at internally (Attention).
- What a search engine finds (Retrieval).
It's like having a three-person committee vote on which book to cite. If the writer forgets, the "brain scanner" might catch it. If the scanner is confused, the search engine might help.
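One simple way to read the committee analogy is as a weighted combination of per-document scores from the three signals. This is a sketch under assumed mechanics; the weights, threshold, and function name are invented for illustration and are not the paper's exact method:

```python
# Combine three per-document citation signals into one citation set.
# Weights and threshold are illustrative; a real system would tune them.

def fuse_citation_scores(generative, attention, retrieval,
                         weights=(0.4, 0.4, 0.2), threshold=0.4):
    """Each argument is a list of per-document scores in [0, 1]."""
    fused = [
        weights[0] * g + weights[1] * a + weights[2] * r
        for g, a, r in zip(generative, attention, retrieval)
    ]
    return {i for i, score in enumerate(fused) if score >= threshold}

# Doc 0: the "writer" forgot it (score 0.0), but attention glowed on it
# and the retriever also matched it, so the committee recovers the citation.
gen  = [0.0, 1.0]   # what the AI says
attn = [0.9, 0.8]   # what the AI looks at internally
retr = [0.7, 0.6]   # what a search engine finds
print(fuse_citation_scores(gen, attn, retr))   # prints "{0, 1}"
```

Note how document 0 makes it into the final set even though the generative signal alone would have dropped it: the other two committee members outvote the forgetful writer.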
4. The Results: A Smarter, More Honest AI
When they tested this new toolkit:
- It worked: The AI started citing the right documents much more often, even on the "Hard Levels."
- It was efficient: They didn't need to retrain the AI. They just added a small layer of "smart checking" on top.
- The "Masking" Trick: They found that citations improved if they temporarily hid the "reasoning words" (the model's intermediate thinking steps) while reading the attention. It's like grading a student's final answer with the scratch work covered up: the scratch work helps solve the problem, but it distracts from the evidence that actually supports the answer.
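The masking idea can be sketched as skipping the model's own reasoning tokens when aggregating attention into per-document scores, so they don't dilute the attention paid to the source documents. The token roles and numbers below are invented for illustration, not the paper's implementation:

```python
# Aggregate per-token attention mass into per-document citation scores,
# optionally masking out the model's own reasoning tokens first.
# All values here are toy numbers for illustration.

def doc_attention_scores(attention, token_doc_ids, mask_reasoning=True):
    """attention: attention mass per token; token_doc_ids: the document
    index each token belongs to, or None for reasoning tokens."""
    scores, kept = {}, 0.0
    for att, doc in zip(attention, token_doc_ids):
        if doc is None:
            if mask_reasoning:
                continue              # the masking trick: skip reasoning tokens
            doc = "reasoning"
        scores[doc] = scores.get(doc, 0.0) + att
        kept += att
    # Renormalise so document scores are comparable across settings.
    return {d: s / kept for d, s in scores.items()} if kept else scores

attention     = [0.1, 0.2, 0.5, 0.2]   # toy attention mass per token
token_doc_ids = [0,   0,   None, 1]    # token 2 is a reasoning token
print(doc_attention_scores(attention, token_doc_ids))
```

With masking on, documents 0 and 1 split all the normalized score; with masking off, the reasoning token would absorb half the attention mass and blur the comparison between the two documents.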
The Takeaway
This paper teaches us that AI is often smarter than it admits. It knows the answer but fails to show its work. By separating "wrong answers" from "missing citations" and using a mix of tools (including looking inside the AI's "brain" via attention), we can make AI much more trustworthy and easier to verify.
In short: Don't just trust the AI's answer; check its homework. And if it forgets to show its work, use a little bit of "mind-reading" technology to help it remember.