Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

Across diverse text and multimodal benchmarks, this paper empirically demonstrates that coverage-based retrieval metrics are reliable early indicators of the information coverage of RAG-generated responses, particularly when retrieval objectives align with generation goals.

Saron Samuel, Alexander Martin, Eugene Yang, Andrew Yates, Dawn Lawrie, Ian Soboroff, Laura Dietz, Benjamin Van Durme

Published Wed, 11 Ma

Imagine you are a head chef (the AI) trying to cook a perfect, complex banquet for a guest (the user). The guest doesn't just want a list of ingredients; they want a delicious, well-organized meal that covers all the flavors they asked for, without any boring repeats.

To do this, the chef relies on a scout (the Retrieval System) who runs out to the market to gather ingredients (documents/videos) before the cooking begins.

This paper asks a very practical question: "If the scout brings back a basket full of diverse, high-quality ingredients, does the chef automatically make a better meal?"

Here is the breakdown of their findings using simple analogies:

1. The Old Way vs. The New Way

  • The Old Way (Ad-hoc Search): Imagine you ask a librarian, "Where is the book on cats?" They hand you the single best book. That's great if you just want to read one book.
  • The New Way (RAG - Report Generation): Now, imagine you ask, "Write a report on the history of cats, their diet, and their behavior in different cultures." The librarian can't just give you one book. They need to gather many books, pick out the best facts from each, and throw away the duplicates, so the AI (our chef) has a clean, diverse pile of information to work with.
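The "gather many, throw away the duplicates" step can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the word-overlap (Jaccard) check and the 0.8 threshold are assumptions standing in for whatever deduplication a real system uses.

```python
# Illustrative sketch of the "new way": gather many candidate documents,
# then drop near-duplicates so the generator sees diverse information.
# The Jaccard threshold and helper names are assumptions, not the paper's method.

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two documents."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def diversify(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept: list[str] = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "Cats were domesticated in the Near East.",
    "Cats were domesticated in the Near East.",   # exact duplicate: dropped
    "Cat behavior varies widely across cultures.",
]
print(len(diversify(docs)))  # 2: the duplicate adds nothing
```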

2. The Core Discovery: The "Scout" Matters Most

The researchers tested this by trying out 15 different "scouts" (retrieval systems) and 4 different "chefs" (AI generation pipelines) across text and video tasks.

The Big Finding: There is a strong, direct link between how good the scout is at gathering diverse information and how good the final report is.

  • The Analogy: If the scout brings back 10 apples and 10 oranges (high coverage), the chef can make a great fruit salad. If the scout brings back 20 apples (redundant information), the chef is stuck making a boring apple-only salad, no matter how talented the chef is.
  • The Metric: They found that if you measure the scout's success by "Did they find all the different types of facts we need?" (called Nugget Coverage), you can predict the quality of the final report with high accuracy. You don't even need to wait for the chef to cook the meal to know if the ingredients were good.
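The intuition behind a coverage score like this can be sketched as the fraction of required facts ("nuggets") that appear somewhere in the retrieved set. This is a toy approximation, assuming substring matching stands in for real nugget judging (which typically uses an LLM or a human assessor); the function and variable names are hypothetical.

```python
# Toy sketch of a nugget-coverage score: what fraction of the facts we
# need ("nuggets") is supported by at least one retrieved document?
# Substring matching here is an illustrative stand-in for real nugget judging.

def nugget_coverage(retrieved_docs: list[str], nuggets: list[str]) -> float:
    """Fraction of nuggets found in at least one retrieved document."""
    covered = sum(
        any(nugget.lower() in doc.lower() for doc in retrieved_docs)
        for nugget in nuggets
    )
    return covered / len(nuggets) if nuggets else 0.0

docs = [
    "Cats were domesticated in the Near East around 7500 BC.",
    "Cats were domesticated in the Near East around 7500 BC.",  # duplicate
    "A cat's diet is primarily carnivorous.",
]
nuggets = [
    "domesticated in the Near East",
    "diet is primarily carnivorous",
    "behavior in different cultures",
]
print(nugget_coverage(docs, nuggets))  # 2 of 3 nuggets covered
```

Note that the duplicate document adds nothing to the score: twenty apples still cover only one nugget.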

3. The "Complex Chef" Loophole

The researchers also tested what happens when you use a super-complex chef (an iterative AI that can think, ask for more ingredients, and rewrite its own questions).

  • The Finding: A complex chef can sometimes "fix" a bad scout. If the scout brings back bad ingredients, the complex chef might say, "Wait, I need more spices," and go back to the market themselves.
  • The Catch: This is expensive and slow. While a complex chef can compensate for a weak scout, it's much more efficient to just hire a better scout in the first place. Also, the complex chef sometimes gets so distracted by its own thinking that it stops listening to the scout entirely, making the scout's performance irrelevant.

4. Does this work for Videos too?

They tested this with video (like a chef trying to make a documentary using video clips).

  • The Twist: For videos about famous events (like the 2016 Olympics), the AI already "knows" a lot from its training (like a chef who has cooked this dish a thousand times). In these cases, the scout's job isn't to find new facts, but to verify the facts the chef already knows.
  • The Result: Even here, a good scout helps the chef be more accurate (factuality), though the link to "finding new info" is weaker because the chef already has the info in their head.

5. Why This Matters (The "So What?")

Currently, testing these AI systems is like tasting every single dish before serving it to a customer. It takes forever and costs a lot of money (computing power).

The Paper's Solution:
You don't need to taste the dish to know if it will be good. You just need to check the scout's basket.

  • If the retrieval system (the scout) is good at finding diverse, non-redundant information, the final AI report will likely be good.
  • This allows developers to skip the expensive "cooking" step during testing and just evaluate the "scouting" step. It saves time, money, and computing power.
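Checking the basket instead of tasting the dish amounts to asking whether retrieval-side coverage scores rank systems the same way generation-side report scores do. A rank correlation makes that visible; the numbers below are made-up illustrations, not the paper's data.

```python
# Hedged sketch: if retrieval coverage predicts report quality, rank
# correlation across systems will be high. Scores below are invented.

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation (assumes no ties): Pearson on ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # equal for both rank vectors
    return cov / var

coverage = [0.42, 0.55, 0.61, 0.70, 0.83]  # scout-side nugget coverage
report   = [0.40, 0.52, 0.58, 0.66, 0.80]  # chef-side report quality
print(spearman(coverage, report))  # 1.0: the rankings agree perfectly
```

With agreement like this, a developer can rank candidate retrieval systems by coverage alone and skip the generation step during iteration.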

Summary in One Sentence

If you want a great AI-generated report, focus on hiring a retrieval system that gathers a wide variety of unique facts; a good ingredient list is the best predictor of a delicious meal, even if your chef is trying to be fancy.