Imagine you are a detective trying to solve a mystery. You have a vague hunch about who the culprit is (your Query), but you need more clues to be sure. In the world of search engines, this process of gathering extra clues to sharpen your search is called Pseudo-Relevance Feedback (PRF).
For a long time, detectives (search engines) would look at the top few files in the library (the Corpus) that seemed relevant, read them, and pull out new keywords to refine their search.
But now, we have a super-smart AI assistant (a Large Language Model or LLM) that can help. The big question this paper asks is: How should we use this AI assistant to help us solve the mystery?
The authors, Nour Jedidi and Jimmy Lin, realized that prior work kept mixing up two different tools in the toolbox: where the extra clues come from, and how they are used. They decided to separate the two and test them one at a time, like a scientist in a lab.
Here is the breakdown of their study using simple analogies:
The Two Main Ingredients
The paper says every PRF method has two parts:
- The Feedback Source (Where do the clues come from?):
- The Library (Corpus): The AI reads real documents from the database.
- The Dream (LLM): The AI imagines what the answer might look like and writes a fake document based on its own knowledge.
- The Mix: Using both real documents and the AI's imagination.
- The Feedback Model (How do we use the clues?):
- This is the recipe. Do we just paste the new words onto the old search? Do we average them out? Do we weigh them carefully?
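For the technically curious, the two ingredients can be sketched as a tiny pipeline. Everything here (the function names, the k=3 cutoff, the "concat" recipe) is illustrative shorthand, not the paper's actual code:

```python
# Sketch of the two-part PRF decomposition described above.
# All names and defaults are illustrative, not from the paper.

def get_feedback(source, query, retriever=None, llm=None, k=3):
    """Step 1: the Feedback Source -- where the clues come from."""
    if source == "corpus":   # the Library: top-k real documents
        return retriever(query)[:k]
    if source == "llm":      # the Dream: one generated pseudo-document
        return [llm(query)]
    if source == "mix":      # both real and generated text
        return retriever(query)[:k] + [llm(query)]
    raise ValueError(f"unknown source: {source}")

def apply_feedback(model, query, feedback_docs):
    """Step 2: the Feedback Model -- how the clues reshape the query."""
    if model == "concat":    # simply paste the new words onto the old search
        return query + " " + " ".join(feedback_docs)
    # ...other recipes (averaging, Rocchio) weigh terms more carefully
    raise ValueError(f"unknown model: {model}")
```

The point of the paper is that these two steps can be swapped independently, so each combination can be tested on its own.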
The Big Discoveries
1. The Recipe Matters More Than You Think (RQ1)
Analogy: Imagine you have a bag of delicious ingredients (the clues). If you just throw them all into a pot and stir randomly (a simple "Average" method), the soup might taste okay. But if you use a master chef's technique (the Rocchio method) to balance the flavors, the soup becomes a gourmet meal.
The Finding: The authors found that how you process the clues is critical. If you are working with the AI's "Dream" (generated text), a simple mixing method often fails; you need a sophisticated "chef's recipe" (Rocchio) to get the best results. If you use the wrong recipe, even the best ingredients won't save the dish.
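The "master chef's technique" has a concrete form. Classic Rocchio blends the original query's term weights with the average of the feedback documents' term weights. A minimal sketch over term-weight dictionaries (the alpha and beta values are common textbook defaults, not the paper's tuned settings):

```python
# Minimal Rocchio-style pseudo-relevance feedback over term-weight
# vectors (dicts mapping term -> weight). alpha/beta are illustrative
# textbook defaults, not the paper's configuration.

def rocchio(query_vec, feedback_vecs, alpha=1.0, beta=0.75):
    """Blend the original query with the mean of the feedback documents."""
    new_query = {term: alpha * w for term, w in query_vec.items()}
    n = len(feedback_vecs)
    for doc in feedback_vecs:
        for term, w in doc.items():
            # add each document's contribution to the running average
            new_query[term] = new_query.get(term, 0.0) + beta * w / n
    return new_query
```

Unlike simple concatenation, the alpha/beta weights keep the original query dominant while letting the feedback nudge it, which is why the "recipe" matters so much.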
2. Real vs. Fake Clues: The "Lazy" vs. "Hard" Worker (RQ2)
Analogy:
- The Lazy Worker (LLM Only): The AI sits at its desk and writes a fake "perfect answer" from its memory. It's fast, cheap, and doesn't need to look at the library.
- The Hard Worker (Corpus Only): The AI goes to the library, reads 100 real books, and picks the best ones. This takes a lot of time and effort.
The Finding:
- The Lazy Worker wins for speed: If you just want a quick, good-enough answer, having the AI write a fake document is the most cost-effective solution. It's surprisingly strong.
- The Hard Worker wins for quality (sometimes): If you have a very smart librarian (a strong initial search engine) who brings you only the best books, then reading those real books is better than the AI's imagination.
- The Catch: If your librarian is bad, sending the AI to the library is a waste of time. The AI's imagination is often safer and more consistent.
3. Mixing the Sources: The "Double-Check" Strategy (RQ3)
Analogy: Should you combine the AI's dream with the library books, or trust just one of them?
- For Dense Search (Modern AI search): Yes! It's like having two detectives. One looks at the library, the other uses their intuition. If you combine their reports side-by-side, you get a much stronger case.
- For Traditional Search (BM25): It's trickier. If you just mash them together, it doesn't help much. However, if you let the AI dream first, and then use that dream to help the librarian find better books, that works wonders.
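One standard way to combine two detectives' ranked lists "side-by-side" is reciprocal rank fusion (RRF). This is an illustrative fusion recipe, not necessarily the exact method the paper uses:

```python
# Reciprocal rank fusion (RRF): merge two ranked lists by summing
# 1/(k + rank) for each document. Illustrative only -- the paper's
# exact fusion recipe may differ. k=60 is the conventional constant.

def rrf(ranking_a, ranking_b, k=60):
    scores = {}
    for ranking in (ranking_a, ranking_b):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked highly by both detectives rises to the top, which is why combining reports side-by-side builds a "stronger case" than either one alone.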
4. The Cost of Time (Latency)
Analogy:
- The Dream (LLM Only): Takes 1 second.
- The Library (Corpus): Takes 10 seconds if the books are short, but 100 seconds if the books are long novels.
The Finding: The "Dream" method is the fastest. If you try to read real documents to get clues, your search gets slower, especially if the documents are long. If you want speed, stick to the AI's imagination. If you want maximum accuracy and don't mind waiting, read the real books (but only if your librarian is good at finding them).
The "Aha!" Moment
The paper reveals a surprising twist: Dense retrievers (modern AI search engines) are actually bad at using these extra clues.
Analogy: Imagine you have a super-smart GPS (Dense Retriever) and a classic paper map (BM25). You give both of them a new, detailed traffic report (the feedback).
- The Paper Map (BM25) immediately uses the report to reroute you perfectly.
- The GPS (Dense Retriever) gets confused by the extra data and can actually drive worse than before, even though it started out ahead.
The authors found that the old-school search method (BM25) is actually better at using these new AI-generated clues than the fancy modern AI search methods!
Summary for the General Public
This paper is a "user manual" for the future of search engines. It tells us:
- Don't just throw AI-generated text at a search engine; use a smart method to mix it in.
- If you want speed, let the AI imagine the answer.
- If you want the absolute best accuracy and have a good starting search, let the AI read real documents.
- Sometimes, the old-school search methods are actually better at using AI help than the new, fancy ones.
By untangling these methods, the authors hope future search engines will be faster, smarter, and more reliable.