Imagine you are trying to find a specific recipe in a massive, chaotic library. You type a simple search like "chicken soup," but the catalog misses great matches filed under "noodle soup" or "spicy chicken broth" because your exact words don't appear in them. This is the problem that Query Expansion solves: helping a search engine understand what you really mean by adding related words to your search.
For a long time, computers did this by looking at the top few search results, pulling a handful of words out of them, and hoping they fit. It was like asking a confused librarian to guess your next question based on a blurry photo.
This paper introduces a smarter, fully automated way to do this using Large Language Models (LLMs)—the same kind of AI that powers chatbots. Here is how their new system works, broken down into simple steps with analogies:
1. The "Practice Run" Library (Building the Example Pool)
Before the AI can help you search, it needs to learn the specific "language" of the library you are using (whether it's medical papers, Wikipedia, or general web news).
- The Old Way: Humans would manually write down examples of good searches and good answers. This is slow and expensive.
- The New Way: The system runs a quick, automated test run. It asks a basic search engine (BM25) to find the top 100 results for a few sample questions. Then it uses a neural re-ranker (MonoT5) to pick the best-matching result for each question.
- The Analogy: Imagine a chef wanting to learn how to make Italian food. Instead of hiring a teacher to write a textbook, the chef goes to a local Italian market, grabs the top 100 most popular ingredients, and studies them. Now, the chef has a "library" of authentic Italian examples without needing a human teacher.
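The pool-building step above can be sketched in a few lines of Python. Note that `bm25_like_score` and `rerank_score` here are toy stand-ins (a term-count scorer and a term-overlap scorer), not the paper's actual BM25 and MonoT5 components:

```python
# Sketch of automatic example-pool construction.
# The two scorers below are toy stand-ins for BM25 and MonoT5.

def bm25_like_score(query, doc):
    """Toy lexical score: how often the query's terms appear in the document."""
    terms = query.lower().split()
    words = doc.lower().split()
    return sum(words.count(t) for t in terms)

def rerank_score(query, doc):
    """Stand-in for a neural re-ranker such as MonoT5 (here: term-overlap ratio)."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def build_example_pool(questions, corpus, top_k=100):
    """For each sample question: retrieve top_k docs, keep the best re-ranked one."""
    pool = []
    for q in questions:
        candidates = sorted(corpus, key=lambda d: bm25_like_score(q, d),
                            reverse=True)[:top_k]
        best = max(candidates, key=lambda d: rerank_score(q, d))
        pool.append((q, best))  # a (question, good answer) example, no human needed
    return pool

corpus = [
    "chicken noodle soup recipe with homemade broth",
    "spicy chicken soup with chili and ginger",
    "garden salad with vinaigrette",
]
pool = build_example_pool(["chicken soup"], corpus)
print(pool[0][1])
```

The output of `build_example_pool` plays the role of the chef's self-collected "library" of examples.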
2. The "Smart Sampling" (Choosing the Right Examples)
When you ask a question, the AI needs to show it a few examples of how to answer similar questions. But which examples should it pick? If it picks random ones, it might get confused.
- The Strategy: The system uses a "clustering" trick. It groups all the examples it found earlier into different "neighborhoods" based on their meaning (like grouping all "soup" recipes together and all "salad" recipes together).
- The Analogy: Instead of picking 4 random recipes from the whole library, the system picks the "perfect representative" from the soup neighborhood, the salad neighborhood, the dessert neighborhood, and the main course neighborhood. This ensures the AI sees a diverse, balanced view of the topic before it tries to help you.
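This "one representative per neighborhood" idea can be sketched with a tiny k-means over toy 2-D "embeddings" (a real system would embed examples with a neural model; the coordinates and cluster count here are illustrative assumptions):

```python
import math

def kmeans(points, k, iters=20):
    """Tiny k-means: returns (cluster assignments, centroids).
    Initializes centroids from the first k points for determinism."""
    centroids = [tuple(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign, centroids

def pick_representatives(examples, embeddings, k):
    """One example per cluster: the one nearest its cluster's centroid."""
    assign, centroids = kmeans(embeddings, k)
    reps = []
    for c in range(k):
        members = [i for i in range(len(examples)) if assign[i] == c]
        if members:
            best = min(members, key=lambda i: math.dist(embeddings[i], centroids[c]))
            reps.append(examples[best])
    return reps

# Toy embeddings: soups cluster near (0, 0), salads near (10, 10).
examples = ["chicken soup", "noodle soup", "beef stew", "garden salad", "caesar salad"]
embeddings = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0), (10.0, 10.0), (11.0, 10.0)]
reps = pick_representatives(examples, embeddings, k=2)
print(reps)
```

With two clusters, the sketch returns one soup example and one salad example, mirroring the "one recipe per neighborhood" analogy.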
3. The "Two-Headed Brain" (Multi-LLM Expansion)
Now, the system asks the AI to rewrite your search query to be more helpful. But AIs sometimes "hallucinate," confidently making things up. To guard against this, the system uses not one AI but two.
- The Process:
- AI #1 (let's call it "Alex") looks at your question and the examples, then writes a new, expanded search query.
- AI #2 (let's call it "Sam") does the exact same thing independently.
- The Result: You now have two different suggestions. Maybe Alex added "spicy" and Sam added "homemade."
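A minimal sketch of this two-expander step: the prompt format and the two lambda "models" below are hypothetical stand-ins with canned outputs, used only to show the flow (a real system would call two different LLMs with the few-shot prompt):

```python
def build_prompt(query, examples):
    """Few-shot prompt: show (question, good passage) pairs, then the new query."""
    lines = [f"Query: {q}\nRelevant passage: {doc}\n" for q, doc in examples]
    lines.append(f"Query: {query}\nExpanded query:")
    return "\n".join(lines)

# Two hypothetical "models" with different tastes (canned outputs for illustration).
alex = lambda prompt, query: query + " spicy broth"
sam = lambda prompt, query: query + " homemade noodle"

examples = [("tomato soup", "classic tomato soup recipe")]
query = "chicken soup"
prompt = build_prompt(query, examples)

# Each model expands the same query independently, from the same prompt.
candidates = [model(prompt, query) for model in (alex, sam)]
print(candidates)
```

The point of the design is that the two candidate expansions come from independent runs, so their errors are less likely to overlap.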
4. The "Editor-in-Chief" (Refinement)
Having two suggestions is good, but having them mashed together randomly is messy. So, they bring in a third AI, the Refiner.
- The Job: The Refiner looks at what Alex wrote and what Sam wrote. It acts like a skilled editor. It says, "Alex's idea about 'spicy' is great, and Sam's idea about 'homemade' is perfect. Let's combine them into one smooth, perfect sentence and throw out the nonsense."
- The Analogy: Imagine two architects designing a house. One focuses on the kitchen, the other on the garden. A third architect (the Refiner) looks at both blueprints and draws one final, cohesive house plan that includes the best of both, removing any conflicting ideas.
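The refinement step can be sketched as a merge over the two candidates. The term-level union below is a simplification standing in for the third LLM's editorial judgment, which would also drop irrelevant or conflicting additions:

```python
def refine(query, candidates):
    """Stand-in refiner: merge the original query with every novel term the
    candidate expansions proposed, dropping duplicates. A real system would
    prompt a third LLM to keep the good additions and discard the nonsense."""
    merged = list(query.split())
    for cand in candidates:
        for term in cand.split():
            if term not in merged:
                merged.append(term)
    return " ".join(merged)

merged = refine("chicken soup",
                ["chicken soup spicy broth", "chicken soup homemade noodle"])
print(merged)
```

The merged query keeps Alex's "spicy broth" and Sam's "homemade noodle" without repeating the shared terms, which is the "one cohesive blueprint" the architect analogy describes.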
Why This Matters
The paper tested this system on three very different types of "libraries":
- General Web Search (TREC DL20)
- Wikipedia/Entity Search (DBPedia)
- Scientific Medical Papers (SciFact)
The Results:
- Better than guessing: It worked much better than just using random words or a single AI.
- No human needed: It built its own "textbook" of examples automatically.
- Stronger than the sum of parts: The "Two-Headed Brain + Editor" approach was significantly better than just using one AI. It found more relevant documents and missed fewer important ones.
The Big Picture
This paper is like inventing a self-teaching, self-editing search assistant. Instead of relying on humans to write rules or examples, the system builds its own knowledge base, picks the best examples to study, asks two different experts for advice, and then has a third expert merge their advice into a perfect answer. It makes searching for information faster, more accurate, and adaptable to any topic, from cooking to cancer research.