Succeeding at Scale: Automated Dataset Construction and Query-Side Adaptation for Multi-Tenant Search

This paper introduces DevRev-Search, an automated benchmark for technical support retrieval, and proposes an Index-Preserving Adaptation strategy that fine-tunes only the query encoder to achieve scalable, high-performance multi-tenant search without the prohibitive cost of re-indexing.

Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis

Published 2026-03-05

Imagine you run a massive, bustling library that serves thousands of different neighborhoods (tenants). Each neighborhood has its own unique set of books, rules, and slang. Your goal is to build a super-smart librarian robot that can instantly find the right book for any question a visitor asks.

However, you face two massive problems:

  1. The "Dark Data" Problem: You have millions of books, but no one has written "review cards" saying which books are actually good answers to specific questions. It's like having a library where the books are there, but the catalog is blank. You can't train your robot because you don't know what "good" looks like.
  2. The "Re-Shelving" Tax: Every time you want to teach your robot a new trick, you usually have to take every single book off the shelf, re-read it, and re-shelve it in a new order. If you have 1,000 neighborhoods, doing this for every single update would take forever and cost a fortune.

This paper, "Succeeding at Scale," introduces a new way to solve both problems. Here is how they did it, explained simply:

1. Building the Training Manual Without Humans (The "AI Detective" Pipeline)

Usually, you need human experts to read questions and mark the correct answers to train a search engine. But that's slow and expensive.

The authors built an automated factory to create this training data:

  • The Scavenger Hunt: Instead of relying on one search tool, they sent out a team of seven different "scouts" (some look for exact word matches, others look for meaning). They gathered every possible answer these scouts could find.
  • The Judge: They then used a super-smart AI (an LLM) as a "Judge." This Judge looked at the pile of answers and asked: "Does this actually answer the question, or is it just a fancy-looking distraction?"
  • The Result: The AI filtered out the junk and kept only the gold. They created a massive, high-quality training dataset (called DevRev-Search) without any human ever manually labeling a document.
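The pipeline above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function names are hypothetical, the "judge" here is a trivial keyword check standing in for an LLM call, and the real system uses seven retrievers with carefully designed prompts and filtering.

```python
def pool_candidates(query, retrievers, k=10):
    """Union the top-k results from several retrievers (the 'scouts')."""
    candidates = set()
    for retrieve in retrievers:
        candidates.update(retrieve(query, k))
    return candidates

def judge_relevance(query, doc):
    """Stand-in for the LLM judge: accept a doc if it shares any query word."""
    return any(word in doc.lower() for word in query.lower().split())

def build_labels(query, retrievers, corpus):
    """Keep only the pooled candidates that the judge accepts as true answers."""
    ids = pool_candidates(query, retrievers)
    return {i for i in ids if judge_relevance(query, corpus[i])}
```

The key design idea is the division of labor: cheap, diverse retrievers maximize recall of plausible candidates, and the (expensive) judge is only invoked on that small pool to maximize precision.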

2. The "One-Sided" Makeover (Index-Preserving Adaptation)

This is the paper's biggest breakthrough.

In the old way, to make the librarian smarter, you had to reorganize the entire library (the documents) every time.

  • The Old Way: Imagine you want to teach the librarian how to understand a new neighborhood's slang. You have to re-shelve every book in the entire building to match the new slang. Impossible for a huge library.
  • The New Way (Query-Only Adaptation): The authors realized they only needed to change the Librarian's brain (the query encoder), not the books themselves.
    • They kept the library shelves exactly as they were (frozen document index).
    • They only gave the librarian a "brain upgrade" to understand the specific questions from that neighborhood.
    • The Analogy: It's like giving your librarian a pair of specialized glasses for a specific customer. You don't need to move the books; you just change how the librarian looks at the question. This makes updates instant and cheap.
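In code, the idea reduces to this: the document embeddings are computed once and never touched again, while a small tenant-specific transform is applied on the query side before search. The numpy sketch below is illustrative only; the real system fine-tunes a neural query encoder per tenant, and `encode_query` and `W_tenant` here are hypothetical stand-ins.

```python
import numpy as np

# Frozen document index: embeddings computed once, never re-encoded.
doc_index = np.array([[1.0, 0.0],
                      [0.0, 1.0]])  # 2 docs, 2-d embeddings

def encode_query(text):
    """Stand-in for the shared base query encoder."""
    return np.array([0.9, 0.1]) if "reset" in text else np.array([0.1, 0.9])

def search(query_vec, index):
    """Return the index of the highest-scoring document (dot-product)."""
    return int(np.argmax(index @ query_vec))

# Tenant-specific adaptation lives ONLY on the query side:
W_tenant = np.array([[0.0, 1.0],
                     [1.0, 0.0]])  # a toy learned map (here: swaps dims)

base_hit = search(encode_query("reset password"), doc_index)
adapted_hit = search(W_tenant @ encode_query("reset password"), doc_index)
```

Note what did not change between the two searches: `doc_index`. That is the whole point; per-tenant behavior changes while the expensive index stays frozen, so rolling out an update costs nothing on the document side.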

3. The "Lightweight" Upgrade (Parameter-Efficient Fine-Tuning)

Even upgrading only the librarian's brain (the query encoder) is heavy if you retrain all of its parameters. So the authors used a technique called PEFT (Parameter-Efficient Fine-Tuning).

  • The Analogy: Instead of rebuilding the librarian's entire brain (which has billions of neurons), they just added a few smart sticky notes or a small cheat sheet to the librarian's desk.
  • They found that using a method called LoRA (Low-Rank Adaptation) is like giving the librarian a tiny, highly efficient notebook.
  • The Magic: This tiny notebook lets the librarian learn the new neighborhood's needs almost as well as a full brain rebuild, while training over 99% fewer parameters, which slashes the memory and compute needed per tenant.
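Here is what that "tiny notebook" looks like concretely: LoRA freezes the original weight matrix and learns only two small low-rank factors alongside it. The dimensions below are illustrative, not the paper's; the arithmetic just shows why the trainable-parameter count collapses.

```python
import numpy as np

d, r = 2048, 8  # hidden size and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))        # frozen pretrained weight (not trained)
A = rng.standard_normal((r, d)) * 0.01 # trainable low-rank factor
B = np.zeros((d, r))                   # trainable; zero-init so the update starts at 0

def lora_forward(x, alpha=16):
    """Adapted layer output: frozen path plus the scaled low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

full_params = d * d       # what full fine-tuning would update
lora_params = 2 * d * r   # what LoRA actually trains
print(f"trainable fraction: {lora_params / full_params:.2%}")
```

With these numbers the trainable fraction is under 1%, which is where the "over 99% fewer parameters" figure for LoRA-style adapters comes from. Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen one; training only nudges it away as needed.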

The Bottom Line

The authors proved that:

  1. You can build a high-quality training dataset using AI judges instead of human annotators.
  2. You can make a search engine smarter for specific customers without ever touching the massive database of documents.
  3. You can do this with a tiny, efficient "upgrade" that saves massive amounts of money and time.

In short: They figured out how to teach a giant, multi-tenant search engine to be a genius for every specific customer, without ever having to move a single book on the shelf. It's a win for speed, cost, and quality.