SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation

Imagine you are a detective trying to solve a massive, complex mystery. You have a stack of 100 different files on your desk: some are financial reports, some are scientific studies, and some are old letters. To solve the case, you need to find a specific clue hidden in every single one of those files and then piece them all together.

Here is how the old way of doing this fails, and how the new method described in this paper (SPD-RAG) fixes it.

The Old Way: The Overwhelmed Detective

In the traditional method (called Standard RAG), you have one super-smart detective (the AI). You give them the whole stack of 100 files and ask, "What's the answer?"

The Problem: The detective's brain has a limit. They can only read a few pages at a time. So, they quickly skim the first 5 files, grab what looks important, and ignore the other 95. If the crucial clue was in file #42, they miss it.
The Alternative: You could try to feed the detective all 100 files at once by giving them a super-brain (a "Long-Context" model). But even super-brains get tired. When you give them a massive amount of text, they start to hallucinate (make things up) or get confused, like a person trying to drink from a firehose. They might miss the forest for the trees.

The New Way: SPD-RAG (The Specialized Task Force)

The authors of this paper created SPD-RAG. Instead of one overwhelmed detective, they built a specialized task force.

Here is how it works, using a simple analogy:

1. The Commander (The Coordinator)

First, you have a smart Commander. When you ask a question, the Commander doesn't try to read the files themselves. Instead, they break the big question down into small, specific instructions.

Example: "Okay team, we need to find all mentions of 'profit margins' in these reports."

2. The Specialists (The Document Agents)

The Commander assigns one dedicated specialist to each single document.

Specialist A gets only the Financial Report.
Specialist B gets only the Scientific Paper.
Specialist C gets only the Old Letter.

Because each specialist only has to look at one file, they can read it incredibly carefully. They aren't distracted by the other 99 files. They dig deep, find every single relevant clue in their specific file, and write a short report on what they found.

3. The Synthesizer (The Merging Layer)

Once all the specialists finish their reports, they hand them to a Synthesizer.

The Synthesizer takes all these small, focused reports and combines them into one final answer.
If there are too many reports to read at once, the Synthesizer groups similar reports together (like sorting files into folders), summarizes the folders, and then summarizes the folders of folders, until everything fits into one final answer.

Why is this better?

1. No Clues Left Behind
In the old method, the AI might skip a file because it was "too long" or "not in the top 5." In SPD-RAG, every single file gets its own private detective, so no file is skipped.

2. Cheaper and Smarter
The "Specialists" don't need to be super-expensive, super-smart models. They just need to be good at reading one file. The paper used a cheaper, faster AI for the specialists and saved the expensive, super-smart AI for the Commander and the Synthesizer.

Result: They got a much better answer (a score of 58.1) than the old retrieval methods (about 33), while spending roughly 38% of what it costs to feed every file to a long-context model.

3. Handling the "Needle in a Haystack"
The paper tested this on a very hard challenge called Loong, where you have to find facts scattered across many long documents (up to roughly 250,000 tokens of text in total).

Old AI: Got lost in the haystack and missed the needle.
SPD-RAG: Sent a specialist to every single piece of hay, found the needle, and brought it back.

The Bottom Line

Think of SPD-RAG as moving from a "One-Man Show" to a "Specialized Assembly Line."

Instead of asking one giant brain to swallow a library and spit out an answer, you ask a team of focused experts to read one book each, take notes, and then combine their notes. It's cheaper than handing the whole library to one giant brain, and much more accurate than skimming a few books when the information is scattered across a massive amount of text.

Here is a detailed technical summary of the paper SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation.

1. Problem Statement

Current Retrieval-Augmented Generation (RAG) systems face significant bottlenecks when answering complex queries that require synthesizing evidence scattered across vast, heterogeneous document corpora (e.g., financial reports or academic papers).

Standard RAG Limitations: Traditional pipelines retrieve a fixed number of top-$K$ documents. If the answer depends on information distributed across many documents beyond the top-$K$, critical evidence is discarded, leading to incomplete answers.
Long-Context LLM Limitations: While Large Language Models (LLMs) now support massive context windows (128K–2M tokens), empirical evidence suggests that reasoning quality degrades significantly as context length increases ("lost in the middle" phenomenon).
The Core Challenge: There is a need for a system that can perform exhaustive cross-document reasoning without relying on a single model to process millions of tokens at once, while maintaining scalability and cost-efficiency.
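
The top-$K$ limitation above can be illustrated with a small sketch. The relevance scores and document ids below are made up for illustration and do not come from the paper; the point is only that a global cutoff can drop whole documents, while a per-document cutoff cannot.

```python
from collections import defaultdict

# Hypothetical relevance scores: (document id, chunk id, score).
# Document "d1" happens to contain several high-scoring chunks.
scored_chunks = [
    ("d1", 0, 0.95), ("d1", 1, 0.93), ("d1", 2, 0.91),
    ("d2", 0, 0.90), ("d2", 1, 0.88),
    ("d3", 0, 0.80),
    ("d4", 0, 0.75),
]

def global_top_k(chunks, k):
    """Standard RAG: keep the k best chunks across the whole corpus."""
    return sorted(chunks, key=lambda c: c[2], reverse=True)[:k]

def per_document_top_k(chunks, k):
    """Per-document retrieval: keep the k best chunks inside each document."""
    by_doc = defaultdict(list)
    for chunk in chunks:
        by_doc[chunk[0]].append(chunk)
    return {doc: sorted(cs, key=lambda c: c[2], reverse=True)[:k]
            for doc, cs in by_doc.items()}

all_docs = {c[0] for c in scored_chunks}
covered_global = {c[0] for c in global_top_k(scored_chunks, k=5)}
covered_per_doc = set(per_document_top_k(scored_chunks, k=5))

print(sorted(all_docs - covered_global))   # documents dropped by global top-K
print(sorted(all_docs - covered_per_doc))  # documents dropped by per-document retrieval
```

In this toy corpus the global cutoff spends its whole budget on two documents, so any evidence in the remaining two never reaches the model.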

2. Methodology: SPD-RAG Architecture

SPD-RAG (Sub-agent Per Document RAG) introduces a hierarchical multi-agent framework that decomposes the problem along the document axis rather than the task axis. The architecture consists of three distinct layers:

A. Coordination Layer

A central Coordinator Agent receives the user query and decomposes it into a Shared Instruction Set (atomic extraction tasks) and Synthesis Directives.
It generates a structured WriteTodos object containing specific fields, entities, or numeric values to extract from each document, ensuring all agents work toward a unified goal.
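
As a rough illustration of what such a structured object might look like, here is a minimal sketch. Only the name WriteTodos comes from the summary above; the field names, the ExtractionTask helper, and the build_todos function are assumptions made for this example, and the coordinator LLM call is replaced by a fixed decomposition.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ExtractionTask:
    """One atomic item every sub-agent should look for in its document."""
    name: str          # a field, entity, or numeric value to extract
    description: str   # what counts as a match

@dataclass
class WriteTodos:
    """Hypothetical shape of the coordinator's output (field names assumed)."""
    query: str
    shared_instructions: list[ExtractionTask] = field(default_factory=list)
    synthesis_directives: list[str] = field(default_factory=list)

def build_todos(query: str) -> WriteTodos:
    # Stand-in for the coordinator LLM call: returns a fixed decomposition.
    return WriteTodos(
        query=query,
        shared_instructions=[
            ExtractionTask("company_name", "Name of the reporting company"),
            ExtractionTask("profit_margin", "Reported profit margin, with period"),
        ],
        synthesis_directives=["Rank companies by profit margin, highest first"],
    )

todos = build_todos("Which company has the highest profit margin?")
payload = json.dumps(asdict(todos), indent=2)  # the same payload goes to every sub-agent
print(payload)
```

Serializing one shared payload is one simple way to make sure every sub-agent extracts the same fields, which keeps the later synthesis step comparable across documents.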

B. Parallel Retrieval Layer

Document-Level Specialization: A dedicated Sub-Agent ($\alpha_i$) is assigned to each document ($d_i$) in the corpus.
Isolated Retrieval Universe: Each sub-agent operates strictly within its assigned document. It cannot access other documents, preventing cross-document distractors.
Iterative Reasoning: Agents perform an iterative "retrieve-and-reason" loop. They are instructed to perform at least 2 focused searches before concluding information is absent (capped at 5 search calls total).
Retrieval Mechanism: Uses dense vector retrieval (Cohere embed-v4.0) followed by re-ranking (Cohere rerank-v4.0-fast) to fetch the top 5 chunks per search.
Parallelism: All sub-agents execute concurrently via LangGraph's fan-out API.
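
The layer described above can be sketched in a few lines. This is a simplified stand-in, not the paper's implementation: a toy keyword-overlap search replaces the Cohere embedding and re-ranking calls, a fixed list of query reformulations replaces the agent's LLM reasoning, and a standard-library thread pool replaces LangGraph's fan-out. The function names and the status labels are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_SEARCHES = 2   # absence may only be concluded after this many searches
MAX_SEARCHES = 5   # hard cap on search calls per sub-agent
TOP_CHUNKS = 5     # chunks returned per search

def search(chunks, query, k=TOP_CHUNKS):
    """Toy keyword-overlap search; stands in for dense retrieval + re-ranking."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c.lower().split())), c) for c in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [c for score, c in ranked if score > 0][:k]

def run_sub_agent(doc_id, chunks, queries):
    """Retrieve-and-reason loop restricted to a single document's chunks."""
    findings, calls = [], 0
    for query in queries[:MAX_SEARCHES]:
        calls += 1
        findings.extend(h for h in search(chunks, query) if h not in findings)
        if findings:
            break  # evidence found, stop searching
    if findings:
        status = "found"
    elif calls >= MIN_SEARCHES:
        status = "absent"        # enough attempts to conclude absence
    else:
        status = "inconclusive"  # too few searches to conclude anything
    return {"doc_id": doc_id, "status": status,
            "search_calls": calls, "findings": findings}

def fan_out(corpus, queries):
    """Run one sub-agent per document concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(corpus))) as pool:
        futures = {doc_id: pool.submit(run_sub_agent, doc_id, chunks, queries)
                   for doc_id, chunks in corpus.items()}
        return {doc_id: f.result() for doc_id, f in futures.items()}

corpus = {
    "report_a": ["Profit margin rose to 12% in 2023.", "Headcount was flat."],
    "report_b": ["The study measured soil acidity.", "No financial data here."],
}
queries = ["profit margin", "net margin percentage", "operating income ratio"]
results = fan_out(corpus, queries)
for doc_id, result in results.items():
    print(doc_id, result["status"], result["search_calls"])
```

Because each call to run_sub_agent only ever sees its own document's chunks, a strong match in one document cannot crowd out weaker but still relevant evidence in another.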

C. Synthesis Layer

Recursive Map-Reduce: To handle massive corpora where the sum of all sub-agent findings exceeds the LLM context window, the system employs a token-bounded, similarity-ordered recursive synthesis.
Process:
1. Embedding & Similarity: Sub-agent findings are embedded, and a cosine similarity matrix is computed.
2. Agglomerative Clustering: Findings are clustered (UPGMA linkage) to group semantically similar summaries.
3. Batch Synthesis: Clusters are merged into batches (capped at 750k tokens). An LLM synthesizes each batch into a new summary.
4. Iteration: This process repeats until a single final summary remains.
Fallback: If the total findings fit within the context window in one go (as seen in the Loong benchmark evaluation), the recursion terminates after one step.
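
The synthesis steps above can be sketched end to end in plain Python. Everything here is a simplification under stated assumptions: a bag-of-words vector stands in for the embedding model, a whitespace word count stands in for a tokenizer, and summarize simply truncates instead of calling an LLM. The function names, the tiny cap, and the rule that each summary is held to half the cap are choices made for this example, not details from the paper.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; stands in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def upgma_order(texts):
    """Average-linkage (UPGMA) clustering; returns item indices ordered so
    that items merged early end up next to each other."""
    vecs = [embed(t) for t in texts]
    sim = [[cosine(a, b) for b in vecs] for a in vecs]
    clusters = [[i] for i in range(len(texts))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[x] for j in clusters[y]]
                avg = sum(sim[i][j] for i, j in pairs) / len(pairs)
                if best is None or avg > best[0]:
                    best = (avg, x, y)
        _, x, y = best
        merged = clusters.pop(y)      # y > x, so index x stays valid
        clusters[x].extend(merged)
    return clusters[0] if clusters else []

def count_tokens(text):
    return len(text.split())          # crude whitespace count, not a tokenizer

def summarize(texts, budget):
    """Stand-in for the synthesis LLM: join, then truncate to `budget` tokens."""
    return " ".join(" ".join(texts).split()[:budget])

def pack(texts, cap):
    """Greedily fill batches in the given order without exceeding `cap` tokens."""
    batches, current, used = [], [], 0
    for t in texts:
        n = count_tokens(t)
        if current and used + n > cap:
            batches.append(current)
            current, used = [], 0
        current.append(t)
        used += n
    if current:
        batches.append(current)
    return batches

def recursive_synthesis(findings, cap):
    items, rounds = list(findings), 0
    while len(items) > 1:
        ordered = [items[i] for i in upgma_order(items)]
        # Summaries are held to half the cap so any two fit in one batch,
        # which guarantees the item count shrinks on the following round.
        items = [summarize(batch, cap // 2) for batch in pack(ordered, cap)]
        rounds += 1
    return (items[0] if items else ""), rounds

findings = [
    "Company A reported a profit margin of 12 percent",
    "Company B reported a profit margin of 9 percent",
    "Paper X cites 40 references on soil acidity",
    "Paper Y cites 55 references on soil acidity",
]
summary, rounds = recursive_synthesis(findings, cap=20)
_, rounds_large_cap = recursive_synthesis(findings, cap=100)
print(rounds, rounds_large_cap)
```

With the small cap the findings do not fit in one batch, so the loop runs more than once; with the large cap everything fits and the loop stops after a single pass, mirroring the fallback case described above. The truncating summarize discards information that a real LLM summary would be expected to keep.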

3. Key Contributions

Novel Architecture: Proposes SPD-RAG, a hierarchical framework combining per-document agentic RAG with centralized fusion. It enables deep, exhaustive analysis of every document while maintaining parallel execution.
Cost-Efficient Specialization: By constraining retrieval to isolated document spaces, the system can use cheaper, faster models (Gemini 2.5 Flash) for document agents, reserving powerful models (Gemini 2.5 Pro) only for the coordinator and final synthesis.
Scalable Synthesis: Introduces a recursive, similarity-guided merging protocol that ensures the system can scale to corpora with thousands of documents without exceeding context limits.
Comprehensive Evaluation: Provides a rigorous evaluation on the Loong benchmark, demonstrating significant improvements over both standard RAG and Agentic RAG baselines.

4. Experimental Results

The system was evaluated on the Loong benchmark (102 instances: 40 academic papers, 62 financial reports) using GPT-5 as the judge.

Performance (Avg Score):
- SPD-RAG: 58.1
- Normal RAG: 33.0
- Agentic RAG: 32.8
- Full-Context Baseline (Oracle): 68.0
- Result: SPD-RAG outperforms standard RAG by ~25 points (76% relative gain) and achieves 85.4% of the Oracle's performance.
Task-Specific Gains:
- Clustering: +40.5 points over Normal RAG.
- Chain of Reasoning: +26.2 points over Agentic RAG.
- Academic Papers: Standard RAG and Agentic RAG never produced a fully correct answer (0% Perfect Rate), while SPD-RAG reached an Avg Score of 60.0.
Cost-Efficiency:
- SPD-RAG costs $0.103 per query.
- The Full-Context Baseline costs $0.273 per query.
- SPD-RAG achieves 85% of the quality at only 38% of the API cost.
- It offers a 2.25x improvement in cost-quality efficiency compared to the Oracle.
Latency: SPD-RAG has a higher latency (54.8s, versus roughly 40-45s for the baselines) due to the multi-step pipeline, but this is considered a reasonable trade-off for the accuracy gains.
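
The headline ratios in this section can be re-derived from the reported scores and costs. The sketch below uses only the numbers quoted above; the variable names are illustrative, and "cost-quality efficiency" is assumed here to mean score per dollar relative to the Full-Context Baseline.

```python
# Reported average scores and per-query API costs (USD), as quoted above.
spd_score, rag_score, oracle_score = 58.1, 33.0, 68.0
spd_cost, oracle_cost = 0.103, 0.273

absolute_gain = spd_score - rag_score           # points over Normal RAG
relative_gain = absolute_gain / rag_score       # gain as a fraction of Normal RAG
quality_ratio = spd_score / oracle_score        # share of Oracle quality
cost_ratio = spd_cost / oracle_cost             # share of Oracle cost
efficiency_ratio = quality_ratio / cost_ratio   # score per dollar vs. the Oracle

print(f"absolute gain:    {absolute_gain:.1f} points")
print(f"relative gain:    {relative_gain:.1%}")
print(f"quality vs oracle: {quality_ratio:.1%}")
print(f"cost vs oracle:    {cost_ratio:.1%}")
print(f"efficiency ratio:  {efficiency_ratio:.2f}x")
```

With unrounded inputs the efficiency ratio comes out near 2.26; the 2.25x figure quoted above is consistent with dividing the rounded percentages (about 85.4% by 38%) instead.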

5. Significance and Conclusion

SPD-RAG addresses the critical trade-off between coverage and reasoning quality in long-context multi-document QA.

Paradigm Shift: It suggests that "exhaustive per-document processing" outperforms "global top-K retrieval" on complex synthesis tasks and recovers most of the quality of "single-pass massive context" (85.4% of the Oracle score) at a fraction of the cost.
Practical Impact: The results suggest that using specialized, cheaper agents for document-level extraction, coordinated by a smarter central agent, can yield a scalable and cost-effective solution for enterprise-grade information retrieval.
Future Potential: While the current evaluation did not fully trigger the recursive synthesis loop (the combined sub-agent findings fit within the underlying model's 1M-token context), the architecture is explicitly designed to handle corpora with thousands of documents, which makes it a plausible foundation for future large-scale knowledge bases.

In summary, SPD-RAG makes the case that for complex, real-world queries, how information is processed (decomposed and specialized) can matter more than simply increasing the raw context window size of a single model.