DocSage: An Information Structuring Agent for Multi-Doc Multi-Entity Question Answering

🧐 The Problem: The "Needle in a Haystack" Nightmare

Imagine you are a detective trying to solve a complex mystery. But instead of one notebook, you have 100 different notebooks scattered across a room. Each notebook contains a few clues, but the clues are messy, written in different handwriting, and sometimes contradict each other.

Your boss asks: "Who is the mastermind, and how much money did they steal?"

To answer this, you can't just read one page. You have to:

Find the right pages in 100 different books.
Connect the dots between a name in Book A, a bank account in Book B, and a date in Book C.
Ignore the irrelevant noise.

Current AI tools (like standard LLMs or RAG) struggle here.

Standard AI (a long-context LLM) tries to read all 100 books at once. It gets overwhelmed, forgets the details in the middle, and starts guessing. It's like trying to drink from a firehose.
Standard RAG (Retrieval-Augmented Generation) is like a librarian who only finds books based on keywords. If you ask about "money," it might bring you a book about "money" but miss the specific page about the theft. It's too "coarse-grained."
Graph-based AI tries to draw a giant map of connections. But with 100 messy books, the map becomes a tangled ball of yarn that takes forever to untangle.

🦉 The Solution: Meet DocSage

DocSage is a new AI agent designed specifically to solve this "Multi-Document, Multi-Entity" puzzle. Instead of just reading, it acts like a super-organized project manager who turns the messy pile of notebooks into a clean, structured filing system before trying to answer the question.

Think of DocSage as a Master Chef who doesn't just throw ingredients into a pot; they first chop, measure, and organize everything into labeled bowls.

DocSage works in three magical steps:

1. The "Active Detective" (Schema Discovery)

What it does: Before digging into the books, DocSage asks itself: "What exactly do I need to find the answer?"
The Analogy: Imagine you are looking for a specific person in a crowd. Instead of looking at everyone, you first decide: "I need to find someone wearing a red hat, holding a blue umbrella, and standing near the fountain."
How it works: DocSage reads a little bit, then asks itself clarifying questions like, "Wait, did I miss the connection between the CEO and the Bank?" It builds a custom "search map" (called a Schema) specifically for your question. It doesn't just guess; it actively hunts for the missing pieces of the puzzle.

2. The "Quality Control Inspector" (Structured Extraction)

What it does: It takes the messy text from the documents and turns it into neat, clean tables (like an Excel spreadsheet).
The Analogy: Imagine a factory line where workers pull parts out of a junk pile. Most workers just grab whatever looks right. DocSage has a Quality Control Inspector who checks every part.
- If a part says "Age: 180," the Inspector says, "That's impossible! Fix it."
- If a part says "Company: Apple" but the database doesn't have an Apple entry, the Inspector says, "Go back and find the real source."
The Result: The messy text is converted into a clean database where every row makes sense and connects logically. This cuts down on the "hallucinations" (made-up facts) that AI often produces.

3. The "SQL Detective" (Relational Reasoning)

What it does: Now that the data is in clean tables, DocSage doesn't "guess" the answer. It runs a precise database query (written in SQL, a standard language for querying databases) to join the tables and find the answer.
The Analogy: Instead of a detective wandering around a crime scene guessing, this is like a detective using a computer database to instantly link "Suspect A" to "Crime Scene B" and "Bank Account C."
The Benefit: Because the data is structured, the AI doesn't get confused by long texts. It can instantly see the path from A to B to C, even if they are in 50 different documents.

🏆 Why is DocSage a Game Changer?

The paper tested DocSage against the best AI models available (like GPT-4o) on two very hard tests:

MEBench: A test with many different people and relationships.
Loong: A test with extremely long documents (up to 250,000 tokens!).

The Results:

Accuracy: DocSage beat the competition by a huge margin (over 27 percentage points on MEBench).
Scalability: As the number of documents and entities grew, other AI models got confused and their scores dropped. DocSage stayed strong.
Reliability: It didn't just guess; it could show exactly where in the documents it found the answer (like a citation).

🚀 The Big Takeaway

DocSage proves that "Structure" is the secret sauce.

When humans face a complex problem, we don't just stare at the chaos; we draw a diagram, make a list, or build a spreadsheet. DocSage does the same thing for AI. By forcing the AI to organize the information first and check for errors before answering, it handles problems that current AI tools struggle with.

It turns the "Needle in a Haystack" problem into a "Find the Needle in a Neatly Organized Box" problem.

1. Problem Statement

The paper addresses Multi-document Multi-entity Question Answering (MDMEQA), a task where answers require synthesizing implicit logic across scattered entities in multiple unstructured documents (e.g., comparing corporate filings or analyzing clinical trial reports).

Limitations of Existing Approaches:

Standard RAG: Relies on vector similarity for coarse-grained retrieval, often missing critical facts needed for cross-document reasoning and failing to track entity relationships.
Graph-based RAG: Struggles to efficiently integrate fragmented, complex relationship networks as document counts scale; graph construction becomes computationally prohibitive.
Long-context LLMs: Suffer from "attention diffusion" and "contextual dilution," causing them to overlook critical details in lengthy texts.
Common Flaw: A lack of schema awareness. Existing methods do not dynamically organize scattered entities and relationships into a structured format tailored to the specific query, leading to disjointed evidence chains and inaccurate deductions.

2. Methodology: The DocSage Framework

DocSage is an end-to-end agentic framework that transforms unstructured documents into a dynamic, query-specific relational representation. It operates through three interdependent modules:

A. Interactive Schema Discovery Module

Goal: Dynamically infer a minimal, joinable relational schema ($S_q$) specific to the query without relying on pre-defined schemas.
Mechanism: Uses the ASK (Active Schema Discovery via Knowledge-seeking Queries) algorithm.
- Step 1: Generates an initial schema hypothesis ($S_q^{0}$) based on the query and a document sample.
- Step 2: Analyzes the schema against more documents to detect uncertainty signals (e.g., entity alignment conflicts, attribute anomalies, missing relationships).
- Step 3: Actively generates clarification questions to resolve ambiguities and iteratively updates the schema ($S_q^{k+1}$) based on retrieved evidence.
- Step 4: Terminates when the schema stabilizes, ensuring the framework focuses only on relevant information.
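
The four steps above can be read as a fixed-point loop. The following is a minimal sketch under stated assumptions, not the paper's implementation: the schema representation (table name to column set) and the three callables `propose_schema`, `detect_uncertainty`, and `refine` are hypothetical stand-ins for the LLM calls, and the toy versions below only illustrate the control flow.

```python
from typing import Callable, Dict, List, Set

# Assumed representation: a schema maps table names to sets of column names.
Schema = Dict[str, Set[str]]


def ask_loop(
    query: str,
    docs: List[str],
    propose_schema: Callable[[str, List[str]], Schema],
    detect_uncertainty: Callable[[Schema, List[str]], List[str]],
    refine: Callable[[Schema, str, List[str]], Schema],
    sample_size: int = 2,
    max_rounds: int = 5,
) -> Schema:
    """Iteratively refine a query-specific schema until it stabilizes."""
    # Step 1: initial hypothesis from the query and a small document sample.
    schema = propose_schema(query, docs[:sample_size])
    for _ in range(max_rounds):
        # Step 2: look for uncertainty signals against the wider collection.
        questions = detect_uncertainty(schema, docs)
        if not questions:
            break  # Step 4: no open questions, so the schema is stable.
        # Step 3: resolve each clarification question and update the schema.
        updated = schema
        for question in questions:
            updated = refine(updated, question, docs)
        if updated == schema:
            break  # Step 4: a full update round changed nothing.
        schema = updated
    return schema


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_propose(query: str, sample: List[str]) -> Schema:
    return {"person": {"name"}}


def toy_detect(schema: Schema, docs: List[str]) -> List[str]:
    # Flag a missing relationship: documents mention accounts, the schema does not.
    if any("account" in d for d in docs) and "account" not in schema:
        return ["How are people linked to accounts?"]
    return []


def toy_refine(schema: Schema, question: str, docs: List[str]) -> Schema:
    new_schema = {table: set(cols) for table, cols in schema.items()}
    new_schema["account"] = {"account_id", "owner_name"}
    return new_schema


docs = ["Alice is the CEO.", "Bob joined in 2020.", "account 42 belongs to Alice"]
final_schema = ask_loop("Who owns account 42?", docs, toy_propose, toy_detect, toy_refine)
```

In this toy run the initial sample misses the third document, the uncertainty check raises one question, and a single refinement round adds the `account` table before the loop terminates.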

B. Logic-Aware Structured Extraction Module

Goal: Populate the discovered schema with high-fidelity tuples from the documents.
Mechanism: Employs a CLEAR (Cross-record Logic Enforcement for Accuracy Reinforcement) correction mechanism.
- Level A (Confidence): Uses a LoRA-enhanced adapter to calibrate confidence scores for individual extractions.
- Level B (Consistency): Enforces cross-record logical constraints (e.g., functional dependencies, temporal constraints, foreign key integrity).
- Correction: If a tuple has low confidence or violates logic, the system triggers a re-extraction by a stronger LLM or a verification submodule to resolve conflicts.
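
A rough sketch of how the two levels could gate re-extraction is shown below. It rests on several assumptions: confidence scores are taken as already calibrated (the LoRA adapter is not reproduced), the `Extraction` type, the `flag_for_reextraction` function, the 0.7 threshold, and the 0-130 age range are all hypothetical, and only two simple checks are shown (a range check and foreign key integrity) rather than the full constraint set.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Extraction:
    table: str
    row: Dict[str, object]
    confidence: float  # assumed to be already calibrated (Level A)


def flag_for_reextraction(
    records: List[Extraction],
    min_confidence: float = 0.7,  # hypothetical threshold
) -> List[Tuple[int, str]]:
    """Return (record index, reason) pairs that should be re-extracted."""
    flagged: List[Tuple[int, str]] = []
    # Keys that the person.company foreign key is allowed to reference.
    company_names: Set[object] = {
        r.row["name"] for r in records if r.table == "company"
    }
    for i, rec in enumerate(records):
        # Level A: low-confidence extractions are always re-checked.
        if rec.confidence < min_confidence:
            flagged.append((i, "low confidence"))
            continue
        if rec.table == "person":
            age = rec.row.get("age")
            # Level B: a simple range constraint on one attribute.
            if isinstance(age, int) and not 0 <= age <= 130:
                flagged.append((i, "age out of range"))
            # Level B: foreign key integrity across records.
            elif rec.row.get("company") not in company_names:
                flagged.append((i, "unknown company"))
    return flagged


# Invented toy records, for illustration only.
records = [
    Extraction("company", {"name": "Acme"}, 0.95),
    Extraction("person", {"name": "Ann", "age": 180, "company": "Acme"}, 0.90),
    Extraction("person", {"name": "Bo", "age": 41, "company": "Apple"}, 0.88),
    Extraction("person", {"name": "Cy", "age": 35, "company": "Acme"}, 0.40),
    Extraction("person", {"name": "Di", "age": 29, "company": "Acme"}, 0.93),
]
flags = flag_for_reextraction(records)
```

Each flagged index would then be handed to the stronger LLM or the verification submodule; that step is left out here because its interface is not described.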

C. Schema-Guided Relational Reasoning Module

Goal: Perform multi-hop reasoning over the structured data to generate answers.
Mechanism:
- Query Compilation: The reasoning LLM compiles the natural language query into an optimized SQL query ($Q_{SQL}$) leveraging the explicit join keys and schema definitions.
- Execution & Traceback: The SQL query is executed on the structured database. The system traces the provenance of results back to original document chunks.
- Synthesis: The final answer is synthesized based on the structured result set and the complete evidence chain, ensuring verifiability.
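
To make the compile, execute, and trace-back steps concrete, here is a small sketch using Python's built-in `sqlite3` module. Everything in it is an illustrative assumption: the table names, the `chunk_id` provenance column, the sample rows, and the hand-written `q_sql` string (which stands in for what the reasoning LLM would generate) are invented and do not come from the paper.

```python
import sqlite3

# In-memory database standing in for the extracted, structured tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE executive (name TEXT, company TEXT, chunk_id TEXT);
    CREATE TABLE filing (company TEXT, year INTEGER, revenue_musd REAL, chunk_id TEXT);
    INSERT INTO executive VALUES
        ('Ann Lee', 'Acme', 'doc03#p2'),
        ('Raj Rao', 'Globex', 'doc17#p5');
    INSERT INTO filing VALUES
        ('Acme', 2023, 120.0, 'doc41#p1'),
        ('Globex', 2023, 310.5, 'doc08#p9');
    """
)

# Hypothetical compiled query for:
# "Which executive's company reported the highest revenue in 2023?"
q_sql = """
    SELECT e.name, f.company, f.revenue_musd,
           e.chunk_id AS executive_source,
           f.chunk_id AS filing_source
    FROM executive AS e
    JOIN filing AS f ON f.company = e.company
    WHERE f.year = ?
    ORDER BY f.revenue_musd DESC
    LIMIT 1
"""

# Execution: the join key (company) links facts from different documents.
row = conn.execute(q_sql, (2023,)).fetchone()
conn.close()

# Traceback and synthesis: keep the source chunk ids alongside the answer.
name, company, revenue, exec_src, filing_src = row
answer = f"{name} ({company}, {revenue} M USD); evidence: {exec_src}, {filing_src}"
```

Carrying a provenance column through the join is one simple way to keep every result row tied to the document chunks it came from; the paper's actual trace-back mechanism may differ.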

3. Key Contributions

Agentic Paradigm for Structure: Introduces a novel workflow that autonomously discovers schemas, structures unstructured text, and performs reasoning, moving beyond static retrieval.
Dynamic Schema Discovery: The ASK algorithm actively resolves schema ambiguity through an iterative dialogue process, ensuring the structure is tailored to the specific query logic.
Error-Guaranteed Extraction: The CLEAR mechanism quantifies extraction uncertainty and enforces logical consistency, significantly reducing the noise and hallucinations common in unstructured extraction.
SQL-Powered Reasoning: By converting reasoning tasks into deterministic database operations (SQL), DocSage mitigates attention diffusion and enables precise multi-hop entity alignment.

4. Experimental Results

DocSage was evaluated on two benchmarks: MEBench (multi-entity reasoning) and Loong (long-context reasoning).

Performance on MEBench:
- DocSage achieved an overall accuracy of 89.2%, outperforming the next best method (GPT-4o + RAG at 62.0%) by 27.2 percentage points.
- It showed consistent superiority across Comparison, Statistics, and Relationship question types.
- Robustness: While other models degraded significantly as the number of entities increased (from 0-10 to >100), DocSage's accuracy dropped only minimally (91.8% to 87.9%).
Performance on Loong:
- DocSage achieved the highest Perfect Rate (0.53), more than double that of the next best model (GPT-4o at 0.26).
- It excelled in "Spotlight Locating" and "Chain of Reasoning" tasks.
- Scalability: In the most challenging setting (200K-250K tokens), DocSage maintained a Perfect Rate of 0.47, whereas competitors dropped to 0.10 or below.
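
As a quick check, the headline margins follow directly from the figures reported above (differences are in percentage points; the last line is a ratio of Perfect Rates):

$$
\begin{aligned}
89.2 - 62.0 &= 27.2 && \text{(MEBench margin over GPT-4o + RAG)} \\
91.8 - 87.9 &= 3.9 && \text{(MEBench drop from 0-10 to >100 entities)} \\
0.53 / 0.26 &\approx 2.04 && \text{(Loong Perfect Rate vs. GPT-4o)}
\end{aligned}
$$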

5. Significance

The paper validates that structured data representation combined with an agentic design is a superior paradigm for MDMEQA. By addressing the fragmentation and schema scarcity of unstructured multi-document data, DocSage:

Enables precise fact localization via SQL indexing.
Facilitates seamless cross-document entity joins through relational tables.
Mitigates LLM attention diffusion, allowing for reliable reasoning even in massive, complex document collections.

This approach offers a viable solution for high-stakes domains (legal, financial, clinical) where accuracy and evidence traceability are critical.