ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Imagine you are a librarian in a massive library containing millions of books, articles, and reports. One day, a researcher walks in and asks a very specific, complex question: "Find me every document that talks about a new drug that cures a rare disease without side effects."

In the old days, you would have to pull every single book off the shelf, read the title and a few pages, and decide if it fits. This would take forever.

Now, imagine you have a super-smart, super-expensive genius librarian (let's call him The Oracle) who can read any document instantly and understand the deep meaning perfectly. But, The Oracle charges a fortune for every single book he reads. If you ask him to read 10,000 books, you go bankrupt.

ScaleDoc is a new system designed to solve this exact problem. It's like hiring a smart, cheap assistant to do the heavy lifting before The Oracle ever sees a book.

Here is how ScaleDoc works, broken down into simple steps:

1. The "Pre-Read" (Offline Phase)

Before the researcher even arrives, ScaleDoc takes a look at the entire library. It uses a smaller, much cheaper model than The Oracle to read every single document once and write a summary note (a "semantic embedding") on a card for each one.

The Analogy: Think of this as a librarian who reads every book in the library and writes a 3-word tag on the spine (e.g., "Medicine," "Politics," "Cooking"). This happens only once. Now, the library is organized, and we don't need the expensive Oracle to read the books again.

2. The "Smart Assistant" (Online Phase)

When the researcher finally asks their specific question, ScaleDoc doesn't call The Oracle yet. Instead, it hires a lightweight, cheap assistant (a small AI model) just for this specific question.

The Analogy: The assistant looks at the 3-word tags we wrote earlier. It's not a genius, but it's very fast and cheap. It quickly scans a shelf of 10,000 books and says, "Okay, I'm 99% sure these 5,000 books are NOT about the drug. And I'm 99% sure these 3,500 books ARE about the drug."

3. The "Filter" (The Cascade)

Here is the magic trick. The assistant only sends the confusing books to The Oracle.

The Analogy: Imagine the assistant puts the "definitely yes" books in a "Yes" pile and the "definitely no" books in a "No" pile. It leaves a small pile of "I'm not sure" books in the middle. It hands only that small middle pile to The Oracle.
The Result: In the best cases reported, The Oracle has to read only about 15% of the books instead of 100%. You save up to 85% of The Oracle's fees, and because the assistant was smart, you still meet your accuracy target.

4. How the Assistant Learns (The Secret Sauce)

The paper explains that an assistant trained the usual way tends to give wishy-washy scores, so too many books land in the "I'm not sure" pile. To fix this, ScaleDoc uses a special training method called Contrastive Learning.

The Analogy: Imagine you are teaching a dog to find a specific type of ball.
- Old way: You just say "Good dog" when it finds the ball. The dog might get confused between a red ball and a blue ball.
- ScaleDoc's way: You show the dog a red ball and say "This is the one!" Then you show a blue ball and say "This is NOT the one!" You do this over and over, pushing the "red ball" thoughts to one side of the dog's brain and the "blue ball" thoughts to the other side.
- The Result: The dog (the assistant) becomes incredibly good at separating the "Yes" books from the "No" books, leaving very few confusing ones for The Oracle to handle.

5. The "Safety Net" (Adaptive Calibration)

Sometimes, the researcher wants to be super sure (99% accuracy), and sometimes they just want a quick answer (90% accuracy). ScaleDoc has a built-in safety net.

The Analogy: Before sending the "confusing" pile to The Oracle, the system does a quick test run on a tiny sample of books. It asks, "If I send this many books to the Oracle, will I meet the researcher's accuracy goal?" If the answer is no, it adjusts the filter to be stricter. It's like a thermostat that automatically adjusts the heat to keep the room at the perfect temperature without wasting energy.

Why is this a big deal?

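Speed: It makes finding answers in huge libraries about 2 times faster on average than the alternatives tested.
Cost: It cuts the number of calls to the expensive AI by up to 85%.
Scalability: It is designed for collections ranging from thousands to millions of documents (the paper's experiments use 10,000 documents per dataset).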

In summary: ScaleDoc is a system that pre-organizes a massive library, uses a fast, cheap assistant to sort out the obvious answers, and only asks the expensive genius to solve the really hard, confusing cases. It gets you the right answer without breaking the bank.

1. Problem Statement

Modern data analysis increasingly relies on querying large collections of unstructured documents using semantic predicates (e.g., "Find papers that discuss novel psychotropic medications"). While Large Language Models (LLMs) possess powerful zero-shot capabilities to understand such semantics, applying them directly to millions of documents for every ad-hoc query is prohibitively expensive due to high inference costs and latency.

Existing solutions face two main limitations:

Traditional ML Proxies: Lightweight models (e.g., SVM, KDE) lack the zero-shot flexibility to handle diverse, new semantic queries without extensive retraining and labeling.
LLM-based Cascades: Using smaller LLMs (e.g., 7B parameters) as proxies is still computationally expensive for massive datasets and fails to bridge the capability gap with the "oracle" LLM (e.g., GPT-4o).

The core challenge is to execute semantic predicates over large document collections with high efficiency (minimizing LLM calls) while strictly meeting user-specified accuracy targets.

2. Methodology: ScaleDoc Architecture

ScaleDoc introduces a novel system that decouples predicate execution into two phases: an Offline Representation Phase and an Optimized Online Phase.

A. Offline Phase: Semantic Representation

One-time Computation: For each document in the collection, a small-scale LLM (e.g., Mistral-7B) generates a high-dimensional semantic embedding.
Storage: These embeddings are pre-computed and stored. This shifts the expensive LLM computation to an offline stage, allowing it to be reused across countless ad-hoc queries.
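
The offline phase can be pictured with a minimal sketch. Everything here is an assumption for illustration: `embed` is a hypothetical stand-in for the small-LLM encoder (it returns a deterministic toy vector so the example runs without a model), and `build_embedding_store` and `EMBED_DIM` are made-up names, not part of ScaleDoc.

```python
import hashlib
import math
import random

EMBED_DIM = 16  # toy size; a real encoder emits far more dimensions


def embed(text: str) -> list[float]:
    """Hypothetical stand-in for the small-LLM encoder: a deterministic unit vector."""
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def build_embedding_store(documents: dict[str, str]) -> dict[str, list[float]]:
    """Embed every document exactly once, keyed by document id."""
    return {doc_id: embed(text) for doc_id, text in documents.items()}


docs = {
    "d1": "trial of a new antidepressant",
    "d2": "annual budget report",
    "d3": "patent for a gearbox",
}
store = build_embedding_store(docs)  # computed once, reused by every later query
```

The point of the design is that `store` is built a single time; each new query then works only with the stored vectors and never re-runs the encoder.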

B. Online Phase: Query-Aware Proxy & Adaptive Cascade

When a new query arrives, ScaleDoc performs the following steps:

Sampling & Labeling: A small subset of documents (e.g., 5%) is sampled and labeled by the powerful Oracle LLM (e.g., GPT-4o) to create a ground-truth training set.
Proxy Training: A lightweight, query-specific proxy model (a 3-layer MLP) is trained on the pre-computed embeddings and the sampled labels.
Scoring & Filtering: The proxy assigns a decision score to every document.
- High-confidence documents (clearly positive or negative) are labeled directly by the proxy and never reach the Oracle.
- Ambiguous documents are forwarded to the Oracle LLM for final decision.
Adaptive Cascade: The system dynamically determines the filtering thresholds to ensure the final accuracy meets the user's target while minimizing Oracle invocations.
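
The scoring and filtering steps above can be sketched as a two-threshold cascade. This is a simplified illustration, not the paper's implementation: `run_cascade`, `low`, and `high` are hypothetical names, scores are assumed to lie in [0, 1], and the "oracle" is a toy lookup rather than an LLM call.

```python
def run_cascade(scores, low, high, oracle):
    """Label documents with the proxy where it is confident, else ask the oracle.

    scores: proxy decision scores, one per document.
    low, high: filtering thresholds (assumed names for the threshold pair).
    oracle: callable taking a document index and returning True or False.
    """
    labels = []
    oracle_calls = 0
    for idx, score in enumerate(scores):
        if score <= low:
            labels.append(False)          # confident negative, no oracle call
        elif score >= high:
            labels.append(True)           # confident positive, no oracle call
        else:
            labels.append(oracle(idx))    # ambiguous: pay for one oracle call
            oracle_calls += 1
    return labels, oracle_calls


# Toy usage: the "oracle" simply looks up known ground truth.
truth = [False, False, True, True, False, True]
proxy_scores = [0.02, 0.10, 0.95, 0.55, 0.45, 0.99]
labels, oracle_calls = run_cascade(proxy_scores, 0.2, 0.8, lambda i: truth[i])
```

In this toy run only the two mid-range documents reach the oracle; the other four are decided by the proxy alone.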

3. Key Innovations & Technical Contributions

1. Contrastive Learning for Reliable Proxy Scores

Standard training methods often yield ambiguous scores that fail to distinguish positive from negative cases, leading to excessive Oracle calls. ScaleDoc introduces a two-phase contrastive learning framework to shape the decision score distribution:

Phase 1 (Semantic Monotonicity): Uses a query-similarity loss ($L_{qsim}$) to pull positive document embeddings closer to the query anchor and push negatives away in the latent space.
Phase 2 (Enforcing Bipolarity): Uses two additional losses to create a clear separation:
- Supervised Contrastive Loss ($L_{supcon}$): Clusters documents of the same label together.
- Polar Loss ($L_{polar}$): A novel mechanism that selects "bellwether" samples (closest positive and furthest negative) to explicitly enforce a bipolar manifold, ensuring positive and negative scores cluster at opposite ends of the spectrum.
Result: This produces a smooth, monotonic, and bipolar score distribution, enabling effective threshold-based filtering.
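
To make the clustering idea concrete, here is a sketch of the standard supervised contrastive loss from the general literature. It is an assumption that the paper's $L_{supcon}$ matches this textbook form; the query-similarity and polar losses are not shown because this summary does not give their formulas. The function names and the `temperature` default are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def supcon_loss(embeddings, labels, temperature=0.1):
    """Textbook supervised contrastive loss; embeddings are assumed unit-length."""
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Temperature-scaled similarity of the anchor to every other sample.
        sims = {a: dot(embeddings[i], embeddings[a]) / temperature
                for a in range(n) if a != i}
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability assigned to the same-label samples.
        total -= sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


points = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
clustered = supcon_loss(points, [1, 1, 0, 0])  # same-label points coincide
scattered = supcon_loss(points, [1, 0, 1, 0])  # same-label points are orthogonal
```

The loss is near zero when same-label embeddings sit together and large when they are spread apart, which is the pressure that pushes positives and negatives into separate clusters.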

2. Ad Hoc Calibration & Threshold Selection

Since the distribution of scores for a new query is unknown, ScaleDoc uses a robust calibration workflow:

Stratified Sampling: Instead of random sampling, it discretizes the score range into bins and samples proportionally to preserve low-density regions.
Distribution Reconstruction: It reconstructs continuous Probability Density Functions (PDFs) for positive and negative classes using Jittering (to recover information in empty bins), Linear Interpolation (for density estimation), and Moving Average (for smoothing).
Optimized Threshold Selection: An algorithm traces the Pareto frontier of feasible thresholds to find the pair $(l, r)$ that minimizes the unfiltered rate (Oracle calls) while satisfying the accuracy constraint $\alpha$.
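
The threshold-selection objective can be illustrated with a brute-force search over a small labeled sample. This is a deliberate simplification: the paper works on reconstructed score distributions and traces a Pareto frontier, whereas this sketch tries every candidate pair on the raw sample, uses plain accuracy as the metric, and assumes the oracle is always right. `select_thresholds` is a hypothetical name.

```python
def select_thresholds(scores, labels, alpha):
    """Exhaustively search (l, r); return (unfiltered_rate, l, r).

    Documents with score <= l are predicted negative, those with score >= r
    are predicted positive, and the rest go to the oracle (assumed correct).
    """
    n = len(scores)
    # Infinite sentinels allow "filter nothing" on either side.
    candidates = [float("-inf")] + sorted(set(scores)) + [float("inf")]
    best = (1.0, float("-inf"), float("inf"))  # fallback: oracle sees everything
    for i, l in enumerate(candidates):
        for r in candidates[i + 1:]:
            correct = unfiltered = 0
            for s, y in zip(scores, labels):
                if s <= l:
                    correct += 0 if y else 1
                elif s >= r:
                    correct += 1 if y else 0
                else:
                    unfiltered += 1
                    correct += 1
            if correct / n >= alpha and unfiltered / n < best[0]:
                best = (unfiltered / n, l, r)
    return best


sample_scores = [0.05, 0.1, 0.2, 0.45, 0.5, 0.55, 0.8, 0.9, 0.95, 0.97]
sample_labels = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
strict = select_thresholds(sample_scores, sample_labels, alpha=1.0)
relaxed = select_thresholds(sample_scores, sample_labels, alpha=0.9)
```

With a perfect-accuracy target, the two misordered documents near the middle must go to the oracle; relaxing the target to 0.9 lets the proxy decide everything at the cost of one error. That is the cost and accuracy trade-off the calibration step navigates.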

3. Theoretical Guarantees

The paper provides a theoretical proof (using Bernstein's inequality) that the accuracy achieved on the sampled subset generalizes to the full dataset with high confidence, provided the score variance is low (which the contrastive learning ensures).
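
For reference, the generic textbook form of Bernstein's inequality is given below. The paper's exact theorem and constants are not reproduced in this summary, so the notation is an assumption: $X_1, \dots, X_n$ are independent with mean $\mu$, variance $\sigma^2$, and $|X_i - \mu| \le M$ almost surely.

$$
\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \ge \varepsilon\right) \le 2\exp\left(-\frac{n\varepsilon^2}{2\sigma^2 + \tfrac{2}{3}M\varepsilon}\right)
$$

For a fixed sample size $n$, a smaller variance $\sigma^2$ makes the right-hand side smaller, which is the intuition behind the low-variance condition: the tighter the scores, the more reliably the sampled accuracy reflects the full dataset.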

4. Experimental Results

ScaleDoc was evaluated on three real-world datasets (PubMed, BigPatent, GovReport) with 10,000 documents each and 20 diverse queries per dataset.

Performance Speedup: Achieved an average 2× end-to-end speedup compared to baselines.
Cost Reduction: Reduced expensive Oracle LLM invocations by up to 85% (approx. 6.6× cost saving).
Accuracy: Consistently met user-specified accuracy targets (e.g., 90% F1 score) across varying data selectivities and complex query types (implicit reasoning, quantitative analysis).
Comparison: Outperformed baselines including:
- Oracle Only: Direct LLM processing (slowest, most expensive).
- Probabilistic Predicates (PPs): Traditional ML models (poor zero-shot performance).
- LLM Cascades (FrugalGPT, LOTUS, BARGAIN): Using smaller LLMs as proxies (still too expensive or less accurate).
- Direct Embedding Matching: Static similarity scores (less reliable than query-aware training).

5. Significance

Scalability: ScaleDoc makes large-scale semantic analysis practical by decoupling the heavy lifting of LLM inference from the online query path.
Generalizability: Unlike traditional ML which requires task-specific engineering, ScaleDoc handles ad-hoc queries with zero-shot flexibility via its lightweight, query-aware training mechanism.
Efficiency: By optimizing the proxy training and cascade logic, it achieves a superior trade-off between cost and accuracy, enabling the use of powerful LLMs in production systems with millions of documents without prohibitive costs.

In summary, ScaleDoc transforms the problem of semantic filtering from a brute-force LLM application into a system-level optimization problem, leveraging offline representation and adaptive online learning to achieve massive efficiency gains.