Multi-LLM Query Optimization

Imagine you are a detective trying to solve a mystery, but you don't know who the culprit is. You have a team of different experts (let's call them "LLMs") to help you. For example:

Expert A is cheap to hire but sometimes gets confused between similar-looking suspects.
Expert B is very expensive but incredibly sharp at spotting specific details.
Expert C is great at identifying one type of criminal but useless for others.

Your goal is to figure out the one true culprit with near-perfect accuracy, but you have a limited budget. You can't just hire everyone a million times; that would bankrupt you. But if you don't ask enough questions, you might catch the wrong person.

This paper is a guide on how to spend your money wisely to get the right answer every time.

The Problem: The "Guessing Game" is Hard

The authors realized that figuring out the perfect mix of questions is incredibly difficult. It's like trying to solve a massive puzzle where every piece you add changes the shape of the whole picture.

They proved mathematically that finding the perfect plan is a nightmare (in technical terms, the problem is "NP-hard"). It's like trying to find the absolute shortest path through a maze that has billions of twists and turns. If you try to calculate the perfect answer by checking every possible plan, you'd be waiting until the sun burns out.

The Solution: The "Smart Shortcut"

Since finding the perfect answer is impractical for problems of any real size, the authors invented a smart shortcut (a "surrogate").

Think of it like this:
Instead of trying to predict exactly how the experts will argue and vote (which is messy and hard to calculate), they created a safety net formula.

The Pairwise Check: Instead of worrying about all the suspects at once, the formula breaks the problem down into pairs. It asks: "If the culprit is Suspect A, how likely is it that Expert B will mistakenly think it's Suspect C?"
The "Chernoff" Safety Net: They used a famous mathematical trick (Chernoff bounds) to create a "worst-case scenario" estimate. Imagine you are packing a parachute. You don't calculate the exact wind speed for every second of the fall; you calculate a safety margin that guarantees you won't hit the ground even if the wind is the worst it could possibly be.
The Magic Result: This safety net formula is simple. It turns a messy, impossible puzzle into a clean, solvable math problem. It tells you exactly how many times to ask Expert A and Expert B to stay safe.

Why This Shortcut is Amazing

You might think, "If it's a shortcut, isn't it less accurate?"

The authors proved that, when you need high accuracy, you lose almost nothing.

The "Almost Perfect" Guarantee: They showed that as you demand higher and higher accuracy (making your error tolerance tiny), the cost of using their "shortcut" gets closer and closer to the cost of the impractical "perfect" plan, until the ratio between the two approaches 1.
The Analogy: Imagine you are trying to hit a bullseye. The "perfect" plan is a supercomputer calculating the wind, humidity, and bullet weight to the nanometer. The "shortcut" is a seasoned archer who uses a simple rule of thumb. The paper proves that for a professional archer aiming for a tiny bullseye, the simple rule of thumb gets you just as close as the supercomputer, but much faster.

The Algorithm: The "Smart Shopper"

Finally, they built a shopping algorithm (called an AFPTAS).

Imagine you are at a grocery store with a list of items you must buy to survive.
Some items are expensive, some are cheap.
The algorithm is like a super-shopper who quickly scans the aisles, rounds the prices to make the math easy, and picks the absolute cheapest combination that still gets you all the food you need.
It guarantees that you will not spend more than 1% (or any tiny amount you choose) over the best price achievable under the safety-net formula.

The Big Picture

In the real world, companies use AI models to diagnose diseases, sort legal documents, or read customer reviews. Right now, they often just guess how many times to ask the AI, wasting money or risking mistakes.

This paper gives them a blueprint. It says:

"Don't guess. Don't try to solve the impossible puzzle. Use our safety-net formula. It will tell you the exact, cheapest way to ask your AI team questions so that you are 99.99% sure you get the right answer, no matter what the truth is."

It turns a chaotic, expensive guessing game into a precise, efficient, and affordable science.

1. Problem Definition

The paper addresses the offline query planning problem for systems deploying multiple heterogeneous Large Language Models (LLMs) to classify an unknown ground-truth label.

Context: Instead of relying on a single model, organizations query a collection of LLMs and aggregate their responses (e.g., via Majority Voting or MAP estimation) to improve reliability.
Objective: Determine the number of queries $r_m$ to allocate to each model $m$ so as to minimize the total cost $C(r) = \sum_{m=1}^K c_m r_m$ while satisfying statewise error constraints.
Constraints: The system must guarantee that the probability of misclassification is below a specific tolerance $\alpha_y$ for every possible ground-truth label $y$, not just on average.
Challenges:
1. Heterogeneity: Models have different costs ($c_m$) and varying discriminative power across different label pairs.
2. Intractability: The exact error probability $P_e(y; r)$ involves summing over all possible observation sequences, leading to exponential complexity.
3. Combinatorial Nature: The problem requires selecting integer query counts, making it a combinatorial optimization problem.
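To make the intractability concrete, here is a minimal brute-force sketch of the exact error and the plan cost. It is an illustration, not the paper's code: it assumes each model is described by a confusion matrix `conf[m][label][output]`, that queries are independent given the true label, that responses are aggregated by MAP, and that ties count as errors. The names `exact_error` and `plan_cost` are hypothetical.

```python
import itertools
import math

def exact_error(y, r, conf, prior):
    """Exact MAP error probability for true label y, by full enumeration.

    conf[m][label][o]: probability that model m outputs o when the truth is label
                       (assumed representation, not taken from the paper).
    r[m]: number of queries sent to model m.
    Ties between y and a rival label are counted as errors (an assumption).
    """
    n_labels = len(prior)
    # One slot per query: model index m repeated r[m] times.
    slots = [m for m, count in enumerate(r) for _ in range(count)]
    total = 0.0
    # The number of sequences grows exponentially with the number of queries.
    for seq in itertools.product(*(range(len(conf[m][0])) for m in slots)):
        # Joint weight prior(label) * P(seq | label) for every candidate label.
        joint = [
            prior[lab] * math.prod(conf[m][lab][o] for m, o in zip(slots, seq))
            for lab in range(n_labels)
        ]
        if any(joint[lab] >= joint[y] for lab in range(n_labels) if lab != y):
            total += joint[y] / prior[y]  # adds P(seq | Y = y)
    return total

def plan_cost(r, costs):
    """Total cost C(r) = sum_m c_m * r_m."""
    return sum(c * n for c, n in zip(costs, r))
```

With a single binary model that is right 80% of the time and a uniform prior, `exact_error(0, [3], [[[0.8, 0.2], [0.2, 0.8]]], [0.5, 0.5])` enumerates 8 sequences; each extra query doubles that count, which is the exponential blow-up described above.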

2. Methodology

A. Hardness Result (NP-Hardness)

The authors first establish that the exact query design problem is NP-hard.

Proof: They provide a polynomial-time reduction from the Minimum-Weight Set Cover problem.
Intuition: Ensuring every label is correctly classified requires selecting a "cover" of models capable of distinguishing every pair of labels. Since different models cover different subsets of label pairs with varying costs, finding the minimum cost set is equivalent to the set cover problem.

B. The Chernoff Surrogate Approach

To overcome intractability, the authors develop a tractable surrogate problem that replaces the exact error constraints with an analytically computable upper bound. The construction involves two main steps:

Union Bound Decomposition: The multi-class error event (MAP estimator $\neq$ true label) is decomposed into a union of pairwise comparison events (True Label vs. Competitor Label).
$P_e(y; r) \leq \sum_{y' \neq y} \Pr(\Delta_{y,y'}(r) \geq 0 \mid Y=y)$
where $\Delta_{y,y'}(r)$ is the log-likelihood difference between the competitor and the true label.
Chernoff Bounding: Each pairwise probability is bounded using a Chernoff-type exponential bound. This introduces a Chernoff affinity factor $M^{(y,y')}_{m}(s)$, which measures the statistical overlap between the distributions of labels $y$ and $y'$ under model $m$.
- The resulting surrogate bound $\bar{P}_e(y; r)$ is multiplicatively separable across models and query counts:
  $\bar{P}_e(y; r) = \sum_{y' \neq y} \min_{s \in [0,1]} \left( \frac{\pi(y')}{\pi(y)} \right)^s \prod_{m=1}^K \left( M^{(y,y')}_{m}(s) \right)^{r_m}$
- This separability transforms the problem into a form where constraints can be evaluated efficiently.
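The surrogate can be evaluated with a few nested loops, as the sketch below suggests. It rests on assumptions that the summary does not spell out: the affinity factor is taken to be the standard Chernoff coefficient $\sum_o p_y(o)^{1-s} p_{y'}(o)^s$ over model outputs $o$, each model is given as a confusion matrix `conf[m][label][output]`, and the minimization over $s$ is approximated on a uniform grid. Function names are hypothetical.

```python
def chernoff_factor(p_true, p_rival, s):
    """Assumed form of M(s): sum over outputs of p_true^(1-s) * p_rival^s."""
    return sum((a ** (1.0 - s)) * (b ** s) for a, b in zip(p_true, p_rival))

def surrogate_error(y, r, conf, prior, grid_size=101):
    """Chernoff surrogate for the error under true label y.

    Sums, over rival labels, the smallest value found on a grid of s in [0, 1]
    of (prior[rival] / prior[y])^s * prod_m M_m(s)^(r[m]).
    A grid minimum is never below the true minimum, so the result
    remains an upper bound on the exact surrogate value.
    """
    total = 0.0
    for rival in range(len(prior)):
        if rival == y:
            continue
        best = float("inf")
        for k in range(grid_size):
            s = k / (grid_size - 1)
            value = (prior[rival] / prior[y]) ** s
            # Separability: each model contributes one factor raised to r[m].
            for m, count in enumerate(r):
                value *= chernoff_factor(conf[m][y], conf[m][rival], s) ** count
            best = min(best, value)
        total += best
    return total
```

The cost of one evaluation is proportional to (labels) x (grid points) x (models) x (outputs), with no enumeration of observation sequences. For a single binary model with 80% accuracy, uniform prior, and three queries, the bound is minimized at $s = 0.5$ and equals $0.8^3 = 0.512$, which is looser than the exact error but computable at any scale.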

C. Asymptotic Tightness

The paper proves that solving the surrogate problem yields a plan whose cost is nearly identical to the true optimal cost in the high-reliability regime (where the smallest error tolerance $\alpha_{min} \to 0$).

Result: The ratio of the surrogate-optimal cost to the true optimal cost converges to 1.
Rate: The relative excess cost vanishes at a rate of $O\left(\frac{\log \log(1/\alpha_{min})}{\log(1/\alpha_{min})}\right)$.
Implication: The surrogate captures the correct first-order cost structure; the "gap" introduced by the bound is negligible compared to the total query budget required for high accuracy.
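To get a feel for how slowly or quickly this rate shrinks, the stated expression can simply be evaluated for a few tolerances. The sketch below does only that; the hidden constant in the $O(\cdot)$ is unknown, so the numbers indicate the trend and not the actual excess cost. The function name is hypothetical.

```python
import math

def excess_cost_rate(alpha_min):
    """Evaluate log(log(1/alpha)) / log(1/alpha), ignoring the O(.) constant."""
    log_inv = math.log(1.0 / alpha_min)
    return math.log(log_inv) / log_inv

# Tighter tolerances give a smaller relative gap (trend only).
rates = {alpha: excess_cost_rate(alpha) for alpha in (1e-2, 1e-6, 1e-12)}
```

The decay is logarithmic, so the gap closes gradually: moving from a tolerance of $10^{-2}$ to $10^{-12}$ reduces the expression by less than a factor of three.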

D. Approximation Algorithm (AFPTAS)

Since the surrogate problem still involves integer constraints and an inner minimization over the tilting parameter $s$, the authors design an Asymptotic Fully Polynomial-Time Approximation Scheme (AFPTAS).

Algorithm Steps:
1. Discretization: The continuous tilting parameters $s$ are discretized onto a finite grid.
2. Rounding: Discrimination weights derived from the Chernoff factors are rounded down to integers to ensure conservativeness.
3. Dynamic Programming: A DP (similar to the unbounded knapsack problem) is run for each grid point to find the minimum cost integer query plan.
Guarantee: The algorithm returns a query plan with cost within a factor of $(1 + \epsilon)$ of the surrogate optimum, with runtime polynomial in the number of models ($K$), $\log(1/\alpha_{min})$, and $1/\epsilon$.
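The rounding and dynamic programming steps can be sketched for a simplified case: a single label pair at one fixed grid value of $s$. Under that simplification (an assumption made here for illustration, not the paper's full algorithm), the surrogate constraint becomes a covering condition $\sum_m r_m w_m \geq T$, where $w_m = -\log M_m(s)$ is the discrimination weight of model $m$ and $T$ is the required total, roughly $\log(1/\alpha)$. The function name and the `step` parameter are hypothetical.

```python
import math

def min_cost_cover(weights, costs, target, step):
    """Cheapest integer counts r with sum(r[m] * weights[m]) >= target.

    Weights are divided by `step` and rounded DOWN, the target is rounded UP,
    so any plan feasible for the rounded problem is feasible for the original.
    Unbounded-knapsack-style DP over the rounded target.
    """
    units = [math.floor(w / step) for w in weights]  # conservative rounding
    need = math.ceil(target / step)
    if need <= 0:
        return 0.0, [0] * len(weights)
    if not any(u > 0 for u in units):
        raise ValueError("no model has a positive rounded weight; reduce step")
    best = [0.0] + [math.inf] * need   # best[t]: min cost to cover t units
    choice = [-1] * (need + 1)         # model added last to reach best[t]
    for t in range(1, need + 1):
        for m, (u, c) in enumerate(zip(units, costs)):
            if u <= 0:
                continue
            cand = c + best[max(0, t - u)]
            if cand < best[t]:
                best[t], choice[t] = cand, m
    # Walk the choices back to recover the query counts.
    plan = [0] * len(weights)
    t = need
    while t > 0:
        m = choice[t]
        plan[m] += 1
        t = max(0, t - units[m])
    return best[need], plan
```

For example, `min_cost_cover([1.0, 3.5], [1.0, 3.0], target=7.0, step=0.5)` compares a cheap, weak model against a pricier, stronger one. A smaller `step` loses less to rounding but enlarges the DP table, which is the accuracy versus runtime trade-off that $\epsilon$ controls in the scheme described above. Handling all label pairs and all grid points at once requires more machinery than this single-constraint sketch shows.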

3. Key Contributions

Formulation: A rigorous robust optimization framework for offline multi-LLM query planning that accounts for heterogeneous costs, state-dependent model performance, and worst-case (statewise) error guarantees.
Complexity Analysis: Proof of NP-hardness via reduction from Minimum-Weight Set Cover, establishing the theoretical difficulty of the exact problem.
Surrogate Construction: Development of a Chernoff-based surrogate bound that is:
- Feasibility-preserving: Any plan satisfying the surrogate satisfies the original constraints.
- Asymptotically Tight: The cost penalty of using the surrogate vanishes as reliability requirements increase.
- Separable: Allows for efficient computation and optimization.
Algorithmic Solution: An AFPTAS that solves the surrogate problem efficiently with provable approximation guarantees, making the approach practical for real-world deployment.

4. Results

Theoretical: The paper proves that the "Chernoff surrogate" is not just a heuristic relaxation but an asymptotically exact proxy for the true optimal cost in high-reliability settings.
Computational: The proposed AFPTAS avoids the exponential complexity of the exact error calculation, reducing the problem to a series of polynomial-time dynamic programming subproblems.
Practical: The framework provides a principled alternative to ad-hoc heuristics (like trial-and-error or equal allocation), enabling organizations to allocate budgets near-optimally across diverse LLMs while keeping provable error guarantees.

5. Significance

Bridging Theory and Practice: This work bridges the gap between theoretical ensemble learning and the practical constraints of LLM deployment (cost, latency, API fees).
Resource Efficiency: By optimizing query allocation, organizations can significantly reduce compute costs while maintaining strict reliability guarantees, which is critical for applications in healthcare, legal services, and e-commerce.
Generalizability: While focused on LLMs, the methodology applies to any setting involving the aggregation of heterogeneous, noisy classifiers with varying costs and reliability profiles.
Robustness: The focus on statewise constraints (guaranteeing performance for every label) rather than average performance is crucial for safety-critical applications where failure on a specific class (e.g., a rare disease diagnosis) is unacceptable.