Robust Batch-Level Query Routing for Large Language Models under Cost and Capacity Constraints

This paper proposes a robust, batch-level query routing framework that jointly optimizes model assignment and instance allocation under cost and capacity constraints. It demonstrates significant improvements in accuracy and throughput over traditional per-query methods, particularly under adversarial batching conditions.

Jelena Markovic-Voronov, Kayhan Behdin, Yuanda Xu, Zhengze Zhou, Zhipeng Wang, Rahul Mazumder

Published 2026-03-31

Imagine you run a busy restaurant kitchen (a Large Language Model system) that has to serve thousands of customers (queries) every hour. You have a menu of different chefs (LLMs) available:

  • The "Star Chefs": Highly skilled, can handle complex dishes (hard questions), but they are expensive and slow.
  • The "Line Cooks": Fast, cheap, and great at simple dishes (easy questions), but they might burn a complex meal.

Your goal is to get the best food quality for your customers while staying within your daily budget and not overworking your kitchen staff (GPU resources).

The Old Way: The "One-by-One" Mistake

Previously, most restaurants used a "Per-Query" rule. Every time a customer ordered, a manager looked at the dish and decided: "Is this a simple salad? Give it to the Line Cook. Is it a complex soufflé? Give it to the Star Chef."

The Problem:
Imagine a sudden rush where 50 customers in a row all order the most expensive, complex soufflés.

  • The manager sends them all to the Star Chefs.
  • Result: The kitchen explodes! The Star Chefs are overwhelmed, the bill skyrockets, and the next 50 customers (who ordered simple salads) have to wait because the kitchen is backed up.
  • The old method couldn't see the "big picture" of the whole rush. It only looked at one order at a time.

The New Solution: The "Batch-Level" Manager

This paper proposes a smarter system called Robust Batch-Level Routing. Instead of looking at one order, the manager looks at the entire tray of 100 orders that just came in and plans the whole shift at once.

Here is how it works, broken down into three simple concepts:

1. The Group Plan (Batch-Level Optimization)

Instead of deciding chef assignments one by one, the manager looks at the whole group of 100 orders.

  • The Math: They use a smart calculator (Integer Linear Programming) to solve a puzzle: "How do we split these 100 orders between the Star Chefs and Line Cooks so that everyone gets good food, the total bill stays under $500, and no chef is overwhelmed?"
  • The Benefit: If 20 complex orders come in, the manager might say, "Okay, we'll send 15 to the Star Chefs and 5 to the Line Cooks (who will try their best), and we'll save the budget for the next group." This prevents the kitchen from crashing and keeps the costs steady.
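The "smart calculator" above can be sketched in a few lines. This is a toy stand-in, not the paper's actual formulation: the quality scores, costs, capacities, and the brute-force search are all illustrative assumptions (a real system would hand the same objective and constraints to an ILP solver rather than enumerate assignments).

```python
from itertools import product

# Illustrative inputs (assumed, not from the paper):
QUALITY = {            # predicted answer quality per (model, query type)
    "star_chef": {"easy": 0.95, "hard": 0.90},
    "line_cook": {"easy": 0.90, "hard": 0.40},
}
COST = {"star_chef": 10.0, "line_cook": 1.0}   # cost per query
CAPACITY = {"star_chef": 2, "line_cook": 4}    # max queries per model

def route_batch(batch, budget):
    """Pick the assignment of queries to models that maximizes total
    predicted quality, subject to the budget and capacity constraints.
    Brute force for tiny batches; an ILP solver plays this role at scale."""
    models = list(COST)
    best, best_quality = None, -1.0
    for assignment in product(models, repeat=len(batch)):
        if sum(COST[m] for m in assignment) > budget:
            continue  # violates the batch-level budget
        if any(assignment.count(m) > CAPACITY[m] for m in models):
            continue  # overloads a model's capacity
        quality = sum(QUALITY[m][q] for m, q in zip(assignment, batch))
        if quality > best_quality:
            best, best_quality = assignment, quality
    return best, best_quality

# Three soufflés and a salad, with only $22 to spend:
plan, score = route_batch(["hard", "hard", "hard", "easy"], budget=22.0)
```

With these numbers the planner sends two hard orders to the Star Chefs (their capacity limit) and lets the Line Cooks absorb the rest, exactly the "we'll send 15 to the Star Chefs and 5 to the Line Cooks" trade-off described above.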

2. The "Safety Net" (Robustness)

Sometimes, the manager's guess about how good a chef is might be wrong. Maybe the "Star Chef" is having a bad day, or the "Line Cook" is actually better than we thought.

  • The Risk: If the manager is too confident and sends a hard dish to a Line Cook who fails, the customer gets a burnt meal.
  • The Fix: The new system uses a "Worst-Case Scenario" approach. It assumes the chefs might perform slightly worse than expected. It plans the schedule based on the lowest likely performance.
  • The Analogy: It's like packing an umbrella even if the weather forecast says "sunny." If it does rain, you're safe. If it doesn't, you just carried a little extra weight. This ensures the system never fails catastrophically, even if the predictions are slightly off.
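One common way to implement this "umbrella" is to optimize against a pessimistic estimate: subtract an uncertainty margin from each predicted quality before planning. The function below is a minimal sketch of that idea (the parameter names and the specific lower-bound form are assumptions, not the paper's exact robust formulation):

```python
def robust_quality(predicted, uncertainty, kappa=1.0):
    """Pessimistic quality estimate: assume the chef performs up to
    `kappa` uncertainty-units worse than predicted (floored at 0).
    Routing then optimizes this lower bound instead of the raw prediction."""
    return max(0.0, predicted - kappa * uncertainty)

# A well-calibrated prediction barely moves; a shaky one is discounted hard.
confident_star = robust_quality(0.90, 0.02)   # ~0.88
shaky_line_cook = robust_quality(0.85, 0.30)  # ~0.55
```

The effect: a model whose skill we are unsure about looks worse on paper, so the planner only risks it on queries where even its worst-case performance is acceptable.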

3. The Kitchen Setup (Offline Instance Allocation)

Before the restaurant even opens for the day, the owner has to decide: "How many Star Chefs and Line Cooks should we hire today?"

  • The Old Way: Hire the same number of chefs every day, regardless of the menu.
  • The New Way: The owner looks at the expected menu for the week. If the week is full of complex dishes, they hire more Star Chefs. If it's mostly simple salads, they hire more Line Cooks.
  • The Benefit: This ensures the kitchen isn't full of expensive chefs standing around doing nothing, or cheap cooks who are too slow for the workload. It matches the resources to the actual demand.
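A simple version of this "hiring" step is to split a fixed instance budget in proportion to the instance-hours each model is forecast to need. This heuristic is an illustrative simplification (the paper solves allocation jointly with routing; the forecast and throughput figures below are made up):

```python
def allocate_instances(forecast, throughput, total_instances):
    """Split `total_instances` across models in proportion to the
    instance-hours each must serve: forecast load / per-instance throughput.
    Uses largest-remainder rounding so the allocation sums exactly."""
    demand = {m: load / throughput[m] for m, load in forecast.items()}
    total = sum(demand.values())
    raw = {m: total_instances * d / total for m, d in demand.items()}
    alloc = {m: int(v) for m, v in raw.items()}
    leftover = total_instances - sum(alloc.values())
    # hand remaining instances to the models with the largest fractional need
    for m, _ in sorted(raw.items(),
                       key=lambda kv: kv[1] - int(kv[1]),
                       reverse=True)[:leftover]:
        alloc[m] += 1
    return alloc

# 200 complex vs 800 simple queries expected; Star Chefs are 8x slower.
alloc = allocate_instances(
    forecast={"star_chef": 200, "line_cook": 800},
    throughput={"star_chef": 10, "line_cook": 80},  # queries/hour/instance
    total_instances=6,
)
```

Even though most queries are simple, the slow Star Chefs get the majority of instances here, because matching capacity to demand means counting instance-hours, not query counts.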

Why Does This Matter?

The paper tested this system on real-world data and found:

  • Better Quality: Customers got better food (higher accuracy) because chefs were matched to entire groups of orders rather than one order at a time.
  • Cheaper: The restaurant didn't overspend on the Star Chefs during simple rushes.
  • More Stable: Even when a "bad batch" of difficult orders arrived (Adversarial Batching), the system didn't crash or go over budget.

In a Nutshell

Think of this paper as teaching a restaurant manager to stop looking at individual orders and start planning the whole shift. By looking at the group, preparing for the worst-case scenario, and hiring the right number of staff beforehand, they serve better food, spend less money, and keep the kitchen running smoothly without chaos.