Scheduling the Unschedulable: Taming Black-Box LLM… — Plain-Language Explanation

Original authors: Renzhong Yuan, Yijun Zeng, Xiaosong Gao, Linxi Yu, Haochun Liao, Han Wang

Published 2026-04-09

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are running a high-end, exclusive restaurant (the LLM API). You have a line of customers (requests) waiting to be served. Some customers just want a quick coffee (short prompts), while others want a 10-course tasting menu that takes hours to prepare (long prompts).

The problem? The kitchen is a "Black Box." You are the host, but you can't see inside the kitchen. You don't know how busy the chefs are, you can't tell them to stop cooking a steak to make a salad faster, and you can't peek at the order to see how long it will actually take.

For a long time, the only rule was: "First come, first served." This caused chaos. If a customer ordered the 10-course menu first, everyone behind them (even the coffee drinkers) had to wait hours. The restaurant got clogged, and the coffee drinkers left angry.

This paper introduces a new way to manage the line, called SageSched, based on a simple breakthrough: We can now guess, pretty accurately, how big the order is before the customer even sits down.

Here is how the paper solves the problem using three simple layers, explained with everyday analogies:

1. The Big Idea: "Semi-Clairvoyance"

Previously, the host had no idea what the customers wanted. Now, thanks to better prediction tools, the host can look at the order slip and say, "Ah, this looks like a quick coffee, but that one looks like a massive banquet."

The paper argues that even if your guess isn't 100% perfect (it's "coarse" or "semi-clairvoyant"), it's enough to make the line run smoothly. It's like knowing a truck is "big" even if you don't know its exact weight.

2. The Three-Layer Solution

The authors break the solution down into three distinct jobs, like a well-organized restaurant team:

Layer 1: The Hostess (Allocation)

The Job: Deciding which type of customer gets to sit at the table next.
The Old Way: Just let anyone sit down.
The New Way: The hostess uses a "Fair Ticket System" (Deficit Round Robin).

If the kitchen is busy, she makes sure the "Coffee" customers (short requests) get a seat every few minutes so they don't wait forever.
If the kitchen is empty, she lets the "Banquet" customers (long requests) sit down too so the kitchen doesn't sit idle.
Analogy: It's like a bouncer at a club who ensures the VIPs (short requests) get in quickly, but doesn't turn away the regulars (long requests) unless the club is absolutely packed.

Layer 2: The Waiter (Ordering)

The Job: Deciding which specific customer in the line gets served next.
The Old Way: Serve the person who has been waiting the longest.
The New Way: The waiter looks at the "size" of the order.

If two people are waiting, and one has a small order and one has a huge order, the waiter might serve the small one first to clear the line quickly.
Analogy: Imagine a checkout line at a grocery store. If the person with 10 items is behind the person with 100 items, the cashier (the waiter) might ask the 10-item person to go first to keep the line moving.

Layer 3: The Manager (Overload Control)

The Job: Deciding when to say "No" or "Come back later."
The Old Way: Let everyone in until the kitchen catches fire, then everyone waits forever.
The New Way: The manager has a "Cost Ladder."

If the kitchen is getting too full, the manager stops letting in the "Banquet" customers first. They might say, "Sorry, the kitchen is full. Please come back in an hour."
The manager never turns away the "Coffee" customers.
Analogy: Think of boarding a lifeboat that is nearly full. Nobody already aboard is thrown off; instead, the person with the heavy suitcase (the long request) is asked to wait on the dock for the next boat, while the people carrying just a backpack (short requests) keep boarding so they can get to safety quickly.

3. The Results: Why It Matters

The paper tested this system in a simulated environment with different types of crowds (some mostly coffee drinkers, some mostly banquet-goers).

The "Blind" Test: When the host didn't know the order sizes, the slowest "Coffee" customers (95th-percentile wait) waited up to 5.8 times longer than when the host had a rough size estimate.
The "Smart" Test: With the new system, the "Coffee" customers' waits stayed within tens of milliseconds of an ideal setup where they have a dedicated line, even when the restaurant was packed.
The "Fairness" Test: The system can be tuned. If you want to be super fair, you can let everyone wait a bit longer. If you want speed for the small orders, you can prioritize them. The system handles both without breaking.

The Bottom Line

This paper shows that you don't need to be a mind reader to run a busy LLM service. You just need a rough guess of how big the job is.

By splitting the problem into who gets in line, who goes first, and who gets turned away, the authors created a system that:

Keeps short, interactive chats (like asking "What's the weather?") fast.
Still lets big, long jobs finish eventually.
Prevents the whole system from crashing when too many people show up at once.

It's the difference between a chaotic, screaming line at a theme park and a well-organized FastPass system where everyone gets a ride, but the short lines stay short.

1. Problem Statement

The paper addresses the challenge of scheduling requests sent to black-box Large Language Model (LLM) APIs.

The Constraint: Clients have no visibility into the provider's internal state (batching, caching, GPU scheduling) and cannot preempt or reorder in-flight requests.
The Historical Barrier: Traditionally, clients could not schedule effectively because the "size" of a request (output token count) was unknown at submission time. Without a known workload unit, classical scheduling heuristics (like weighted fair queuing or size-based admission) were inapplicable.
The New Premise: Recent work (Gan et al., 2026) demonstrates that output token counts can be predicted with sufficient accuracy. This transforms the problem from "blind" to "semi-clairvoyant": clients can make decisions based on coarse priors of request cost (token count) even though the provider's internals remain hidden.
The Goal: Design a client-side scheduler that manages the trade-off between protecting short-request latency (interactive traffic), ensuring long-request completion, and maximizing "useful goodput" (throughput of requests that finish within their Service Level Objectives).
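
To make the "useful goodput" target concrete, here is a minimal sketch of how such a metric could be computed from client-side logs. The record type `Completed`, its field names, and the sample values are assumptions for illustration only; the paper's exact definition and measurement window may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Completed:
    """One request outcome. All field names here are hypothetical."""
    latency_s: Optional[float]  # None means the request never finished
    slo_s: float                # latency budget (SLO) for this request


def useful_goodput(outcomes: list[Completed], window_s: float) -> float:
    """Return SLO-meeting completions per second over an observation window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    met = sum(
        1 for o in outcomes
        if o.latency_s is not None and o.latency_s <= o.slo_s
    )
    return met / window_s


outcomes = [
    Completed(latency_s=0.4, slo_s=1.0),    # finished within its SLO: counts
    Completed(latency_s=2.5, slo_s=1.0),    # finished, but too late
    Completed(latency_s=None, slo_s=30.0),  # rejected or timed out
    Completed(latency_s=12.0, slo_s=30.0),  # finished within its SLO: counts
]
print(useful_goodput(outcomes, window_s=2.0))  # 1.0
```

Counting only on-time completions means a policy cannot look good simply by dropping work or by finishing everything late.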

2. Methodology: The Three-Layer Decomposition

The authors propose a novel three-layer client-side control plane that decomposes the scheduling problem into analytically separable concerns. This structure operates entirely on the client side, using only API feedback and token priors.

Layer 1: Allocation (Inter-Class Share)

Function: Decides which class of requests (e.g., short/interactive vs. heavy/batch) receives the next opportunity to send a request.
Mechanism: Uses Deficit Round Robin (DRR) with congestion-aware weight adjustment.
- Interactive traffic retains a protected share even under high load.
- Unused quota from an idle class is borrowed by backlogged peers (work-conserving).
- Weights are dynamically scaled based on congestion feedback to bias opportunities toward latency-sensitive work when stress is detected.
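
The allocation mechanism described above can be sketched as a small Deficit Round Robin loop. This is an illustrative sketch, not the authors' implementation: the class names (`short`, `heavy`), the quantum values, and the linear congestion boost in `effective_quantum` are all assumptions. Work conservation shows up as idle classes being skipped, so backlogged classes absorb the send opportunities.

```python
from collections import deque


class DeficitRoundRobin:
    """Toy DRR over request classes; costs are predicted token counts."""

    def __init__(self, quanta: dict[str, float]):
        self.base_quanta = dict(quanta)
        self.queues = {name: deque() for name in quanta}
        self.deficit = {name: 0.0 for name in quanta}

    def enqueue(self, cls: str, request_id: str, cost: float) -> None:
        self.queues[cls].append((request_id, cost))

    def effective_quantum(self, cls: str, congestion: float) -> float:
        # Hypothetical rule: under congestion (0.0 to 1.0), give the
        # latency-sensitive class a proportionally larger quantum.
        boost = 1.0 + congestion if cls == "short" else 1.0
        return self.base_quanta[cls] * boost

    def next_round(self, congestion: float = 0.0) -> list[str]:
        """Run one DRR round and return the request ids released."""
        released = []
        for cls, queue in self.queues.items():
            if not queue:
                # Idle class: it keeps no credit, and its turn is skipped.
                self.deficit[cls] = 0.0
                continue
            self.deficit[cls] += self.effective_quantum(cls, congestion)
            while queue and queue[0][1] <= self.deficit[cls]:
                request_id, cost = queue.popleft()
                self.deficit[cls] -= cost
                released.append(request_id)
            if not queue:
                self.deficit[cls] = 0.0
        return released


drr = DeficitRoundRobin({"short": 200.0, "heavy": 200.0})
for rid in ("s1", "s2", "s3"):
    drr.enqueue("short", rid, cost=100.0)
drr.enqueue("heavy", "h1", cost=500.0)

print(drr.next_round())  # ['s1', 's2']
print(drr.next_round())  # ['s3']
print(drr.next_round())  # ['h1']
```

The heavy request is not starved: its deficit accumulates across rounds until it covers the 500-token cost, while short requests keep flowing in the meantime.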

Layer 2: Ordering (Intra-Class Sequencing)

Function: Determines the order in which eligible requests within a specific class are released.
Mechanism: A feasible-set scoring rule for the "heavy" class to minimize head-of-line blocking.
- Score Formula: $\text{Score} = w_1 \cdot \frac{\text{wait}}{\text{cost}} - w_2 \cdot \frac{\text{size}}{\text{ref}} + w_3 \cdot \text{urgency}$.
- This prioritizes older jobs and smaller jobs while respecting deadline urgency, reducing the risk of a large job blocking smaller, time-sensitive ones.
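
As a rough illustration of the scoring rule, the sketch below evaluates it over a small feasible set and picks the highest-scoring request. The summary does not define the units of each term, so the interpretations used here (wait in seconds, cost and size as predicted tokens, urgency in the range 0 to 1), the default weights, and the reference size `ref_tokens` are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Pending:
    """A queued heavy-class request. Field names are hypothetical."""
    request_id: str
    wait_s: float       # time spent waiting so far
    cost_tokens: float  # predicted cost of serving the request
    size_tokens: float  # predicted output size
    urgency: float      # 0.0 (relaxed) to 1.0 (deadline imminent)


def score(p: Pending, w1: float = 1.0, w2: float = 1.0, w3: float = 1.0,
          ref_tokens: float = 1000.0) -> float:
    """Score = w1 * (wait / cost) - w2 * (size / ref) + w3 * urgency."""
    cost = max(p.cost_tokens, 1.0)  # guard against division by zero
    return w1 * (p.wait_s / cost) - w2 * (p.size_tokens / ref_tokens) + w3 * p.urgency


def pick_next(feasible: list[Pending]) -> Pending:
    """Release the highest-scoring request in the feasible set."""
    return max(feasible, key=score)


queue = [
    Pending("big", wait_s=10.0, cost_tokens=4000, size_tokens=4000, urgency=0.0),
    Pending("small", wait_s=2.0, cost_tokens=500, size_tokens=500, urgency=0.0),
    Pending("urgent_big", wait_s=1.0, cost_tokens=3000, size_tokens=3000, urgency=1.0),
]
print(pick_next(queue).request_id)  # small

# Raising the urgency weight lets a deadline-critical large job jump ahead.
print(max(queue, key=lambda p: score(p, w3=5.0)).request_id)  # urgent_big
```

The size penalty is what prevents head-of-line blocking here: the oldest request ("big") does not win automatically, because its wait is normalized by its cost.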

Layer 3: Overload Control (Admission/Rejection)

Function: Decides whether to admit, defer, or reject a request before it enters the black box.
Mechanism: A Cost Ladder based on a severity score derived from latency, queue pressure, and tail behavior.
- Thresholds: Progressive thresholds trigger specific actions (e.g., defer medium, reject xlong).
- Policy: Short requests are never rejected. Expensive (long/xlong) requests are the primary targets for deferral or rejection.
- Goal: Replace implicit timeout failures with explicit, objective-aligned shedding.
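
One way to picture the Cost Ladder is as a table of severity thresholds, each mapping cost buckets to an action. The sketch below assumes an equal-weight severity score over three normalized signals and invents the threshold values (0.5, 0.7, 0.9); only the bucket names and the rule that short requests are never rejected come from the description above.

```python
from enum import Enum


class Action(Enum):
    ADMIT = "admit"
    DEFER = "defer"
    REJECT = "reject"


def _clip(x: float) -> float:
    return min(max(x, 0.0), 1.0)


def severity(latency_ratio: float, queue_fill: float, tail_ratio: float) -> float:
    """Hypothetical severity: equal-weight mean of three signals in [0, 1]."""
    return (_clip(latency_ratio) + _clip(queue_fill) + _clip(tail_ratio)) / 3.0


# Rungs are checked from most to least severe. Thresholds are assumptions.
LADDER = [
    (0.9, {"xlong": Action.REJECT, "long": Action.REJECT, "medium": Action.DEFER}),
    (0.7, {"xlong": Action.REJECT, "long": Action.DEFER}),
    (0.5, {"xlong": Action.DEFER}),
]


def decide(bucket: str, sev: float) -> Action:
    """Map a cost bucket and current severity to an admission decision."""
    if bucket == "short":
        return Action.ADMIT  # short requests are never rejected or deferred here
    for threshold, actions in LADDER:
        if sev >= threshold:
            return actions.get(bucket, Action.ADMIT)
    return Action.ADMIT


sev = severity(latency_ratio=0.9, queue_fill=0.6, tail_ratio=0.9)  # about 0.8
for bucket in ("short", "medium", "long", "xlong"):
    print(bucket, decide(bucket, sev).value)
# short admit
# medium admit
# long defer
# xlong reject
```

Because the decision is made before a request is sent, an expensive request gets an explicit "defer" or "reject" instead of silently timing out inside the provider.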

3. Key Contributions

Formulation: The first systematic formulation of client-side LLM scheduling as a semi-clairvoyant arrival-shaping problem, leveraging output-length predictability.
Decomposition: A novel three-layer architecture (Allocation, Ordering, Overload) that isolates failure modes (starvation, head-of-line blocking, saturation) and allows independent tuning.
Controlled Evaluation: A rigorous evaluation using a congestion-aware mock provider across four regimes (Balanced/Heavy $\times$ Medium/High congestion) and five random seeds.
Information Ladder: An experiment indicating that coarse magnitude priors (not just class labels) are the critical threshold for effective control. Removing magnitude information inflates short-request P95 latency by up to 5.8×.
Policy Alternatives: Demonstration that the allocation layer can support different fairness objectives (Short-Priority vs. Fair Queuing) without altering the rest of the stack.
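
The three rungs of the Information Ladder can be pictured as three ways of estimating a request's cost before sending it. The sketch below is illustrative only: the enum `InfoLevel`, the per-bucket typical sizes, the default cost, and the rounding step are hypothetical values, not figures from the paper.

```python
from enum import Enum


class InfoLevel(Enum):
    BLIND = 0       # no size information at all
    CLASS_ONLY = 1  # bucket label, but no magnitude
    COARSE = 2      # rough predicted token count


# Hypothetical representative sizes per bucket and a blind default.
BUCKET_TYPICAL_TOKENS = {"short": 100, "medium": 500, "long": 2000, "xlong": 8000}
DEFAULT_TOKENS = 1000


def estimated_cost(bucket: str, predicted_tokens: int, level: InfoLevel,
                   step: int = 250) -> int:
    """Return the cost the scheduler would assume at a given information level."""
    if level is InfoLevel.BLIND:
        return DEFAULT_TOKENS
    if level is InfoLevel.CLASS_ONLY:
        return BUCKET_TYPICAL_TOKENS[bucket]
    # Coarse prior: round the prediction up to the nearest multiple of `step`.
    return max(step, -(-predicted_tokens // step) * step)


# Two requests in the same "long" bucket with very different sizes.
for tokens in (1200, 3900):
    print([estimated_cost("long", tokens, level) for level in InfoLevel])
# [1000, 2000, 1250]
# [1000, 2000, 4000]
```

Only the coarse level separates the two requests, which is the kind of distinction the scheduler needs to anticipate congestion within a class.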

4. Key Results

The evaluation was conducted using a mock provider calibrated against real production API latency data.

Performance under Congestion:
- The full stack (Final OLC) achieved 100% completion and 100% deadline satisfaction in balanced/high congestion regimes.
- It achieved a useful goodput of 4.2 ± 1.6 SLO-meeting requests/second.
- Short-request P95 latency remained within tens of milliseconds of the ideal "quota-tiered" isolation baseline.
The Value of Magnitude Priors (Information Ladder):
- No-Information (Blind): Short P95 latency increased by 5.8× compared to the coarse-prior setting in high-congestion balanced regimes.
- Class-Only vs. Coarse: While "Class-Only" (knowing the bucket but not the size) sometimes yielded higher raw throughput, it degraded short-request latency and pushed the system closer to saturation. Coarse priors allowed for better anticipation of congestion.
Fairness Trade-offs:
- Short-Priority (DRR): Improved short-request P90 by 27% over FIFO but added 116% overhead to long-request P90.
- Fair Queuing (Round-Robin): Improved short-request P90 by 32% over FIFO with only 17% overhead for long requests. This demonstrates the framework's flexibility in balancing fairness vs. latency.
Robustness to Noise:
- The system exhibits graceful degradation under predictor noise. Even with 60% multiplicative error in token predictions, metrics drifted smoothly rather than collapsing.
- The system remains actionable with "coarse" priors; exact oracle knowledge is not required.
Overload Shedding:
- The "Cost Ladder" policy (rejecting only the most expensive xlong requests) maximized useful goodput while maintaining full completion rates.
- Alternative policies (e.g., uniform rejection) led to lower goodput or dropped completion rates, suggesting that targeted shedding is the better choice in these experiments.
Real-World Validation:
- Replay of a ShareGPT distribution confirmed that the policy ordering holds under a real-world workload mix, supporting external validity beyond synthetic data.
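
To get a feel for the noise-robustness result, one can perturb token predictions and check how often a request still lands in the same cost bucket. The sketch below is not the paper's experiment: it assumes "multiplicative error" means scaling the true size by a factor drawn uniformly from [1 - e, 1 + e], and the bucket edges are hypothetical.

```python
import random

# Hypothetical bucket boundaries, as upper edges in output tokens.
BUCKET_EDGES = [(256, "short"), (1024, "medium"), (4096, "long")]


def bucket_of(tokens: int) -> str:
    for upper, name in BUCKET_EDGES:
        if tokens <= upper:
            return name
    return "xlong"


def noisy_prediction(true_tokens: int, rel_error: float, rng: random.Random) -> int:
    """Scale the true size by a uniform factor in [1 - rel_error, 1 + rel_error]."""
    factor = 1.0 + rng.uniform(-rel_error, rel_error)
    return max(1, round(true_tokens * factor))


def bucket_agreement(true_sizes: list[int], rel_error: float, seed: int = 0) -> float:
    """Fraction of requests whose noisy prediction keeps the correct bucket."""
    rng = random.Random(seed)
    same = sum(
        bucket_of(noisy_prediction(t, rel_error, rng)) == bucket_of(t)
        for t in true_sizes
    )
    return same / len(true_sizes)


sizes = [50, 120, 400, 700, 1500, 3000, 6000, 9000]
for err in (0.0, 0.3, 0.6):
    print(err, bucket_agreement(sizes, err))
```

Sizes far from a bucket edge keep their bucket even under large error, which is one intuition for why coarse decisions can degrade smoothly rather than collapse.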

5. Significance

This paper changes how clients can approach black-box LLM APIs:

From Reactive to Proactive: Instead of reacting to HTTP 429 (rate limit) errors, clients can proactively shape their arrival process based on predicted costs.
Client-Side Leverage: It shows that even without internal server access, clients can significantly improve system-wide efficiency and fairness by controlling when and what they send.
Practical Deployability: By demonstrating robustness to prediction noise and providing a modular architecture, the paper offers a blueprint for production-ready client-side schedulers that can coexist with various LLM providers.
Metric Alignment: It champions "Useful Goodput" (finished, SLO-compliant work) as the primary success metric, preventing the optimization of latency at the cost of dropping work.

In summary, the paper provides a robust, theoretically grounded, and empirically validated framework for "taming" black-box LLM inference, turning an unschedulable problem into a manageable semi-clairvoyant control task.

Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale