Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale

This paper proposes a semi-clairvoyant, three-layer client-side scheduling framework for black-box LLM APIs that leverages coarse output token priors to optimize allocation, ordering, and overload control, achieving high completion rates, deadline satisfaction, and flexible fairness trade-offs even under significant prediction noise.

Original authors: Renzhong Yuan, Yijun Zeng, Xiaosong Gao, Linxi Yu, Haochun Liao, Han Wang

Published 2026-04-09

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are running a high-end, exclusive restaurant (the LLM API). You have a line of customers (requests) waiting to be served. Some customers just want a quick coffee (requests with short outputs), while others want a 10-course tasting menu that takes hours to prepare (requests with long outputs).

The problem? The kitchen is a "Black Box." You are the host, but you can't see inside the kitchen. You don't know how busy the chefs are, you can't tell them to stop cooking a steak to make a salad faster, and you can't peek at the order to see how long it will actually take.

For a long time, the only rule was: "First come, first served." This caused chaos. If a customer ordered the 10-course menu first, everyone behind them (even the coffee drinkers) had to wait hours. The restaurant got clogged, and the coffee drinkers left angry.

This paper introduces a new way to manage the line, called SageSched, based on a simple observation: We can now make a rough but useful guess at how big the order is before the customer even sits down.

Here is how the paper solves the problem using three simple layers, explained with everyday analogies:

1. The Big Idea: "Semi-Clairvoyance"

Previously, the host had no idea what the customers wanted. Now, thanks to better prediction tools, the host can look at the order slip and say, "Ah, this looks like a quick coffee, but that one looks like a massive banquet."

The paper argues that even if your guess isn't 100% perfect (it's "coarse" or "semi-clairvoyant"), it's enough to make the line run smoothly. It's like knowing a truck is "big" even if you don't know its exact weight.
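To make "coarse" concrete, here is a minimal Python sketch of turning a rough output-length guess into a size bucket. The function name `size_class` and the 256-token cutoff are assumptions for illustration, not values from the paper.

```python
def size_class(predicted_tokens: float, threshold: int = 256) -> str:
    """Map a rough output-length guess to a coarse bucket.

    The threshold is a hypothetical cutoff, not a value from the paper.
    """
    return "short" if predicted_tokens < threshold else "long"


# A guess that is 50% too high still lands in the same bucket
# as the true length, which is all the scheduler needs here.
true_tokens = 40
noisy_guess = true_tokens * 1.5
same_bucket = size_class(noisy_guess) == size_class(true_tokens)
```

The point of the sketch is that a bucket only changes when the error pushes a guess across the cutoff, so a wide range of prediction noise leaves the decision untouched.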

2. The Three-Layer Solution

The authors break the solution down into three distinct jobs, like a well-organized restaurant team:

Layer 1: The Hostess (Allocation)

The Job: Deciding which type of customer gets to sit at the table next.
The Old Way: Just let anyone sit down.
The New Way: The hostess uses a "Fair Ticket System" (Deficit Round Robin).

  • If the kitchen is busy, she makes sure the "Coffee" customers (short requests) get a seat every few minutes so they don't wait forever.
  • If the kitchen is empty, she lets the "Banquet" customers (long requests) sit down too so the kitchen doesn't sit idle.
  • Analogy: It's like a bouncer at a club who ensures the VIPs (short requests) get in quickly, but doesn't turn away the regulars (long requests) unless the club is absolutely packed.
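The "Fair Ticket System" can be sketched as textbook Deficit Round Robin over one queue per customer type. This is a minimal illustration, not the paper's implementation: the function name `drr_schedule`, the queue names, the quantum values, and the use of predicted output tokens as the cost are all assumptions.

```python
from collections import deque


def drr_schedule(queues, quanta, rounds):
    """Textbook Deficit Round Robin.

    queues: dict of class name -> deque of (request_id, predicted_tokens)
    quanta: dict of class name -> credit added per round (hypothetical values)
    Returns the order in which requests are released.
    """
    deficit = {name: 0 for name in queues}
    order = []
    for _ in range(rounds):
        for name, q in queues.items():
            if not q:
                deficit[name] = 0  # idle classes do not bank credit
                continue
            deficit[name] += quanta[name]
            # Release requests while the class can afford the head of its queue.
            while q and q[0][1] <= deficit[name]:
                req_id, cost = q.popleft()
                deficit[name] -= cost
                order.append(req_id)
            if not q:
                deficit[name] = 0
    return order


queues = {
    "short": deque([("s1", 50), ("s2", 50), ("s3", 50)]),
    "long": deque([("l1", 300), ("l2", 300)]),
}
released = drr_schedule(queues, {"short": 100, "long": 100}, rounds=3)
```

With equal quanta, the short class releases several requests per round while the long class saves up credit for its expensive head-of-line request. Once the short queue drains, the long class still gets served, so capacity is not left idle.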

Layer 2: The Waiter (Ordering)

The Job: Deciding which specific customer in the line gets served next.
The Old Way: Serve the person who has been waiting the longest.
The New Way: The waiter looks at the "size" of the order.

  • If two people are waiting, and one has a small order and one has a huge order, the waiter might serve the small one first to clear the line quickly.
  • Analogy: Imagine a checkout line at a grocery store. If the person with 10 items is behind the person with 100 items, the cashier (the waiter) might ask the 10-item person to go first to keep the line moving.
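One simple way to express "small orders first" is shortest-predicted-job-first with arrival time as the tie-breaker. The sketch below assumes that rule for illustration; the paper's actual ordering policy may weigh other factors such as deadlines, and the name `order_by_predicted_size` is hypothetical.

```python
import heapq


def order_by_predicted_size(requests):
    """Return request ids sorted by predicted size, then by arrival.

    requests: iterable of (request_id, arrival_time, predicted_tokens)
    """
    heap = [(pred, arrival, rid) for rid, arrival, pred in requests]
    heapq.heapify(heap)
    served = []
    while heap:
        _, _, rid = heapq.heappop(heap)
        served.append(rid)
    return served


# The 100-item shopper arrived first, but the 10-item shopper is served first.
line = [("big", 0, 100), ("small", 1, 10)]
served = order_by_predicted_size(line)
```

Breaking ties by arrival time keeps the behavior first-come, first-served among requests that look the same size.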

Layer 3: The Manager (Overload Control)

The Job: Deciding when to say "No" or "Come back later."
The Old Way: Let everyone in until the kitchen catches fire, then everyone waits forever.
The New Way: The manager has a "Cost Ladder."

  • If the kitchen is getting too full, the manager stops letting in the "Banquet" customers first. They might say, "Sorry, the kitchen is full. Please come back in an hour."
  • They never kick out the "Coffee" customers.
  • Analogy: Think of a lifeboat that is filling up. Nobody gets thrown overboard; instead, the person with the heavy suitcase (the long request) is asked to wait on the dock for the next boat, while the people with just a backpack (short requests) keep boarding so they can get to safety quickly.
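A "Cost Ladder" can be pictured as a set of load thresholds, one per request class, with the most expensive class cut off first. The sketch below is an assumption-heavy illustration: the class names, the threshold values, the `load` measure (a fraction between 0 and 1), and the function `admit` are all hypothetical and not taken from the paper.

```python
# Hypothetical ladder: the load level at which each class starts being deferred.
# "short" has no rung, so it is never turned away.
LADDER = {
    "xlong": 0.70,
    "long": 0.85,
}


def admit(size_class: str, load: float) -> str:
    """Return "admit" or "defer" for a request of the given class.

    load is an assumed utilization estimate in [0, 1].
    """
    threshold = LADDER.get(size_class)
    if threshold is not None and load >= threshold:
        return "defer"  # ask the caller to come back later
    return "admit"


decisions = {
    (cls, load): admit(cls, load)
    for cls in ("short", "long", "xlong")
    for load in (0.5, 0.75, 0.9)
}
```

As load climbs, the largest class is deferred first, then the next largest, while short requests keep getting through at every load level.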

3. The Results: Why It Matters

The paper tested this system in a simulated environment with different types of crowds (some mostly coffee drinkers, some mostly banquet-goers).

  • The "Blind" Test: When the host didn't know the order sizes, the "Coffee" customers waited 5.8 times longer than necessary.
  • The "Smart" Test: With the new system, the "Coffee" customers got their drinks almost instantly, even when the restaurant was packed.
  • The "Fairness" Test: The system can be tuned. If you want to be super fair, you can let everyone wait a bit longer. If you want speed for the small orders, you can prioritize them. The system handles both without breaking.

The Bottom Line

This paper argues, based on simulated workloads, that you don't need to be a mind reader to run a busy LLM service. You just need a rough guess of how big the job is.

By splitting the problem into who gets in line, who goes first, and who gets turned away, the authors created a system that:

  1. Keeps short, interactive chats (like asking "What's the weather?") fast.
  2. Still lets big, long jobs finish eventually.
  3. Prevents the whole system from crashing when too many people show up at once.

It's the difference between a chaotic, screaming line at a theme park and a well-organized FastPass system where everyone gets a ride, but the short lines stay short.
