SageSched: Efficient LLM Scheduling Confronting Demand Uncertainty and Hybridity

SageSched is an efficient LLM scheduler that addresses demand uncertainty and hybrid resource requirements by integrating lightweight output-length prediction with a comprehensive cost model and an uncertainty-aware scheduling policy, achieving over a 28.7% efficiency improvement across diverse testbed experiments.

Zhenghao Gan, Yichen Bao, Yifei Liu, Chen Chen, Quan Chen, Minyi Guo

Published Tue, 10 Ma

Imagine you run a very busy, high-end bakery called "The LLM Bakery." Customers (users) walk in and ask for custom cakes (LLM responses). Some orders are for a tiny cupcake, while others are for a massive, 10-tier wedding cake.

The problem is, you don't know how big the cake will be until the baker finishes baking it. You also don't know if the baker will need more oven space (compute) or more counter space (memory) for a specific order.

This is exactly the challenge the paper SageSched solves for Large Language Models (LLMs). Here is the story of how they fixed the chaos, explained simply.

The Problem: The "Guessing Game" Bakery

In the past, bakeries (LLM servers) handled orders in two bad ways:

  1. The "First-Come, First-Served" Line: If a customer asks for a 10-tier cake, they get in line first. While they wait, 50 people asking for cupcakes are stuck behind them. Everyone waits forever. This is called Head-of-Line Blocking.
  2. The "Shortest Job First" Guess: Some smart bakeries tried to guess the cake size. They'd say, "That looks like a small order, let's do it first!" But their guesses were often wrong, and making them required heavy, costly model training. Plus, they only looked at the oven (compute) and forgot about the counter space (memory). If a "small" cake needed a huge counter, it would still clog up the kitchen.

The result? The bakery was slow, customers were angry, and the bakers were confused.

The Solution: Enter "SageSched"

The authors built a new manager for the bakery called SageSched. It uses three clever tricks to run the kitchen perfectly.

1. The "Memory Match" Trick (Predicting the Future)

Instead of trying to magically predict the future with a complex crystal ball (a heavy AI model), SageSched looks at the history.

  • The Analogy: If a customer walks in and says, "I want a chocolate cake with sprinkles," the manager doesn't guess. They look at their logbook and say, "Ah, last Tuesday, three people asked for exactly that. They all got 500 grams of cake. Let's assume this one will be similar."
  • Why it's better: It doesn't need to be retrained every time. It just matches the "flavor" (prompt) of the current request to past requests. It predicts a range of possibilities (e.g., "It's likely between 400 and 600 grams") rather than just one guess.
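The "look it up in the logbook" idea can be sketched in a few lines. This is an illustrative stand-in, not SageSched's actual implementation: real systems would match prompts with embeddings rather than word overlap, and the function names here are invented.

```python
# A minimal sketch of history-based output-length prediction: match the new
# prompt against logged prompts and return a RANGE of plausible lengths,
# not a single point estimate. Similarity via word overlap is a crude
# stand-in for a real embedding-based match.

def similarity(a, b):
    """Jaccard overlap between word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def predict_length_range(prompt, history, k=3):
    """Return (low, high) output-length bounds from the k most similar logs."""
    ranked = sorted(history, key=lambda h: similarity(prompt, h["prompt"]),
                    reverse=True)
    lengths = [h["output_tokens"] for h in ranked[:k]]
    return min(lengths), max(lengths)

history = [
    {"prompt": "write a haiku about spring", "output_tokens": 30},
    {"prompt": "write a haiku about winter", "output_tokens": 28},
    {"prompt": "explain quantum field theory in detail", "output_tokens": 900},
    {"prompt": "write a short haiku about rain", "output_tokens": 35},
]
low, high = predict_length_range("write a haiku about autumn", history)
print(low, high)  # -> 28 35: a narrow range near the past haiku requests
```

No retraining is needed: new requests simply grow the logbook, and the prediction is a range ("between 28 and 35 tokens") rather than one brittle guess.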

2. The "Full Kitchen" Cost (Computing the Real Price)

Old managers only counted how long the cake would take to bake (compute time). SageSched knows that baking isn't just about time; it's about space.

  • The Analogy: Imagine two orders:
    • Order A: A tiny cake that takes 1 minute but requires a giant, custom mold that takes up the whole counter.
    • Order B: A huge cake that takes 10 minutes but fits on a small tray.
    • If the kitchen is running out of counter space, Order A is actually the "expensive" one because it blocks everyone else, even though it's fast.
  • The Fix: SageSched calculates the "True Cost" by weighing both the baking time and the counter space needed. It knows when the oven is the bottleneck and when the counter is the bottleneck.
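A tiny sketch of such a two-resource cost. The weights and numbers below are illustrative assumptions, not SageSched's actual formula: the point is only that a request's "true cost" shifts with whichever resource is currently the bottleneck.

```python
# Hedged sketch of a two-resource cost model: weight a request's compute
# time and memory footprint by how contended each resource is right now
# (utilization in 0..1). All names and numbers are illustrative.

def request_cost(compute_secs, memory_gb, compute_util, memory_util):
    """Cost rises with whichever resource is currently tight."""
    return compute_util * compute_secs + memory_util * memory_gb

# Order A: fast but memory-hungry. Order B: slow but memory-light.
order_a = dict(compute_secs=1, memory_gb=40)
order_b = dict(compute_secs=10, memory_gb=4)

# Memory is the bottleneck (95% full); compute is mostly idle (20%).
cost_a = request_cost(**order_a, compute_util=0.2, memory_util=0.95)
cost_b = request_cost(**order_b, compute_util=0.2, memory_util=0.95)
print(cost_a, cost_b)  # -> 38.2 5.8: the "fast" order is now the costly one
```

When counter space (memory) is nearly full, the 1-second Order A scores far higher cost than the 10-second Order B; flip the utilizations and the ranking flips too.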

3. The "Fair Dice" Scheduler (Handling Uncertainty)

This is the secret sauce. Since the manager only has a probability (a range) of how big the cake will be, they can't just pick the "shortest" one. They need a strategy that handles the risk.

  • The Analogy: Imagine you have two customers.
    • Customer X: You are 90% sure they want a tiny cupcake, but there's a 10% chance they want a giant cake.
    • Customer Y: You are 50% sure they want a medium cake, and 50% sure they want a huge one.
    • Who do you serve first?
  • The Fix: SageSched uses a mathematical tool called the Gittins Index. Think of this as a "Fairness Dice." Rather than looking only at the average size, it weighs the chance an order finishes soon against the work you would sink into serving it, and picks the order with the best ratio. It prioritizes the order that is most likely to finish quickly and free up resources for everyone else, and it constantly re-rolls this "dice" as the cake is being baked, ensuring the line keeps moving efficiently.
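The Customer X vs. Customer Y example can be worked through with a simplified, discrete Gittins-style index. This is an illustrative formulation, not the paper's exact math: the index here is the best achievable ratio of "probability of finishing within the next chunk of work" to "expected work spent in that chunk."

```python
# Simplified discrete Gittins-index sketch (illustrative, not the paper's
# exact formulation). Each request carries a distribution over its total
# output length; higher index = safer bet to serve now.

def gittins_index(length_probs, served=0):
    """length_probs: {total_length: probability}; served: tokens done so far."""
    # Condition on having already survived `served` tokens.
    remaining = {L: p for L, p in length_probs.items() if L > served}
    total_p = sum(remaining.values())
    if total_p == 0:
        return float("inf")  # effectively finished; serve immediately
    best = 0.0
    for quantum in sorted(L - served for L in remaining):
        # Chance of finishing within this quantum of extra service...
        finish_p = sum(p for L, p in remaining.items()
                       if L - served <= quantum) / total_p
        # ...versus the expected work spent in it.
        expected_work = sum(p * min(L - served, quantum)
                            for L, p in remaining.items()) / total_p
        best = max(best, finish_p / expected_work)
    return best

# Customer X: 90% tiny (10 tokens), 10% giant (1000 tokens).
x = gittins_index({10: 0.9, 1000: 0.1})
# Customer Y: 50% medium (200 tokens), 50% huge (1000 tokens).
y = gittins_index({200: 0.5, 1000: 0.5})
print(x > y)  # -> True: serve Customer X first
```

Customer X wins even though they *might* be a giant order: a short quantum of service has a 90% chance of clearing them out for very little work, which is exactly the bet the index rewards. Re-running `gittins_index` with a larger `served` value is how the "dice" gets re-rolled as generation proceeds.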

The Result

When the researchers tested this new manager in a real kitchen (using powerful GPUs):

  • Customers waited over 28.7% less time for their final cake (Time-to-Last-Token).
  • The system handled heavy traffic without crashing.
  • It worked great even when the "guesses" weren't perfect, because it was designed to handle uncertainty.

In a Nutshell

SageSched is like a super-smart restaurant manager who:

  1. Remembers what similar customers ordered in the past to guess the size.
  2. Counts both the cooking time and the table space needed.
  3. Uses math to decide who to serve next, ensuring that even if the future is uncertain, the whole restaurant runs as fast as possible.

It turns a chaotic, guessing game into a smooth, efficient operation.