HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL

HEXGEN-FLOW is a hierarchical scheduling framework for agentic Text-to-SQL inference on heterogeneous GPU clusters. By accounting for multi-stage dependencies and strict latency requirements, it achieves significant improvements in tail latency and throughput over existing LLM serving systems.

You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, Binhang Yuan

Published 2026-03-10

Imagine you run a bustling, high-end restaurant where the chefs are not humans, but super-intelligent AI robots. Your customers (the users) don't just order a simple burger; they order complex, multi-course tasting menus that require the chefs to work together in a specific sequence.

This is the world of Text-to-SQL: turning a simple question like "Show me sales for last month" into a complex database command.

Here is the problem: In a typical restaurant, if a customer orders a steak, the kitchen just starts cooking it. But in this AI restaurant, the "steak" (the final answer) requires a 4-step process:

  1. Menu Check: The chef reads the ingredients list (Schema Linking).
  2. Drafting: The chef writes three different recipes (SQL Generation).
  3. Tasting & Fixing: The chef tries the recipes, burns one, fixes another, and tries again (Self-Correction).
  4. Final Review: A critic tastes the best one and picks the winner (Evaluation).
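The four steps above form a strict dependency chain: each stage can only start once the previous one finishes, which is exactly what makes naive scheduling painful. A minimal Python sketch of that chain (the stage names follow the pipeline above, but the data structure and scheduling loop are illustrative, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    depends_on: list = field(default_factory=list)

# The agentic Text-to-SQL workflow as a dependency chain.
PIPELINE = [
    Stage("schema_linking"),
    Stage("sql_generation", depends_on=["schema_linking"]),
    Stage("self_correction", depends_on=["sql_generation"]),
    Stage("evaluation", depends_on=["self_correction"]),
]

def runnable(done: set) -> list:
    """Stages whose dependencies have all finished."""
    return [s.name for s in PIPELINE
            if s.name not in done and all(d in done for d in s.depends_on)]

done, order = set(), []
while len(done) < len(PIPELINE):
    for name in runnable(done):
        order.append(name)
        done.add(name)

print(order)  # each stage had to wait for the one before it
```

Because every stage blocks the next, a delay anywhere in the chain ripples through to the final answer, which is why the scheduler below tracks deadlines per stage rather than per request.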

The Challenge:
The restaurant has a mix of super-fast ovens (powerful GPUs) and older, slower ovens (weaker GPUs).

  • The Old Way: The manager (the scheduler) just throws orders at ovens randomly or in a "first-come, first-served" line.
    • Result: A complex order gets stuck in a slow oven, while a simple order sits in a fast oven doing nothing. The customer waits too long, gets angry, and leaves (SLO violation).
  • The New Way: HEXGEN-FLOW, the smart two-level manager system introduced in this paper.

How HEXGEN-FLOW Works (The Analogy)

HEXGEN-FLOW acts like a super-organized, two-level traffic controller for your AI kitchen.

1. The Global Dispatcher (The Smart Host)

Instead of just handing out tickets in order, this host looks at two things before sending a task to a chef:

  • How heavy is the task? (Is it a simple salad or a 10-course meal?)
  • Which oven is free and fast enough?

The Metaphor: Imagine a heavy, slow-cooking stew. The Smart Host knows to send this to the Super Oven (A100 GPU) immediately, even if the Super Oven is slightly busy, because the stew needs that power. Meanwhile, a light salad gets sent to the Old Oven (A6000 GPU) so the Super Oven isn't wasted on simple tasks. This ensures no oven sits idle while another is overwhelmed.
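A rough sketch of that dispatch logic in Python, assuming a simple cost model where each GPU has a relative speed and a current backlog, and the dispatcher picks the earliest estimated finish time. The GPU names, speed numbers, and the finish-time heuristic are illustrative stand-ins, not the paper's actual policy:

```python
# Each GPU: relative speed (work units per second) and queued work (units).
gpus = {
    "A100":  {"speed": 3.0, "backlog": 4.0},   # fast oven, slightly busy
    "A6000": {"speed": 1.0, "backlog": 0.0},   # slow oven, idle
}

def dispatch(task_cost: float) -> str:
    """Send the task to whichever GPU would finish it soonest."""
    def eta(name):
        g = gpus[name]
        return (g["backlog"] + task_cost) / g["speed"]
    best = min(gpus, key=eta)
    gpus[best]["backlog"] += task_cost
    return best

print(dispatch(9.0))  # heavy "stew": A100 wins despite its backlog
print(dispatch(1.0))  # light "salad": A6000, keeping the A100 free
```

Note that the heavy task goes to the busy-but-fast GPU while the light task fills the idle slow one, which is the "no oven sits idle while another is overwhelmed" behavior from the metaphor.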

2. The Local Priority Queue (The Urgent Sous-Chef)

Once a task arrives at a specific oven, it doesn't just sit in a line. The oven has its own Urgency Meter.

  • The Old Way: "First in, first out." Even if the first person in line has a relaxed deadline, they get served before the person behind them who is about to explode with impatience.
  • The New Way: The Sous-Chef constantly checks the "Time Left" on every order.
    • If Order A has 10 minutes left and Order B has 1 minute left, Order B jumps the line, even if it arrived later.
    • If Order A finishes early, the system instantly recalculates the time left for the next steps, making them more urgent.

The Metaphor: Think of it like a hospital triage nurse. A patient with a broken toe (low urgency) waits behind a patient with a heart attack (high urgency), even if the toe patient arrived first. HEXGEN-FLOW ensures the "heart attacks" (requests about to miss their deadline) get treated immediately.
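This triage behavior is essentially earliest-deadline-first scheduling: always serve the request closest to missing its deadline. A minimal sketch using Python's `heapq` (the order names and deadlines are illustrative, not the paper's queue implementation):

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so equal deadlines never compare names oddly
queue = []

def submit(name: str, deadline: float):
    """Enqueue an order keyed by its absolute deadline (smaller = more urgent)."""
    heapq.heappush(queue, (deadline, next(_counter), name))

def next_order() -> str:
    """Serve whichever order is closest to missing its deadline."""
    _deadline, _, name = heapq.heappop(queue)
    return name

submit("order_A", deadline=10.0)  # arrived first, relaxed deadline
submit("order_B", deadline=1.0)   # arrived later, about to expire
print(next_order())  # order_B jumps the line
```

The "instant recalculation" from the bullet above would correspond to re-inserting a request's remaining stages with tightened deadlines whenever an earlier stage finishes ahead of schedule.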

3. The Self-Learning Coach (Alpha-Tuning)

The system has a built-in coach that watches the kitchen in real-time.

  • If the kitchen gets too chaotic, the coach asks: "Should we focus more on sending tasks to the fastest ovens, or balancing the load evenly?"
  • It runs tiny, invisible simulations in the background (like a coach watching game tape) to tweak the settings automatically. It learns that "Today, we need to prioritize speed," or "Today, we need to balance the load."
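One way to picture this in code is a toy tuner that replays recent traffic under a few candidate settings of a knob `alpha` (say, 0 = balance load evenly, 1 = always favor the fastest GPUs) and keeps whichever setting gives the lowest simulated tail latency. Everything here, including the stand-in cost model, is our own illustration of the idea, not the paper's tuner:

```python
def simulate_p99(alpha: float, trace: list[float]) -> float:
    """Stand-in simulator: pretend either extreme of alpha inflates tail
    latency and a mid-range setting balances the two failure modes."""
    return sum(trace) * (abs(alpha - 0.5) + 0.1)

def retune(trace: list[float],
           candidates=(0.0, 0.25, 0.5, 0.75, 1.0)) -> float:
    """Replay the recent request trace under each candidate alpha and
    keep the one with the lowest simulated tail latency."""
    return min(candidates, key=lambda a: simulate_p99(a, trace))

best = retune([1.0, 2.0, 5.0])  # recent per-request costs (arbitrary units)
print(best)  # → 0.5
```

The key design point is that the simulations are cheap and run in the background, so the knob can be re-tuned continuously as the workload shifts rather than being fixed offline.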

Why This Matters (The Results)

The paper tested this system against the current "best" methods (like vLLM or Ray). The results were like upgrading from a bicycle to a sports car:

  • Faster Service: The system cut the longest wait times (tail latency) by a factor of 1.4 to 1.5. That means the slowest customers are now served much faster.
  • More Customers: The system can handle 1.5 to 1.8 times more customers per hour (higher throughput) without crashing.
  • No More Angry Customers: It drastically reduced the number of times a customer had to wait too long and leave (SLO violations).

Summary

HEXGEN-FLOW is a smart scheduling system that treats AI requests like a complex, multi-step restaurant order. It doesn't just guess; it matches the right task to the right machine and prioritizes the most urgent tasks dynamically. It ensures that even in a chaotic kitchen with mixed-quality equipment, every customer gets their complex meal on time, every single time.