HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL

HEXGEN-FLOW is a hierarchical scheduling framework for agentic Text-to-SQL inference on heterogeneous GPU clusters. By accounting for multi-stage dependencies and strict latency requirements, it achieves significant improvements in tail latency and throughput over existing LLM serving systems.

You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, Binhang Yuan

Published 2026-03-10

Imagine you run a bustling, high-end restaurant where the chefs are not humans, but super-intelligent AI robots. Your customers (the users) don't just order a simple burger; they order complex, multi-course tasting menus that require the chefs to work together in a specific sequence.

This is the world of Text-to-SQL: turning a simple question like "Show me sales for last month" into a complex database command.

Here is the problem: In a typical restaurant, if a customer orders a steak, the kitchen just starts cooking it. But in this AI restaurant, the "steak" (the final answer) requires a 4-step process:

  1. Menu Check: The chef reads the ingredients list (Schema Linking).
  2. Drafting: The chef writes three different recipes (SQL Generation).
  3. Tasting & Fixing: The chef tries the recipes, burns one, fixes another, and tries again (Self-Correction).
  4. Final Review: A critic tastes the best one and picks the winner (Evaluation).
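The four steps above form a strict dependency chain: each stage can only start once the previous one finishes, which is exactly what makes naive scheduling painful. A minimal Python sketch of that chain (the stage names follow the pipeline above, but the data structure and scheduling loop are illustrative, not the paper's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    name: str
    depends_on: list = field(default_factory=list)

# The agentic Text-to-SQL workflow as a dependency chain.
PIPELINE = [
    Stage("schema_linking"),
    Stage("sql_generation", depends_on=["schema_linking"]),
    Stage("self_correction", depends_on=["sql_generation"]),
    Stage("evaluation", depends_on=["self_correction"]),
]

def runnable(done: set) -> list:
    """Stages whose dependencies have all finished."""
    return [s.name for s in PIPELINE
            if s.name not in done and all(d in done for d in s.depends_on)]

done, order = set(), []
while len(done) < len(PIPELINE):
    for name in runnable(done):
        order.append(name)
        done.add(name)

print(order)  # each stage had to wait for the one before it
```

Because every stage blocks the next, a delay anywhere in the chain ripples through to the final answer, which is why the scheduler below tracks deadlines per stage rather than per request.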

The Challenge:
The restaurant has a mix of super-fast ovens (powerful GPUs) and older, slower ovens (weaker GPUs).

  • The Old Way: The manager (the scheduler) just throws orders at ovens randomly or in a "first-come, first-served" line.
    • Result: A complex order gets stuck in a slow oven, while a simple order sits in a fast oven doing nothing. The customer waits too long, gets angry, and leaves (SLO violation).
  • The New Way: HEXGEN-FLOW, the smart two-level manager system introduced in this paper.

How HEXGEN-FLOW Works (The Analogy)

HEXGEN-FLOW acts like a super-organized, two-level traffic controller for your AI kitchen.

1. The Global Dispatcher (The Smart Host)

Instead of just handing out tickets in order, this host looks at two things before sending a task to a chef:

  • How heavy is the task? (Is it a simple salad or a 10-course meal?)
  • Which oven is free and fast enough?

The Metaphor: Imagine a heavy, slow-cooking stew. The Smart Host knows to send this to the Super Oven (A100 GPU) immediately, even if the Super Oven is slightly busy, because the stew needs that power. Meanwhile, a light salad gets sent to the Old Oven (A6000 GPU) so the Super Oven isn't wasted on simple tasks. This ensures no oven sits idle while another is overwhelmed.
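A rough sketch of that dispatch logic in Python, assuming a simple cost model where each GPU has a relative speed and a current backlog, and the dispatcher picks the earliest estimated finish time. The GPU names, speed numbers, and the finish-time heuristic are illustrative stand-ins, not the paper's actual policy:

```python
# Each GPU: relative speed (work units per second) and queued work (units).
gpus = {
    "A100":  {"speed": 3.0, "backlog": 4.0},   # fast oven, slightly busy
    "A6000": {"speed": 1.0, "backlog": 0.0},   # slow oven, idle
}

def dispatch(task_cost: float) -> str:
    """Send the task to whichever GPU would finish it soonest."""
    def eta(name):
        g = gpus[name]
        return (g["backlog"] + task_cost) / g["speed"]
    best = min(gpus, key=eta)
    gpus[best]["backlog"] += task_cost
    return best

print(dispatch(9.0))  # heavy "stew": A100 wins despite its backlog
print(dispatch(1.0))  # light "salad": A6000, keeping the A100 free
```

Note that the heavy task goes to the busy-but-fast GPU while the light task fills the idle slow one, which is the "no oven sits idle while another is overwhelmed" behavior from the metaphor.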

2. The Local Priority Queue (The Urgent Sous-Chef)

Once a task arrives at a specific oven, it doesn't just sit in a line. The oven has its own Urgency Meter.

  • The Old Way: "First in, first out." Even if the first person in line has a relaxed deadline, they get served before the person behind them who is about to explode with impatience.
  • The New Way: The Sous-Chef constantly checks the "Time Left" on every order.
    • If Order A has 10 minutes left and Order B has 1 minute left, Order B jumps the line, even if it arrived later.
    • If Order A finishes early, the system instantly recalculates the time left for the next steps, making them more urgent.

The Metaphor: Think of it like a hospital triage nurse. A patient with a broken toe (low urgency) waits behind a patient with a heart attack (high urgency), even if the toe patient arrived first. HEXGEN-FLOW ensures the "heart attacks" (requests about to miss their deadline) get treated immediately.
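This triage behavior is essentially earliest-deadline-first scheduling: always serve the request closest to missing its deadline. A minimal sketch using Python's `heapq` (the order names and deadlines are illustrative, not the paper's queue implementation):

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so equal deadlines never compare names oddly
queue = []

def submit(name: str, deadline: float):
    """Enqueue an order keyed by its absolute deadline (smaller = more urgent)."""
    heapq.heappush(queue, (deadline, next(_counter), name))

def next_order() -> str:
    """Serve whichever order is closest to missing its deadline."""
    _deadline, _, name = heapq.heappop(queue)
    return name

submit("order_A", deadline=10.0)  # arrived first, relaxed deadline
submit("order_B", deadline=1.0)   # arrived later, about to expire
print(next_order())  # order_B jumps the line
```

The "instant recalculation" from the bullet above would correspond to re-inserting a request's remaining stages with tightened deadlines whenever an earlier stage finishes ahead of schedule.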

3. The Self-Learning Coach (Alpha-Tuning)

The system has a built-in coach that watches the kitchen in real-time.

  • If the kitchen gets too chaotic, the coach asks: "Should we focus more on sending tasks to the fastest ovens, or balancing the load evenly?"
  • It runs tiny, invisible simulations in the background (like a coach watching game tape) to tweak the settings automatically. It learns that "Today, we need to prioritize speed," or "Today, we need to balance the load."
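One way to picture this in code is a toy tuner that replays recent traffic under a few candidate settings of a knob `alpha` (say, 0 = balance load evenly, 1 = always favor the fastest GPUs) and keeps whichever setting gives the lowest simulated tail latency. Everything here, including the stand-in cost model, is our own illustration of the idea, not the paper's tuner:

```python
def simulate_p99(alpha: float, trace: list[float]) -> float:
    """Stand-in simulator: pretend either extreme of alpha inflates tail
    latency and a mid-range setting balances the two failure modes."""
    return sum(trace) * (abs(alpha - 0.5) + 0.1)

def retune(trace: list[float],
           candidates=(0.0, 0.25, 0.5, 0.75, 1.0)) -> float:
    """Replay the recent request trace under each candidate alpha and
    keep the one with the lowest simulated tail latency."""
    return min(candidates, key=lambda a: simulate_p99(a, trace))

best = retune([1.0, 2.0, 5.0])  # recent per-request costs (arbitrary units)
print(best)  # → 0.5
```

The key design point is that the simulations are cheap and run in the background, so the knob can be re-tuned continuously as the workload shifts rather than being fixed offline.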

Why This Matters (The Results)

The paper tested this system against the current "best" methods (like vLLM or Ray). The results were like upgrading from a bicycle to a sports car:

  • Faster Service: The system cut the longest wait times (tail latency) by a factor of 1.4 to 1.5. That means the slowest customers are now served much faster.
  • More Customers: The system can handle 1.5 to 1.8 times more customers per hour (higher throughput) without crashing.
  • No More Angry Customers: It drastically reduced the number of times a customer had to wait too long and leave (SLO violations).

Summary

HEXGEN-FLOW is a smart scheduling system that treats AI requests like a complex, multi-step restaurant order. It doesn't just guess; it matches the right task to the right machine and prioritizes the most urgent tasks dynamically. It ensures that even in a chaotic kitchen with mixed-quality equipment, every customer gets their complex meal on time, every single time.