Imagine you run a massive, high-tech coffee shop that serves millions of customers every day. This isn't just any coffee shop; it's an AI "inference" shop. Instead of baristas, you have powerful GPU computers (the "baristas") that brew complex "cups of coffee" (AI answers) for different customers (tenants).
The problem? Not every cup of coffee takes the same amount of work.
- Sometimes a customer just wants a quick espresso (a short question).
- Sometimes they want a 10-hour latte art masterpiece (a long, complex reasoning task).
- Sometimes 500 people order at once (a traffic burst).
In the past, managing this shop was a mess. Here is the problem and the solution proposed in the paper, explained simply.
The Old Way: The Broken Queue System
Previously, the shop tried to manage demand in two bad ways:
- The "Private Booth" Approach: You gave every customer their own private booth and their own barista.
  - The Problem: If a customer leaves for lunch, their barista sits there doing nothing, wasting money. If a new customer arrives, you have to build a whole new booth. It's incredibly inefficient.
- The "Ten Cups Per Minute" Approach: You put up a sign saying, "Everyone gets 10 cups per minute."
  - The Problem: This treats a quick espresso the same as a 10-hour latte. If someone orders a massive latte, it ties up the barista for hours, and everyone else waits in line. The shop gets clogged, and the "quick" customers suffer because of the "big" ones.
The New Solution: "Token Pools"
The author, William Cunningham, proposes a new system called Token Pools. Think of this as a smart, dynamic currency system for the coffee shop.
Instead of counting "requests" (cups), the system counts "Tokens" (the actual effort and resources required to make the coffee).
1. The Three Currencies
The system realizes that running an AI model costs three different things, and it tracks all of them:
- Speed (Tokens/Second): How fast the barista can work.
- Memory (KV Cache): How much counter space is needed to hold the ingredients while making the drink. (A long latte needs a huge counter; a quick espresso needs a tiny one).
- Concurrency: How many drinks can be made at the exact same time.
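A pool that tracks all three budgets at once might look like the sketch below. The field names (`tokens_per_sec`, `kv_cache_bytes`, `max_concurrency`) are my illustrative choices, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TokenBudget:
    """One slice of the three tracked resources (illustrative field names)."""
    tokens_per_sec: float   # speed: how fast output tokens may be generated
    kv_cache_bytes: int     # memory: KV-cache "counter space" for in-flight work
    max_concurrency: int    # concurrency: simultaneous in-flight requests

    def fits(self, demand: "TokenBudget") -> bool:
        """True only if the demand fits inside this budget on all three axes."""
        return (demand.tokens_per_sec <= self.tokens_per_sec
                and demand.kv_cache_bytes <= self.kv_cache_bytes
                and demand.max_concurrency <= self.max_concurrency)

# A long reasoning task needs little speed but a huge counter (KV cache);
# checking all three axes catches what a request-per-minute limit misses.
pool = TokenBudget(tokens_per_sec=5000.0, kv_cache_bytes=8 << 30, max_concurrency=64)
long_latte = TokenBudget(tokens_per_sec=200.0, kv_cache_bytes=6 << 30, max_concurrency=1)
print(pool.fits(long_latte))  # True
```

Note that a request that is cheap on one axis can still be rejected on another, which is exactly the failure mode the "cups per minute" sign could not express.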
2. The "Entitlement" (Your VIP Pass)
Every customer gets a VIP Pass (an Entitlement). This pass doesn't just say "You can order 10 times." It says:
- "You are guaranteed enough speed, counter space, and barista time to make X drinks per second."
- It also tells the system who you are: Are you a VIP (Guaranteed), a Regular (Elastic), or a Walk-in (Spot)?
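One plausible shape for such a pass, carrying both the resource floors and the service class (field names are my guesses rather than the paper's API):

```python
from dataclasses import dataclass

# Hypothetical entitlement record: who you are, plus guaranteed floors
# on all three resources from the section above.
@dataclass(frozen=True)
class Entitlement:
    tenant: str
    service_class: str      # "guaranteed" | "elastic" | "spot"
    tokens_per_sec: float   # guaranteed speed
    kv_cache_bytes: int     # guaranteed "counter space"
    max_concurrency: int    # guaranteed simultaneous drinks

vip = Entitlement("acme-chatbot", "guaranteed", 1000.0, 2 << 30, 16)
print(vip.service_class, vip.tokens_per_sec)  # guaranteed 1000.0
```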
3. The Service Classes (The Hierarchy)
The system treats different customers differently based on their VIP status:
- Dedicated/Guaranteed (The VIPs): They have a reserved table. Even if the shop is empty, their table is theirs. If the shop is full, they never get kicked out.
- Elastic (The Regulars): They get a table, but if the VIPs need more space, the Regulars might have to squeeze in or wait a moment. However, if they were squeezed out earlier, the system remembers and gives them a "coupon" (Debt) to get a better spot later.
- Spot/Preemptible (The Walk-ins): They only get a table if there is extra space. If the VIPs or Regulars need the space, the Walk-ins are politely asked to leave immediately.
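The hierarchy above implies a preemption order when the shop runs out of space. A toy version, assuming each in-flight request is tagged with its class and token rate (the tagging scheme is invented for illustration):

```python
# Lower rank = evicted first. Guaranteed tenants are never evicted.
EVICTION_ORDER = {"spot": 0, "elastic": 1, "guaranteed": 2}

def pick_victims(running, tokens_needed):
    """Free at least `tokens_needed` tokens/sec by evicting the lowest
    classes first; stop (possibly short) before touching guaranteed work."""
    victims, freed = [], 0.0
    for req in sorted(running, key=lambda r: EVICTION_ORDER[r["cls"]]):
        if freed >= tokens_needed:
            break
        if req["cls"] == "guaranteed":
            break  # never touch the VIPs, even if we come up short
        victims.append(req)
        freed += req["tokens_per_sec"]
    return victims, freed

running = [
    {"id": "a", "cls": "guaranteed", "tokens_per_sec": 300.0},
    {"id": "b", "cls": "elastic",    "tokens_per_sec": 200.0},
    {"id": "c", "cls": "spot",       "tokens_per_sec": 150.0},
]
victims, freed = pick_victims(running, tokens_needed=250.0)
print([v["id"] for v in victims])  # ['c', 'b']
```

The walk-in goes first; the regular is squeezed only because the walk-in alone did not free enough; the VIP is untouchable.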
4. The "Debt" Mechanism (The Fairness Fairy)
This is the cleverest part. Imagine a Regular customer (Elastic) gets pushed out of their seat because a VIP arrived. They are annoyed.
- The system tracks this as "Debt."
- The more they are pushed out, the higher their "Debt" score gets.
- When the shop gets less busy, the system looks at the Debt scores. The customer with the highest Debt gets priority to sit down first, even if they are technically a "Regular."
- This ensures that no one is starved forever. It creates a fair-share system where everyone gets what they need over time.
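A minimal debt ledger in this spirit (the paper's actual accounting may differ; treating debt as tokens lost is an assumption here):

```python
class DebtLedger:
    """Tracks how much service each elastic tenant was denied."""

    def __init__(self):
        self.debt = {}

    def record_preemption(self, tenant, tokens_lost):
        # Each time a tenant is squeezed out, its debt grows.
        self.debt[tenant] = self.debt.get(tenant, 0.0) + tokens_lost

    def repay(self, tenant, tokens_served):
        # Extra service during quiet periods pays the debt back down.
        self.debt[tenant] = max(0.0, self.debt.get(tenant, 0.0) - tokens_served)

    def next_in_line(self, waiting):
        # When capacity frees up, the most-indebted waiting tenant goes first.
        return max(waiting, key=lambda t: self.debt.get(t, 0.0))

ledger = DebtLedger()
ledger.record_preemption("data-pipeline", 500.0)
ledger.record_preemption("coding-assistant", 50.0)
print(ledger.next_in_line(["coding-assistant", "data-pipeline"]))  # data-pipeline
ledger.repay("data-pipeline", 500.0)  # debt paid off; back to normal
```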
How It Works in Real Life (The Experiments)
The author tested this in a simulated Kubernetes cluster (a digital coffee shop).
Experiment 1: The VIP vs. The Walk-in
- Scenario: A huge rush of "Walk-in" (Spot) traffic floods the shop.
- Old Way: Everyone gets stuck in a long line. The VIPs wait 19+ seconds for their coffee.
- New Way (Token Pools): The system sees the line is getting too long. It politely tells the Walk-ins, "Sorry, come back later" (an HTTP 429 "Too Many Requests" response). The VIPs get their coffee in under 1.2 seconds. The shop stays efficient, and the important customers are happy.
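Gateway-side admission along these lines can be sketched as follows; the queue thresholds are invented purely for illustration:

```python
# Under pressure, spot traffic is shed with HTTP 429 before it ever
# reaches a GPU; guaranteed traffic is always let through.
def admit(service_class, queue_depth, queue_limit=100):
    """Return (admitted, http_status) for one incoming request."""
    if service_class == "guaranteed":
        return True, 200
    if service_class == "spot" and queue_depth > queue_limit // 2:
        return False, 429  # "Sorry, come back later"
    if queue_depth >= queue_limit:
        return False, 429
    return True, 200

print(admit("spot", queue_depth=80))        # (False, 429)
print(admit("guaranteed", queue_depth=80))  # (True, 200)
```

Rejecting at the door like this keeps the baristas working on admitted orders instead of letting a flood of walk-ins clog the queue for everyone.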
Experiment 2: The Fairness Test
- Scenario: The shop loses half its baristas (a server failure). Two Regular customers (a fast "Coding Assistant" and a slow "Data Pipeline") have to share the remaining space.
- Old Way: They might fight, or the slow one might hog the barista.
- New Way: The system knows the Coding Assistant needs speed (tight deadline) and the Data Pipeline can wait (loose deadline). It gives the Coding Assistant priority.
- The Twist: The Data Pipeline gets "pushed out" a lot, so its Debt score goes up. As the outage continues, the system slowly gives the Data Pipeline more time so it doesn't starve. Once the baristas return, the Debt is paid off, and everyone goes back to normal.
Why This Matters
The genius of Token Pools is that it acts like a bouncer at the door (the API Gateway) rather than trying to rearrange the furniture inside the kitchen (the GPU scheduler).
- It doesn't need to change how the AI models work.
- It doesn't need to rewrite the operating system.
- It just decides who gets in and who waits based on a fair, dynamic currency system before the work even starts.
In short: It turns a chaotic, first-come-first-served AI server into a well-managed, fair, and efficient club where VIPs get their drinks instantly, and everyone else gets a fair shot based on how long they've been waiting.