Original authors: Hanchen Li, Runyuan He, Qiuyang Mang, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, Ion Stoica

Published 2026-05-06

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are running a highly efficient, super-fast kitchen where a master chef (the AI) is cooking complex meals for many customers at once.

The Problem: The "Stop-and-Start" Kitchen

In a normal AI chatbot, the chef cooks a dish, serves it, and then immediately starts the next one. If the kitchen gets crowded, the chef throws away the half-prepped ingredients for the current dish to make room for a new customer's order. This works fine for simple chats.

But modern AI "agents" are different. They don't just chat; they act. They think, then they call a tool (like checking the weather or searching the web), wait for the result, and then continue cooking the same meal.

Here is the glitch in current systems:

1. The chef starts cooking a meal.
2. The chef pauses to call a tool (e.g., "Check the weather").
3. Because the chef is "paused," the kitchen system assumes the order is finished. It throws away the half-prepped ingredients (the KV Cache) to make room for other orders.
4. The tool finishes in 2 seconds. The chef is ready to continue.
5. Disaster: The ingredients are gone! The chef has to either re-buy them from a distant warehouse (CPU offloading) or re-chop everything from scratch (re-computation).
6. Worse, because the ingredients were thrown away, the chef has to wait in line behind other customers just to get a spot on the cutting board again.

This happens over and over. If an agent takes 20 steps to solve a problem, it may redo the same work and wait in line up to 20 times.

The Solution: CacheTTL (The "Keep-It-Ready" Timer)

The researchers built a new system called CacheTTL. Think of it as giving the chef a special "Keep-It-Ready" timer for every order.

Instead of immediately throwing away the ingredients when the chef pauses to call a tool, the system says: "Wait! This chef might be back in 2 seconds. Let's keep the ingredients on the counter for a specific amount of time (Time-To-Live, or TTL)."

Here is how it works simply:

Smart Prediction: The system looks at history. "Usually, when the chef calls 'Check Weather,' it takes about 2 seconds. When they call 'Search the Web,' it takes 5 seconds."
The Timer: It sets a timer based on that prediction. If the tool call is expected to take 2 seconds, the ingredients stay on the counter for 2.5 seconds.
The Payoff:
- If the chef returns in time: The ingredients are still there! The chef picks up right where they left off. No re-chopping, no waiting in line.
- If the chef is late: If the tool takes 10 seconds instead of 2, the timer runs out. The system safely throws the ingredients away to make room for other customers, preventing the kitchen from getting clogged up.

Why is this better than what we had before?

Previous systems tried to guess if they should keep the ingredients, but they only looked at one thing: "Is it expensive to re-buy the ingredients?" They ignored the bigger problem: "How long will the chef have to wait in line to get back to work?"

CacheTTL looks at both:

The cost of re-making the food.
The cost of waiting in line (queueing delay).

It calculates the perfect amount of time to keep the ingredients on the counter to save the most time overall.

The Results

The researchers tested this with real-world AI agents that solve software bugs, search the web, and write code. They found that:

Speed: The agents finished their tasks up to 8 times faster in some real-world tests.
Efficiency: The kitchen (GPU) could handle more orders at once without getting stuck.
Robustness: Even if the tool calls took longer than expected, the system didn't crash or get stuck; it just let the timer expire and moved on.

In a Nutshell

CacheTTL is like a smart kitchen manager who knows that when a chef pauses to make a phone call, they aren't done cooking. By keeping the ingredients ready for just the right amount of time, it stops the chef from having to start over or wait in line, making the whole kitchen run much smoother and faster.

Technical Summary: CacheTTL

Problem Statement

Large Language Model (LLM) inference engines currently rely on "end-of-turn" eviction policies for Key-Value (KV) cache management. In these systems, once a request finishes decoding, its KV cache is evicted from GPU memory to maximize utilization for new incoming requests. While effective for standard multi-turn chatbots, this policy fails for agentic workloads (e.g., software engineering agents, tool-using agents) that follow the ReAct paradigm.

In agentic workflows, inference steps are interleaved with external tool calls. These tool calls introduce pauses that are often shorter than the time a human takes to type a reply, yet long enough to trigger KV cache eviction in standard engines. When the agent returns from a tool call to initiate the next inference step, the system must either:

- Recompute the prefix (prefill), incurring significant latency.
- Reload the KV cache from CPU memory (if offloading is enabled), which introduces per-turn queueing delays. Even if the reload is fast, the request must wait in the queue for GPU memory to be freed by other active requests.

Existing solutions fail to address this holistically:

- InferCept considers reload costs but ignores the accumulating per-turn queueing delay and lacks robustness against variable tool durations.
- Autellix and Pie focus on static workflows or lack specific retention policies for dynamic tool calls.
- Static retention strategies risk deadlocking GPU memory if tool calls take longer than expected.

The core challenge is to retain KV caches long enough to avoid recomputation and queueing delays without monopolizing GPU memory during unpredictable tool execution times.

Methodology: CacheTTL

The authors propose CacheTTL, a serving system that introduces a Time-to-Live (TTL) mechanism for KV cache retention. Instead of immediate eviction, CacheTTL selectively pins KV caches in GPU memory for a calculated duration.

1. Cost-Benefit Utility Model

CacheTTL determines the optimal TTL ($\tau$) by modeling the trade-off between the cost of retaining the cache and the benefit of avoiding eviction.

Cost ($Cost(\tau, r)$): The opportunity cost of occupying GPU memory, calculated as the ratio of the request's memory usage to the average memory footprint of active requests, multiplied by the TTL duration ($\tau$). This represents the latency added to other requests blocked by the pinned cache.
Benefit ($Benefit(r)$): The sum of two avoided costs:
1. CacheMissCost: The time required to reconstruct the KV cache (prefill) or reload it from CPU.
2. OutofOrderCost: The per-turn queueing delay that would occur if the request were evicted and had to wait in the queue for GPU memory to become available. This term is critical and distinguishes CacheTTL from prior work like InferCept. It is scaled by a memoryfulness factor ($\eta$), which measures how the number of remaining steps correlates with the current progress (e.g., fixed-length programs have high memoryfulness).
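The two terms above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Request` fields, the function names, and the exact way $\eta$ multiplies the queueing term are assumptions made to match the prose description.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical per-request bookkeeping; field names are illustrative."""
    kv_bytes: float         # GPU memory held by this request's KV cache
    cache_miss_cost: float  # seconds to re-prefill or reload from CPU
    queueing_delay: float   # seconds expected waiting to be rescheduled
    memoryfulness: float    # eta, the scaling factor on the queueing term


def retention_cost(tau: float, req: Request, avg_kv_bytes: float) -> float:
    """Cost(tau, r): memory share relative to the average, times the TTL."""
    return (req.kv_bytes / avg_kv_bytes) * tau


def eviction_benefit(req: Request) -> float:
    """Benefit(r): CacheMissCost plus the eta-scaled OutofOrderCost (assumed form)."""
    return req.cache_miss_cost + req.memoryfulness * req.queueing_delay


req = Request(kv_bytes=2e9, cache_miss_cost=0.5, queueing_delay=2.0, memoryfulness=0.5)
print(retention_cost(3.0, req, avg_kv_bytes=1e9))  # cost of pinning for 3 seconds
print(eviction_benefit(req))                       # seconds saved if the cache survives
```

Under these assumptions, a request that holds twice the average memory pays twice the cost per second of retention, so large caches need a larger avoided delay to justify the same TTL.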

2. TTL Calculation

The system calculates the optimal TTL ($\tau^*$) by maximizing the expected net benefit:
$\tau^* = \arg\max_{\tau} \left( P(\tau, f) \times Benefit(r) - Cost(\tau, r) \right)$
Where $P(\tau, f)$ is the probability that tool call $f$ completes within time $\tau$, estimated using an empirical Cumulative Distribution Function (CDF) from historical tool-call records.
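As a rough sketch of this maximization, the Python below searches over candidate TTLs using an empirical CDF built from past tool-call durations. The function names, the `cost_rate` parameter (standing in for the memory-share ratio, so that the cost is `cost_rate * tau`), and the `fallback` value for an empty history are all hypothetical. Because the empirical CDF only jumps at observed durations while the cost grows linearly in between, checking $\tau = 0$ and each observed duration is enough under these assumptions.

```python
from bisect import bisect_right


def optimal_ttl(durations: list[float], benefit: float, cost_rate: float,
                fallback: float = 0.0) -> float:
    """Return the tau maximizing P(tau) * benefit - cost_rate * tau.

    durations: historical completion times (seconds) for one tool.
    fallback:  value returned when no history exists (assumed cold-start hook).
    """
    if not durations:
        return fallback
    ordered = sorted(durations)
    n = len(ordered)

    def net_benefit(tau: float) -> float:
        # Empirical CDF: fraction of past calls that finished within tau.
        p_done = bisect_right(ordered, tau) / n
        return p_done * benefit - cost_rate * tau

    # tau = 0 means "evict immediately"; other candidates are observed durations.
    candidates = [0.0] + ordered
    return max(candidates, key=net_benefit)


history = [1.0, 2.0, 2.0, 10.0]
print(optimal_ttl(history, benefit=4.0, cost_rate=0.5))   # covers the common 2 s case
print(optimal_ttl(history, benefit=4.0, cost_rate=10.0))  # retention too costly
```

In the first call the sketch picks a TTL that covers the typical 2-second calls but not the 10-second outlier; in the second, retention costs more than it can save, so the best choice is not to pin at all.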

Cold Start Handling: When historical data is scarce, CacheTTL defaults to a fixed TTL derived from an exponential distribution assumption or global averages.
Robustness: If a tool call exceeds the TTL, the KV cache is automatically evicted, preventing indefinite memory occupation and potential deadlocks.
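The expiry behavior can be illustrated with a small pin table. This is a sketch under stated assumptions, not vLLM or CacheTTL code: `PinTable` and its methods are hypothetical names, and timestamps are passed in explicitly so the example stays deterministic.

```python
class PinTable:
    """Hypothetical registry of KV caches pinned in GPU memory until a deadline."""

    def __init__(self) -> None:
        self._deadline: dict[str, float] = {}

    def pin(self, request_id: str, ttl: float, now: float) -> None:
        """Keep the request's KV cache resident until now + ttl."""
        self._deadline[request_id] = now + ttl

    def resume(self, request_id: str, now: float) -> bool:
        """Tool call returned. True means the cache is still pinned (a hit)."""
        deadline = self._deadline.pop(request_id, None)
        return deadline is not None and now <= deadline

    def expire(self, now: float) -> list[str]:
        """Unpin and return every request whose TTL has elapsed."""
        expired = [rid for rid, d in self._deadline.items() if d < now]
        for rid in expired:
            del self._deadline[rid]
        return expired


table = PinTable()
table.pin("a", ttl=2.5, now=0.0)
table.pin("b", ttl=2.5, now=0.0)
print(table.resume("a", now=2.0))  # tool returned in time
print(table.expire(now=3.0))       # "b" overstayed its TTL and is released
print(table.resume("b", now=3.1))  # late return finds no pinned cache
```

A scheduler loop would call something like `expire` on each iteration, so a slow tool can hold GPU memory for at most its TTL.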

3. Scheduling and System Design

Program-Level FCFS: CacheTTL combines TTL with a First-Come-First-Served (FCFS) policy at the program level (not just request level). Requests are ordered by the arrival time of the agent program they belong to, so later turns of an earlier program keep their place ahead of newer programs and maintain continuity.
TTL-Aware Priority: The scheduler assigns a multi-key priority tuple:
1. Preempted status.
2. TTL Status: Requests with active TTLs (pinned) are prioritized over unpinned ones.
3. Program-level arrival time.
Implementation: Built on top of vLLM with a modular Tool-Call Handler. This handler parses LLM outputs to identify tool calls, records inter-request intervals, and communicates TTL values to the scheduler. The design requires only modest code changes (~1k lines of Python) and is compatible with CPU/SSD offloading.
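The multi-key priority described above maps naturally onto tuple ordering. The sketch below is illustrative only: `QueuedRequest` and its fields are hypothetical, and it assumes that preempted requests come first, a direction the summary does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class QueuedRequest:
    """Hypothetical scheduler entry; not a vLLM data structure."""
    request_id: str
    preempted: bool         # was running and got preempted (assumed to go first)
    ttl_active: bool        # KV cache is still pinned under an unexpired TTL
    program_arrival: float  # arrival time of the whole agent program


def priority_key(req: QueuedRequest) -> tuple[bool, bool, float]:
    # Python sorts False before True, so negate flags that should rank first.
    return (not req.preempted, not req.ttl_active, req.program_arrival)


def schedule_order(queue: list[QueuedRequest]) -> list[QueuedRequest]:
    """Order requests by preempted status, then TTL status, then program arrival."""
    return sorted(queue, key=priority_key)


queue = [
    QueuedRequest("new", preempted=False, ttl_active=False, program_arrival=1.0),
    QueuedRequest("pinned", preempted=False, ttl_active=True, program_arrival=5.0),
    QueuedRequest("preempted", preempted=True, ttl_active=False, program_arrival=9.0),
    QueuedRequest("old", preempted=False, ttl_active=False, program_arrival=0.5),
]
print([r.request_id for r in schedule_order(queue)])
```

Using the program's arrival time as the final key is what makes the FCFS policy program-level: a late turn of an old program sorts ahead of the first turn of a newer one.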

Key Contributions

Problem Identification: The paper identifies that existing KV cache eviction policies degrade agentic performance by ignoring per-turn queueing delays and the variability of tool call durations.
CacheTTL System: A novel serving system utilizing a TTL mechanism to dynamically retain KV caches. It balances prefill/reload costs against memory contention costs.
Novel Cost Model: The introduction of the OutofOrderCost term, which quantifies the queueing delay penalty of eviction, and the memoryfulness factor to adapt to different workload structures.
Robustness: The system handles unpredictable tool durations by enforcing a maximum retention time, preventing deadlocks and memory pressure.

Experimental Results

The authors evaluated CacheTTL using Llama-3.1 (8B/70B), Gemma-3 (12B), and GLM-4.5 (355B) across SWE-Bench, BFCL, and OpenHands workloads on various hardware (A100, H100, B200).

Latency Reduction: CacheTTL reduces average job completion time by 1.12x to 3.66x compared to baselines (vLLM, Autellix, InferCept). In real-world SWE-agent tests on Company A's internal testbed, it achieved up to 8.18x improvement in delay.
Throughput: Throughput improved by 1.10x to 3.22x.
Robustness: The system maintains stable performance as the number of turns increases, whereas baselines degrade significantly due to accumulated queueing delays.
Compatibility: CacheTTL outperforms baselines even when combined with CPU and SSD offloading, demonstrating that its scheduling benefits are orthogonal to memory offloading techniques.
Overhead: The scheduling overhead is negligible (single-digit milliseconds), far outweighed by the end-to-end latency savings.

Significance and Claims

The paper claims that CacheTTL provides the first robust, tool-aware KV cache management strategy specifically designed for multi-turn agentic workloads. By shifting from "end-of-turn" eviction to a TTL-based retention policy, the system effectively bridges the gap between LLM inference and external tool execution.

The authors emphasize that their approach is modular and can be integrated into existing inference engines with minimal changes. They argue that principled, tool-aware KV management is essential for efficient agent serving, particularly as agentic workflows become more complex and span more turns. The authors also open-source their traces, code, and testbed to foster further research in agent serving.

Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live