Original authors: Shubham Tiwari, Tapan Chugh, Nash Rickert, Simon Peter, Ratul Mahajan, Haiying Shen

Published 2026-06-16

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a highly skilled chef (the AI) working in a busy kitchen (the computer server) to help a team of researchers build complex software.

In a normal chatbot scenario, a customer walks up, asks a question, the chef answers, and the customer leaves. It's a quick transaction.

But Coding Agents are different. They are like a chef who has been hired to build a whole house. The chef doesn't just answer one question; they spend hours in a continuous loop:

The chef thinks of a plan.
They grab a tool (like a hammer or a saw) to do a task.
They wait for the tool to finish.
They look at the result, think of the next step, and grab a different tool.
They repeat this hundreds of times for a single project.

The problem is that the kitchen has a very small counter space (the GPU memory). Every time the chef works on a step, they leave a pile of ingredients and notes on the counter (this is called the KVCache). Because the project is so long, the counter gets full.

The Problem: The "Thrashing" Kitchen

In current kitchens (existing AI systems), the manager runs on a simple rule: "First Come, First Served."

If a new order comes in, the manager clears the counter to make space, even if the current chef is just about to finish a step and needs those exact ingredients back in 10 seconds.
The manager also uses a rule called "Least Recently Used" (LRU). This means they throw away the ingredients that haven't been touched for the longest time.

Why this fails for coding agents:

The "First Come" mistake: The manager interrupts the chef building the house to let a new, unrelated order start. The chef has to clear their notes, go get the ingredients from the freezer (slow), and start over. This wastes huge amounts of time.
The "LRU" mistake: The chef might be waiting for a slow tool to finish (like waiting for paint to dry). The manager sees that the chef hasn't touched the counter in a while and throws away the notes, thinking they are useless. But the chef was just waiting! Now the chef has to re-read the blueprints from scratch.

This constant clearing and re-filling of the counter is called "thrashing." It's like a chef running back and forth to the freezer so much that they never actually cook anything.

The Solution: CacheWise

The researchers built a new kitchen manager called CacheWise. It uses two clever tricks to stop the chef from wasting time:

1. The "Prefix-Aware" Scheduler (The Smart Queue)

Instead of just taking orders in the order they arrive, CacheWise looks at the chef's current notes.

Analogy: If Chef A is already halfway through a recipe and just needs one more ingredient, and Chef B is starting a brand new recipe, CacheWise lets Chef A go first.
Why: It's much faster to finish a recipe that's already 90% done than to start a new one from scratch. This keeps the "notes" (KVCache) on the counter where they belong, so the chef doesn't have to run to the freezer.

2. The "Predictive" Eviction (The Crystal Ball)

When the counter must be cleared because it's full, the manager has to decide whose notes to throw away.

Old Way: Throw away the notes that haven't been touched in the longest time.
CacheWise Way: Look at the tool the chef is using.
- If the chef is using a "grep" tool (a quick search), the manager knows they will be back in 1 second. Don't throw away the notes!
- If the chef is using a "pytest" tool (running a huge test suite), the manager knows that will take 5 minutes. It's safe to throw away the notes for now.
How it works: The system looks at the "name" and "arguments" of the tool the chef is using. It has learned from past cooking sessions that certain tools take a long time and others are instant. It uses this to guess when the chef will need those notes again.

The Results

When the researchers tested this new manager on real-world coding data:

Less Running Around: The chef had to go to the freezer (re-compute or move data) 2 to 2.6 times less often.
Faster House Building: The total time to finish a coding project (the "session") was up to 3.5 times faster.
Better Efficiency: The kitchen produced more useful code per hour because the chef spent less time waiting and re-doing work.

Summary

CacheWise is like a smart kitchen manager who understands that coding agents are long-term projects. Instead of treating every request as a separate, isolated event, it keeps the chef's notes on the counter as long as possible and only throws them away when it's certain the chef won't need them for a while. This stops the "thrashing" and lets the AI build software much faster.

Technical Summary: CacheWise

Problem Statement

Coding agents represent a rapidly growing class of Large Language Model (LLM) applications where the model generates code and executes it via a closed-loop sequence of tool calls (e.g., reading files, running tests, modifying code). Unlike traditional chatbots, which involve short, discrete user-turn interactions, coding agent sessions are long-running, accumulate massive context prefixes, and are dominated by tool-initiated turns rather than direct user input.

Existing LLM serving systems (e.g., vLLM, Mooncake) are optimized for chat workloads using First-Come-First-Served (FCFS) scheduling and Least-Recently-Used (LRU) KVCache eviction. These workload-agnostic policies fail to address the specific characteristics of coding agents:

High Prefix Overlap: Requests within the same session frequently reuse large, growing prefixes of the KVCache.
Tool-Dependent Reuse Intervals: The time until a session's KVCache is reused depends heavily on the duration of the currently executing tool call, which varies significantly by tool type and arguments.
Memory Thrashing: FCFS scheduling interleaves requests from multiple sessions, expanding the active working set and forcing the eviction of prefixes that will be needed imminently. LRU eviction, relying solely on past access recency, cannot distinguish between a session waiting for a short tool call and one waiting for a long one, leading to the eviction of high-value blocks and subsequent recomputation or data movement.

This mismatch results in KVCache thrashing, reduced token goodput (useful tokens generated per unit time), and significantly increased session completion times.
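The thrashing effect of interleaving can be seen in a toy model. The sketch below is not taken from the paper: it assumes two hypothetical sessions with a fixed three-block prefix each, a four-block cache, and plain LRU eviction, and it counts how many blocks must be rebuilt or fetched under two request orders. Real sessions have growing prefixes, so this only illustrates the direction of the effect.

```python
from collections import OrderedDict

def count_reloads(request_order, blocks_per_session, capacity):
    """Count blocks that must be recomputed or reloaded under LRU.

    Each request touches every block of its session's prefix.
    Blocks are identified as (session, index); capacity is in blocks.
    """
    cache = OrderedDict()  # keys ordered from least to most recently used
    reloads = 0
    for session in request_order:
        for index in range(blocks_per_session):
            block = (session, index)
            if block in cache:
                cache.move_to_end(block)  # hit: refresh recency
                continue
            reloads += 1  # miss: block must be rebuilt or fetched
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return reloads

# FCFS-style interleaving of two sessions versus serving one session at a time.
interleaved = ["A", "B", "A", "B", "A", "B"]
grouped = ["A", "A", "A", "B", "B", "B"]

print(count_reloads(interleaved, blocks_per_session=3, capacity=4))  # prints 18
print(count_reloads(grouped, blocks_per_session=3, capacity=4))      # prints 6
```

In the interleaved order every single access misses, because each session's blocks are pushed out just before they are needed again. Grouping the same requests by session pays only the initial load for each prefix.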

Methodology: CacheWise

The authors propose CacheWise, a KVCache management layer designed specifically for coding agent workloads. Implemented as an extension to vLLM, CacheWise introduces two core mechanisms to optimize KVCache reuse and minimize eviction overhead:

1. Prefix-Aware Request Scheduling

Instead of FCFS, CacheWise prioritizes inference requests by how much of their prefix overlaps with KVCache blocks already resident in accelerator (XPU) memory.

Mechanism: At any time $t$, the scheduler selects the request $r_i$ that requires the fewest additional blocks, $a_i(t)$, to be allocated.
Rationale: This greedy approach minimizes the immediate memory pressure and reduces the "thrashing" effect on other sessions. It approximates Shortest-Job-First scheduling, optimizing for end-to-end session completion time rather than per-request latency (TTFT/TBT), which is less critical in closed-loop agent workflows where no user is waiting on individual requests.
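The selection rule above can be sketched in a few lines. This is a minimal illustration, not the CacheWise implementation: the `Request` class, the `resident_blocks` map, and the tie-breaking by arrival order are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Request:
    session_id: str
    total_blocks: int  # KVCache blocks needed to hold the full context

def additional_blocks(request, resident_blocks):
    """a_i(t): blocks that must be newly allocated to run this request."""
    resident = resident_blocks.get(request.session_id, 0)
    return max(request.total_blocks - resident, 0)

def pick_next(queue, resident_blocks):
    """Greedy choice: the queued request that needs the fewest new blocks.

    min() returns the first minimum, so ties fall back to arrival order
    (an assumed tie-break, not one stated in the source).
    """
    return min(queue, key=lambda r: additional_blocks(r, resident_blocks))

# Session B arrived first but has nothing cached; session A has 48 of its
# 50 blocks already resident, so it needs only 2 new blocks.
queue = [Request("B", 40), Request("A", 50)]
resident_blocks = {"A": 48}

chosen = pick_next(queue, resident_blocks)
print(chosen.session_id, additional_blocks(chosen, resident_blocks))  # prints: A 2
```

Under FCFS, session B would run first and allocate 40 blocks, which could push out session A's nearly complete prefix. The greedy rule runs A first at a cost of 2 blocks.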

2. Predictive KVCache Eviction

CacheWise replaces LRU eviction with a policy that estimates the time to next reuse, $\tau_i$, for each session's resident blocks.

Insight: While predicting exact tool execution times is difficult, the relative order of reuse across sessions can be estimated using lightweight predictors trained on historical tool call metadata.
Predictor: The system analyzes tool call metadata (tool name and arguments) to estimate the expected remaining time for a tool call to complete.
- Semantic Clustering: To handle the high variance in execution times for the same tool (e.g., bash commands), CacheWise clusters historical samples based on TF-IDF embeddings of tool arguments. This allows the predictor to distinguish between a simple ls command and a complex pytest run.
- Decision Logic: When memory pressure requires eviction, CacheWise evicts blocks from the session with the largest predicted $\tau_i$ (i.e., the one that will be reused furthest in the future). This is a practical approximation of Belady's optimal algorithm, which evicts the item whose next use lies furthest ahead.
Implementation: The system maintains an eviction heap. As time elapses, predictions are periodically re-evaluated to account for the shifting "time-to-next-reuse" as tool calls progress.
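The eviction decision can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the `MEAN_DURATION` table and its values are hypothetical, and a lookup keyed by tool name and first argument token stands in for the TF-IDF clustering described above. The heap is rebuilt on each call, which is one simple way to model periodic re-evaluation.

```python
import heapq

# Hypothetical historical mean durations in seconds, keyed by a coarse
# (tool name, first argument token) cluster. This lookup is a stand-in
# for the TF-IDF clustering of tool arguments.
MEAN_DURATION = {("bash", "ls"): 0.5, ("bash", "pytest"): 300.0, ("grep", ""): 1.0}
DEFAULT_DURATION = 10.0  # assumed fallback for unseen tool calls

def predict_duration(tool, args):
    """Look up the expected duration, trying the specific cluster first."""
    tokens = args.split()
    first = tokens[0] if tokens else ""
    for key in ((tool, first), (tool, "")):
        if key in MEAN_DURATION:
            return MEAN_DURATION[key]
    return DEFAULT_DURATION

def time_to_next_reuse(call, now):
    """tau_i: predicted remaining time until the tool call returns."""
    elapsed = now - call["started_at"]
    return max(predict_duration(call["tool"], call["args"]) - elapsed, 0.0)

def choose_victim(pending_calls, now):
    """Return the session whose blocks are predicted to be reused last."""
    # heapq is a min-heap, so tau is negated to pop the largest value first.
    heap = [(-time_to_next_reuse(call, now), sid)
            for sid, call in pending_calls.items()]
    heapq.heapify(heap)
    return heapq.heappop(heap)[1]

pending = {
    "s1": {"tool": "grep", "args": "-rn TODO src", "started_at": 0.0},
    "s2": {"tool": "bash", "args": "pytest tests/", "started_at": 0.0},
    "s3": {"tool": "bash", "args": "ls -la", "started_at": 0.0},
}
print(choose_victim(pending, now=0.2))  # prints s2

# Near the end of the pytest run, s2 is about to be reused, so a freshly
# started grep call in another session becomes the better victim.
later = {
    "s2": pending["s2"],
    "s4": {"tool": "grep", "args": "-rn FIXME src", "started_at": 299.5},
}
print(choose_victim(later, now=299.9))  # prints s4
```

The second call shows why predictions need refreshing: the same session moves from best eviction candidate to worst as its tool call nears completion.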

Key Contributions

Dataset and Characterization: The authors collected and analyzed CATraces, a dataset of real-world coding assistant traces from researchers using Claude Code. The summary presents it as the first public dataset of its kind, and the analysis shows that coding agents differ fundamentally from chatbots in session length, context growth, and the dominance of tool-initiated turns.
System Design (CacheWise): The design and implementation of a KVCache management layer that combines prefix-aware scheduling with predictive eviction based on tool metadata.
Empirical Evaluation: Comprehensive evaluation using real-world traces on a multi-GPU testbed, demonstrating significant improvements over state-of-the-art baselines.

Results

Evaluated on real-world coding agent traces using a 32B parameter model (Qwen2.5-Coder-32B-Instruct) on H200 GPUs:

Session Completion Time: CacheWise reduces total agent session completion time by 2.7× to 3.5× compared to vLLM and InferCept.
KVCache Evictions: It reduces the number of KVCache evictions by 2–2.6×, significantly lowering the overhead of recomputation and data movement.
Token Goodput: CacheWise improves token goodput by 1.64× to 2× at high load levels ($N > 10$ concurrent sessions).
Throughput: Request throughput increases by 1.5× to 2×.
Data Movement: In systems utilizing offloading (GPU $\leftrightarrow$ CPU), CacheWise reduces KVCache transfer volume by 2–2.6×.
Ablation Studies:
- Predictive Eviction: Contributes a 1.2×–1.6× improvement in token goodput over systems with only prefix-aware scheduling.
- Prefix-Aware Scheduling: Contributes a 1.38×–1.7× improvement in token goodput even without predictive eviction.
- Predictor Accuracy: Finer-grained clustering of tool arguments (up to 100 clusters) yields the best performance, improving session completion time by up to 19% compared to coarse estimators.
Overhead: CacheWise incurs 2.4×–3× higher CPU scheduling overhead than vLLM, but this cost is small compared to the reduction in GPU model execution time (e.g., a net reduction of ~4.7s per request at $N=40$).

Significance

The paper argues that coding agents constitute a distinct class of LLM workloads that cannot be efficiently served by systems designed for chatbots. The significance of CacheWise lies in its ability to decouple the serving system from the agent implementation framework, allowing it to optimize for the specific "closed-loop" nature of coding agents without requiring changes to the agents themselves. By leveraging tool metadata to predict reuse, CacheWise effectively mitigates the memory pressure and thrashing inherent in long-running, context-heavy sessions, demonstrating that session-level optimization is critical for the scalability of agentic AI applications. The authors have open-sourced both the dataset and the implementation to facilitate further research in this domain.

CacheWise: Understanding Workloads and Optimizing KVCache Management for Efficiently Serving LLM Coding Agents