CacheWise: Understanding Workloads and Optimizing KVCache Management for Efficiently Serving LLM Coding Agents

The paper introduces CacheWise, a KVCache management layer designed for LLM coding agents that leverages real-world workload analysis to combine prefix-aware scheduling with metadata-guided eviction, significantly reducing cache evictions and accelerating session completion times compared to conventional serving policies.

Original authors: Shubham Tiwari, Tapan Chugh, Nash Rickert, Simon Peter, Ratul Mahajan, Haiying Shen

Published 2026-06-16

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine you are a highly skilled chef (the AI) working in a busy kitchen (the computer server) to help a team of researchers build complex software.

In a normal chatbot scenario, a customer walks up, asks a question, the chef answers, and the customer leaves. It's a quick transaction.

But coding agents are different. They are like a chef who has been hired to build a whole house. The chef doesn't just answer one question; they spend hours in a continuous loop:

  1. The chef thinks of a plan.
  2. They grab a tool (like a hammer or a saw) to do a task.
  3. They wait for the tool to finish.
  4. They look at the result, think of the next step, and grab a different tool.
  5. They repeat this hundreds of times for a single project.
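Outside the kitchen, the loop above can be sketched in a few lines of Python. This is only an illustration of the think/act/wait cycle, not code from the paper: `run_agent`, `plan_next_step`, and `run_tool` are hypothetical names, and the "model" here is a stub that replays three fixed steps.

```python
def run_agent(plan_next_step, run_tool, max_steps=100):
    """Alternate between model calls and tool calls until the model stops."""
    context = []  # grows every step; its KVCache is what the server tries to keep
    for _ in range(max_steps):
        action = plan_next_step(context)  # model call over the whole context so far
        if action is None:                # the model decided the task is done
            break
        result = run_tool(action)         # the session sits idle while the tool runs
        context.append((action, result))
    return context


# Stand-ins for the model and the tools (both hypothetical).
def plan_next_step(context):
    steps = ["grep TODO", "edit main.py", "pytest"]
    return steps[len(context)] if len(context) < len(steps) else None


def run_tool(action):
    return f"output of {action}"


history = run_agent(plan_next_step, run_tool)
print(len(history))  # 3
```

Each pass through the loop sends the whole growing context back to the model, which is why keeping the earlier "notes" on the counter matters so much.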

The problem is that the kitchen has a very small counter space (the GPU memory). Every time the chef works on a step, they leave a pile of ingredients and notes on the counter (this is called the KVCache). Because the project is so long, the counter gets full.

The Problem: The "Thrashing" Kitchen

In current kitchens (existing AI systems), the manager runs on a simple rule: "First Come, First Served."

  • If a new order comes in, the manager clears the counter to make space, even if the current chef is just about to finish a step and needs those exact ingredients back in 10 seconds.
  • The manager also uses a rule called "Least Recently Used" (LRU). This means they throw away the ingredients that haven't been touched for the longest time.

Why this fails for coding agents:

  1. The "First Come" mistake: The manager interrupts the chef building the house to let a new, unrelated order start. The chef has to clear their notes, go get the ingredients from the freezer (slow), and start over. This wastes huge amounts of time.
  2. The "LRU" mistake: The chef might be waiting for a slow tool to finish (like paint drying). The manager sees that the chef hasn't touched the counter in a while and throws away the notes, thinking they are useless. But the chef was just waiting! Now the chef has to re-read the blueprints from scratch.

This constant clearing and re-filling of the counter is called "thrashing." It's like a chef running back and forth to the freezer so much that they never actually cook anything.
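To see the "LRU" mistake concretely, here is a toy cache in Python. It is a simplified sketch, not how any real serving system stores its KVCache: capacity is counted in whole sessions rather than memory blocks, and the class and session names are made up for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Toy KVCache keyed by session id; capacity is counted in sessions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # least recently used entry comes first
        self.evictions = []

    def touch(self, session_id):
        """Record an access; evict the least recently used session if full."""
        if session_id in self.entries:
            self.entries.move_to_end(session_id)
            return
        if len(self.entries) >= self.capacity:
            victim, _ = self.entries.popitem(last=False)
            self.evictions.append(victim)
        self.entries[session_id] = True


cache = LRUCache(capacity=2)
cache.touch("agent-A")  # agent A finishes a step, then waits on a slow tool
cache.touch("chat-1")   # unrelated requests arrive while A is waiting
cache.touch("chat-2")   # the cache is full, and A is the least recently used
print(cache.evictions)  # ['agent-A']
```

Agent A was only waiting for its tool, yet it is the one that loses its notes and has to rebuild them on its next step.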

The Solution: CacheWise

The researchers built a new kitchen manager called CacheWise. It uses two clever tricks to stop the chef from wasting time:

1. The "Prefix-Aware" Scheduler (The Smart Queue)

Instead of just taking orders in the order they arrive, CacheWise looks at the chef's current notes.

  • Analogy: If Chef A is already halfway through a recipe and just needs one more ingredient, and Chef B is starting a brand new recipe, CacheWise lets Chef A go first.
  • Why: It's much faster to finish a recipe that's already 90% done than to start a new one from scratch. This keeps the "notes" (KVCache) on the counter where they belong, so the chef doesn't have to run to the freezer.
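The difference between the two queueing rules can be sketched as a change of sort key. This is a minimal illustration under assumptions, not the paper's scheduler: the `Request` fields, the token counts, and the idea that the server already knows how many tokens of each request are cached are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: int        # arrival order, lower means earlier
    prompt_tokens: int  # total tokens in the request's context
    cached_tokens: int  # tokens whose KVCache is still on the GPU (assumed known)


def fcfs_order(queue):
    """First come, first served: ignore the cache entirely."""
    return sorted(queue, key=lambda r: r.arrival)


def prefix_aware_order(queue):
    """Prefer requests with the most reusable cache; break ties by arrival."""
    return sorted(queue, key=lambda r: (-r.cached_tokens, r.arrival))


queue = [
    Request("chef-B-new-recipe", arrival=0, prompt_tokens=4000, cached_tokens=0),
    Request("chef-A-next-step", arrival=1, prompt_tokens=40000, cached_tokens=36000),
]
print([r.name for r in fcfs_order(queue)])
# ['chef-B-new-recipe', 'chef-A-next-step']
print([r.name for r in prefix_aware_order(queue)])
# ['chef-A-next-step', 'chef-B-new-recipe']
```

A practical scheduler would also need some way to stop brand-new sessions from waiting forever behind long-running ones; this sketch leaves that out.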

2. The "Predictive" Eviction (The Crystal Ball)

When the counter must be cleared because it's full, the manager has to decide whose notes to throw away.

  • Old Way: Throw away the notes that haven't been touched in the longest time.
  • CacheWise Way: Look at the tool the chef is using.
    • If the chef is using a "grep" tool (a quick search), the manager knows they will be back in 1 second. Don't throw away the notes!
    • If the chef is using a "pytest" tool (running a huge test suite), the manager knows that will take 5 minutes. It's safe to throw away the notes for now.
  • How it works: The system looks at the "name" and "arguments" of the tool the chef is using. It has learned from past cooking sessions that certain tools take a long time and others are instant. It uses this to guess when the chef will need those notes again.
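The "crystal ball" idea can be sketched as a small lookup of past tool durations. Everything here is an assumption made for illustration: the class and function names are invented, the predictor is a plain average per tool name (it ignores arguments, which the description above says the real system also uses), and the timings are made-up numbers matching the grep and pytest examples.

```python
from collections import defaultdict
from statistics import mean


class ToolDurationModel:
    """Guess how long a tool call will take from past observations."""

    def __init__(self, default_seconds=10.0):
        self.history = defaultdict(list)
        self.default_seconds = default_seconds  # used for tools never seen before

    def observe(self, tool_name, seconds):
        self.history[tool_name].append(seconds)

    def predict(self, tool_name):
        samples = self.history.get(tool_name)
        return mean(samples) if samples else self.default_seconds


def pick_victim(waiting, model):
    """waiting maps session id -> (tool name, time the tool call started).

    Evict the session that is expected to come back last.
    """
    def expected_return(item):
        _, (tool, started) = item
        return started + model.predict(tool)

    session_id, _ = max(waiting.items(), key=expected_return)
    return session_id


model = ToolDurationModel()
for seconds in (0.8, 1.2):
    model.observe("grep", seconds)
for seconds in (280.0, 320.0):
    model.observe("pytest", seconds)

waiting = {
    "session-1": ("grep", 100.0),    # idle longest, but due back in about a second
    "session-2": ("pytest", 101.0),  # touched more recently, but gone for minutes
}
print(pick_victim(waiting, model))  # session-2
```

Plain LRU would have picked session-1 here, because it has been idle the longest. Ranking by expected return time keeps its notes on the counter instead.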

The Results

When the researchers tested this new manager on real-world coding data:

  • Less Running Around: The chef had to go to the freezer (re-compute or move data) 2 to 2.6 times less often.
  • Faster House Building: The total time to finish a coding project (the "session") was up to 3.5 times faster.
  • Better Efficiency: The kitchen produced more useful code per hour because the chef spent less time waiting and re-doing work.

Summary

CacheWise is like a smart kitchen manager who understands that coding agents are long-term projects. Instead of treating every request as a separate, isolated event, it keeps the chef's notes on the counter as long as possible and only throws them away when it's certain the chef won't need them for a while. This stops the "thrashing" and lets the AI build software much faster.
