Adaptive Multi-Objective Tiered Storage Configuration for KV Cache in LLM Service

This paper introduces Kareto, an adaptive multi-objective optimizer for tiered KV cache storage. Kareto efficiently navigates the complex configuration space to dynamically balance cost, throughput, and latency, significantly outperforming static strategies in LLM inference services.

Xianzhe Zheng, Zhengheng Wang, Ruiyan Ma, Rui Wang, Xiyu Wang, Rui Chen, Peng Zhang, Sicheng Pan, Zhangheng Huang, Chenxin Wu, Yi Zhang, Bo Cai, Kan Liu, Teng Ma, Yin Du, Dong Deng, Sai Wu, Guoyun Zhu, Wei Zhang, Feifei Li

Published Wed, 11 Ma

Imagine you are running a massive, super-smart library (a Large Language Model, or LLM) that helps people write stories, code, and answers. To work fast, this library keeps a "cheat sheet" of the most recent words it has read in its immediate memory (the GPU's HBM). This is called the KV Cache.

However, there's a problem: The cheat sheet gets huge. If too many people ask questions at once, or if the questions are very long, the cheat sheet runs out of space on the expensive, high-speed memory.

The Old Way: Buying Too Much or Too Little

Traditionally, library managers had to guess how much extra storage to buy.

  • The "Over-Prepared" Manager: Buys a massive, expensive warehouse (DRAM) just in case. It's fast, but it costs a fortune, and often sits half-empty.
  • The "Under-Prepared" Manager: Saves money by buying a tiny warehouse. When the cheat sheet gets too big, they have to throw things away and re-calculate everything from scratch. This makes the library slow and frustrating for users.
  • The "One-Size-Fits-All" Rule: They use a simple rule like "Keep everything for 5 minutes." But some words are used again and again (like "the" or "hello"), while others are one-time only. Keeping the one-time words wastes space, and deleting the popular words too soon hurts speed.

The New Solution: Kareto (The Smart Librarian)

The paper introduces Kareto, a smart system that acts like an adaptive, data-driven librarian. Instead of guessing, Kareto figures out the perfect balance between speed, cost, and capacity automatically.

Here is how Kareto works, broken down into three simple concepts:

1. The "Crystal Ball" Simulator

Before making any changes, Kareto uses a high-fidelity simulator. Think of this as a "flight simulator" for the library.

  • It takes real historical data (what people actually asked yesterday and last week).
  • It runs thousands of "what-if" scenarios in a computer: "What if we have 500GB of fast memory and 2TB of slow disk? What if we keep things for 10 minutes vs. 1 hour?"
  • It predicts exactly how fast the library would be and how much it would cost in each scenario.
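The paper's simulator is far richer than anything shown here, but the core loop of a "what-if" run can be sketched as replaying a historical request trace against a candidate configuration and scoring the result. In this minimal sketch, the cache model, the two-tier capacity pooling, and the per-GB prices are all illustrative assumptions, not the paper's actual model:

```python
from dataclasses import dataclass

@dataclass
class Config:
    hbm_gb: float      # fast-tier capacity (illustrative)
    dram_gb: float     # slow-tier capacity (illustrative)
    ttl_s: float       # how long an unused entry is kept

def replay(trace, cfg):
    """Replay (timestamp, key, size_gb) events against a toy TTL cache.

    Returns (hit_rate, cost). A real high-fidelity simulator would also
    model transfer latency between tiers; this only tracks hits and a
    hypothetical hardware cost.
    """
    cache = {}          # key -> (last_access_time, size_gb)
    used = 0.0
    hits = total = 0
    capacity = cfg.hbm_gb + cfg.dram_gb
    for t, key, size in trace:
        # Evict entries whose TTL has expired.
        for k in [k for k, (last, _) in cache.items() if t - last > cfg.ttl_s]:
            used -= cache.pop(k)[1]
        total += 1
        if key in cache:
            hits += 1
            cache[key] = (t, cache[key][1])
        elif used + size <= capacity:
            cache[key] = (t, size)
            used += size
    # Hypothetical $/GB prices per tier, for ranking configurations only.
    cost = cfg.hbm_gb * 10.0 + cfg.dram_gb * 1.0
    return hits / max(total, 1), cost

trace = [(0, "a", 1), (1, "b", 1), (2, "a", 1), (10, "a", 1)]
hit_rate, cost = replay(trace, Config(hbm_gb=2, dram_gb=4, ttl_s=5))
```

Running this over many candidate configurations turns each "what if?" into a (speed, cost) point that the next step can compare.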

2. Finding the "Sweet Spot" (The Pareto Frontier)

Kareto knows you can't have everything. If you want the fastest speed, it costs more. If you want the lowest cost, it might be slower.

  • Kareto draws a map of all possible options.
  • It finds the Pareto Frontier: This is the "Goldilocks Zone" of configurations. These are the setups where you cannot get faster without paying more, and you cannot get cheaper without slowing down.
  • It gives the library manager a menu of these perfect options, so they can choose based on their current needs (e.g., "We need speed today" vs. "We need to save money today").
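Computing that "menu" is a standard non-dominated filter over the simulated points. A minimal sketch (assuming each configuration has been scored as a (latency, cost) pair where lower is better on both axes):

```python
def pareto_frontier(points):
    """Return the non-dominated subset of (latency, cost) pairs.

    A point is dominated if some other point is at least as good on
    both axes and strictly better on one.
    """
    frontier = []
    best_cost = float("inf")
    # Sort by latency, then sweep keeping only strictly improving costs:
    # anything with a higher cost than a faster config is dominated.
    for lat, cost in sorted(points):
        if cost < best_cost:
            frontier.append((lat, cost))
            best_cost = cost
    return frontier

configs = [(10, 5), (12, 3), (9, 9), (11, 4), (15, 3)]
# pareto_frontier(configs) keeps only the "Goldilocks" configurations;
# (15, 3) is dropped because (12, 3) is just as cheap but faster.
```

Each surviving point is one entry on the manager's menu: pick the left end when speed matters, the right end when budget matters.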

3. The "Smart Filing System" (Group TTL)

This is Kareto's secret sauce. Instead of treating every piece of information the same, Kareto looks at patterns.

  • The Problem: Imagine a filing cabinet where you throw out a file after 5 minutes. If a file is used 100 times an hour, throwing it out is a disaster. If a file is never used again, keeping it is a waste.
  • The Kareto Fix: Kareto groups files by their "family" (prefixes).
    • Example: In a chatbot, the phrase "Once upon a time" is used in thousands of stories. Kareto sees this pattern and says, "Keep this family of files around much longer!"
    • Meanwhile, for a unique, one-time question, it says, "Throw this away immediately."
  • This "Group TTL" (Time-To-Live) ensures the storage is filled with the right things, not just random things.
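The gist of Group TTL can be sketched as counting how often each prefix "family" recurs and giving hot families a longer lifetime. The thresholds and TTL values below are invented for illustration and are not the paper's actual policy:

```python
from collections import Counter

def assign_group_ttls(requests, base_ttl=300.0, hot_ttl=3600.0, min_reuse=10):
    """Toy prefix-grouped TTL assignment.

    requests: list of prompt prefixes (e.g. the first few words of each
    query). Prefixes reused at least `min_reuse` times get the long TTL;
    one-off prefixes keep the short default and are evicted quickly.
    """
    counts = Counter(requests)
    return {prefix: (hot_ttl if n >= min_reuse else base_ttl)
            for prefix, n in counts.items()}

reqs = ["Once upon a time"] * 12 + ["What is 7*6?"]
ttls = assign_group_ttls(reqs)
# The popular story prefix gets the long TTL; the one-off question
# gets the short default.
```

In practice the groups would be token-level prefixes of the KV cache rather than raw strings, and the TTLs would come out of the optimizer rather than fixed constants, but the filing-cabinet intuition is the same: popular families stay, one-offs go.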

The Results: Why It Matters

When the researchers tested Kareto with real-world data, the results were impressive:

  • Speed: It made the library 58% faster in some cases by keeping the right things in the fast memory.
  • Cost: It saved 20% on costs by avoiding the purchase of unnecessary expensive memory.
  • Throughput: It handled 9% more requests per second.

The Big Picture

Think of Kareto as the difference between a static, rigid robot and a flexible, intelligent human.

  • Old Systems: "We have 1TB of memory. That's it. Deal with it."
  • Kareto: "I see you have a rush hour on Mondays and a quiet Tuesday. I'll rent a cheap, slow warehouse for the slow days and upgrade to a fast, expensive one for the rush. Plus, I'll only keep the popular books on the front shelf and the rare ones in the basement."

Kareto turns the complex math of memory management into an automatic, self-adjusting system that saves money and makes AI faster for everyone.