Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

This paper introduces a system for multi-agent LLM inference on edge devices that persists 4-bit quantized KV caches to disk. Restoring a cache directly, instead of re-running prefill, yields up to 136x faster time-to-first-token and fits four times more agent contexts into limited RAM.

Yakov Pyotr Shkolnikov

Published 2026-03-06

The Big Problem: The "Too Many Cooks" Kitchen

Imagine you are running a busy restaurant (your computer) with a very small kitchen counter (your RAM). You have a team of 10 chefs (AI Agents) who need to cook complex meals (generate text).

Each chef has a recipe book (the "KV Cache") that contains everything they've learned so far in the conversation.

  • The Issue: The kitchen counter is too small to hold the recipe books for all 10 chefs at once.
  • The Old Way: When Chef A finishes, you throw their recipe book in the trash to make space for Chef B. When Chef A needs to cook again, you have to re-read the entire book from scratch to remember what they were doing.
    • Result: If the book is thick (long conversation), re-reading it takes 15 seconds. If you have 10 chefs switching back and forth, you spend most of your time just re-reading, not cooking. The customers (users) get angry because the food takes forever.

The Solution: The "Magic Fridge" (Persistent Disk Cache)

This paper proposes a new system: Don't throw the recipe book away. Put it in a Magic Fridge (your SSD hard drive).

  1. Compressing the Book (Q4 Quantization):
    Before putting the book in the fridge, you shrink it down. Imagine taking a 500-page novel and compressing it into a tiny, 125-page pocket guide without losing the story. This is called 4-bit quantization. It makes the recipe book 4 times smaller, so you can fit way more of them in your kitchen.

  2. The Magic Fridge (Disk Persistence):
    Instead of throwing the book away, you save it to the fridge. When Chef A comes back, you don't re-read the whole book. You just pull the pocket guide out of the fridge and hand it to them.

    • Result: Instead of taking 15 seconds to re-read, it takes 0.5 seconds to grab the book from the fridge.
  3. The "Hidden" Wait (Interleaving):
    Here is the clever part. While Chef A is grabbing their book from the fridge (0.5 seconds), Chef B is already cooking their meal. Because the fridge is so fast, Chef A is ready before Chef B even finishes their first bite. The "waiting time" is completely hidden.
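The compress-then-refrigerate cycle above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the group size, and the `.npz` file format are all choices made here for the sketch. It shows group-wise 4-bit quantization (each group of values shares one scale and offset, and two 4-bit codes are packed per byte) plus a disk round trip.

```python
import numpy as np

def quantize_q4(x: np.ndarray, group_size: int = 32):
    """Quantize a float tensor to 4-bit codes (0..15), one scale/offset per group.
    Assumes x.size is divisible by group_size."""
    flat = x.astype(np.float32).reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale[scale == 0] = 1.0                          # flat groups quantize to code 0
    q = np.round((flat - lo) / scale).astype(np.uint8)
    packed = (q[:, 0::2] << 4) | q[:, 1::2]          # two 4-bit codes per byte
    return packed, scale, lo

def dequantize_q4(packed, scale, lo, shape):
    """Unpack 4-bit codes and map them back to floats."""
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    q[:, 0::2] = packed >> 4
    q[:, 1::2] = packed & 0x0F
    return (q.astype(np.float32) * scale + lo).reshape(shape)

def persist_cache(path, kv):
    """The 'Magic Fridge': compress the KV tensor and save it to disk."""
    packed, scale, lo = quantize_q4(kv)
    np.savez(path, packed=packed, scale=scale, lo=lo, shape=np.array(kv.shape))

def restore_cache(path):
    """Pull the pocket guide back out of the fridge: load and decompress."""
    f = np.load(path)
    return dequantize_q4(f["packed"], f["scale"], f["lo"], tuple(f["shape"]))
```

The packed file stores half a byte per value instead of the two bytes of fp16, which is where the roughly 4x shrink comes from; the per-group scales add a small overhead, and the rounding error per value is at most half of a group's quantization step.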

Why This Matters for Your Phone or Laptop

Most powerful AI servers are huge data centers with massive counters. But this paper is about Edge Devices—like your MacBook, iPhone, or a small laptop.

  • Privacy: Your data stays on your device. No one else sees your recipe books.
  • Cost: You don't need to pay a cloud company to run your AI.
  • Speed: On a standard laptop, switching between 10 different AI conversations used to be agonizingly slow. With this system, it feels instant.

The "Magic" Analogy: The Library vs. The Bookshelf

  • Without this system: Every time you switch topics, you have to walk to the library, find the book, read the first 100 pages to remember the plot, and then continue writing.
  • With this system: You keep a bookmark in your pocket. When you switch topics, you just open the book to the bookmark. It's instant.

What Did They Actually Do? (The Technical Bits Simplified)

  1. The "Block Pool": They built a smart filing cabinet that organizes these compressed recipe books by "Agent ID." It keeps them separate so Chef A's notes don't get mixed with Chef B's.
  2. The "Batched" Kitchen: They figured out how to let multiple chefs cook at the same time using the same stove, even though the stove (the computer chip) is small.
  3. The "Cross-Phase" Memory: If a conversation has different "phases" (e.g., Phase 1: Planning, Phase 2: Execution), the system remembers the planning phase without making you re-read it. It just adds the new "Execution" notes to the existing file.
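The "Block Pool" and "Cross-Phase" ideas can be sketched with a toy data structure. Everything here is hypothetical (the class name, the block size, and plain token lists standing in for compressed KV blocks); the point is just the two behaviors the paper describes: blocks stay grouped per agent ID, and a new phase appends to the existing cache rather than rebuilding it.

```python
from collections import defaultdict

class BlockPool:
    """Toy per-agent filing cabinet for cached context blocks."""

    def __init__(self, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.blocks = defaultdict(list)   # agent_id -> list of cache blocks

    def append(self, agent_id: str, tokens: list):
        """Add new tokens (e.g. an 'Execution' phase) after the agent's
        existing cached context, split into fixed-size blocks."""
        for i in range(0, len(tokens), self.block_tokens):
            self.blocks[agent_id].append(tokens[i:i + self.block_tokens])

    def restore(self, agent_id: str) -> list:
        """Return the agent's full cached context without re-prefilling it."""
        return [t for block in self.blocks[agent_id] for t in block]
```

One agent's blocks never mix with another's because the pool is keyed by agent ID, and a second `append` call (the next phase) simply extends the same list.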

The Results: How Much Faster?

The researchers tested this on three different types of AI models (Gemma, DeepSeek, and Llama) on an Apple M4 Pro chip.

  • The "Cold Start" (No cache): about 15 seconds to start a conversation.
  • The "Warm Start" (With this system): about 0.5 seconds.
  • The Speedup: In some cases, it was 136 times faster.
  • Capacity: They could fit 4 times more active conversations in the same amount of memory.
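The headline numbers are easy to sanity-check from the figures in this section. Note that the 15 s vs. 0.5 s example works out to 30x; the 136x figure is the paper's reported best case, and the 4x capacity gain follows directly from storing 4-bit values where 16-bit ones used to go.

```python
cold_ttft_s = 15.0   # re-run prefill over the whole conversation (example figure)
warm_ttft_s = 0.5    # restore the compressed cache from the SSD instead

print(f"speedup for this example: {cold_ttft_s / warm_ttft_s:.0f}x")

# Capacity: 4-bit entries in the RAM budget that used to hold 16-bit entries
print(f"capacity gain: {16 // 4}x")
```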

The Trade-off: Is the Food Still Good?

When you shrink a book (quantization), does the story change?

  • The Test: They checked if the AI made mistakes or sounded "dumb" after using the compressed books.
  • The Verdict: Almost perfect. The quality dropped by less than 3% (which is barely noticeable to humans). The AI still sounds smart, but it's much faster and fits on your laptop.
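As a back-of-the-envelope illustration (not the paper's evaluation), you can measure how much a 4-bit round trip perturbs individual values. A caveat: the per-value error this prints is much larger than 3%, and that is expected; the paper's sub-3% figure is about end-task quality, which tolerates small per-value noise, not about raw numerical error.

```python
import numpy as np

# Toy check: quantize Gaussian-looking values to 16 levels and measure the damage.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

lo, hi = x.min(), x.max()
scale = (hi - lo) / 15.0
q = np.round((x - lo) / scale)     # 4-bit codes, 0..15
x_hat = q * scale + lo             # reconstructed values

rel_err = np.abs(x_hat - x).mean() / np.abs(x).mean()
print(f"mean relative error per value: {rel_err:.1%}")
```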

Summary

This paper is about teaching your computer to remember things efficiently without needing a supercomputer.

By saving AI "memories" to the hard drive in a compressed format, they turned a slow, painful process of re-learning into a fast, instant retrieval. It's like upgrading from a library where you have to re-read every book from page one, to a library where you just pull a bookmark off the shelf and keep going.

The Bottom Line: You can now run complex, multi-agent AI workflows on your personal laptop with the speed of a data center, keeping your data private and your wallet happy.
