Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

This paper introduces a system for multi-agent LLM inference on edge devices that persists 4-bit quantized KV caches to disk. Restoring a cache directly, instead of re-running prefill, yields up to 136x faster time-to-first-token and fits four times more agent contexts into limited RAM.

Yakov Pyotr Shkolnikov

Published 2026-03-06

The Big Problem: The "Too Many Cooks" Kitchen

Imagine you are running a busy restaurant (your computer) with a very small kitchen counter (your RAM). You have a team of 10 chefs (AI Agents) who need to cook complex meals (generate text).

Each chef has a recipe book (the "KV Cache") that contains everything they've learned so far in the conversation.

  • The Issue: The kitchen counter is too small to hold the recipe books for all 10 chefs at once.
  • The Old Way: When Chef A finishes, you throw their recipe book in the trash to make space for Chef B. When Chef A needs to cook again, you have to re-read the entire book from scratch to remember what they were doing.
    • Result: If the book is thick (long conversation), re-reading it takes 15 seconds. If you have 10 chefs switching back and forth, you spend most of your time just re-reading, not cooking. The customers (users) get angry because the food takes forever.

The Solution: The "Magic Fridge" (Persistent Disk Cache)

This paper proposes a new system: Don't throw the recipe book away. Put it in a Magic Fridge (your SSD hard drive).

  1. Compressing the Book (Q4 Quantization):
    Before putting the book in the fridge, you shrink it down. Imagine taking a 500-page novel and compressing it into a tiny, 125-page pocket guide without losing the story. This is called 4-bit quantization. It makes the recipe book 4 times smaller, so you can fit way more of them in your kitchen.

  2. The Magic Fridge (Disk Persistence):
    Instead of throwing the book away, you save it to the fridge. When Chef A comes back, you don't re-read the whole book. You just pull the pocket guide out of the fridge and hand it to them.

    • Result: Instead of taking 15 seconds to re-read, it takes 0.5 seconds to grab the book from the fridge.
  3. The "Hidden" Wait (Interleaving):
    Here is the clever part. While Chef A is grabbing their book from the fridge (0.5 seconds), Chef B is already cooking their meal. Because the fridge is so fast, Chef A is ready before Chef B even finishes their first bite. The "waiting time" is completely hidden.
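The compress-then-refrigerate cycle above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the group size, and the `.npz` file format are all choices made here for the sketch. It shows group-wise 4-bit quantization (each group of values shares one scale and offset, and two 4-bit codes are packed per byte) plus a disk round trip.

```python
import numpy as np

def quantize_q4(x: np.ndarray, group_size: int = 32):
    """Quantize a float tensor to 4-bit codes (0..15), one scale/offset per group.
    Assumes x.size is divisible by group_size."""
    flat = x.astype(np.float32).reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0
    scale[scale == 0] = 1.0                          # flat groups quantize to code 0
    q = np.round((flat - lo) / scale).astype(np.uint8)
    packed = (q[:, 0::2] << 4) | q[:, 1::2]          # two 4-bit codes per byte
    return packed, scale, lo

def dequantize_q4(packed, scale, lo, shape):
    """Unpack 4-bit codes and map them back to floats."""
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.uint8)
    q[:, 0::2] = packed >> 4
    q[:, 1::2] = packed & 0x0F
    return (q.astype(np.float32) * scale + lo).reshape(shape)

def persist_cache(path, kv):
    """The 'Magic Fridge': compress the KV tensor and save it to disk."""
    packed, scale, lo = quantize_q4(kv)
    np.savez(path, packed=packed, scale=scale, lo=lo, shape=np.array(kv.shape))

def restore_cache(path):
    """Pull the pocket guide back out of the fridge: load and decompress."""
    f = np.load(path)
    return dequantize_q4(f["packed"], f["scale"], f["lo"], tuple(f["shape"]))
```

The packed file stores half a byte per value instead of the two bytes of fp16, which is where the roughly 4x shrink comes from; the per-group scales add a small overhead, and the rounding error per value is at most half of a group's quantization step.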

Why This Matters for Your Phone or Laptop

Most powerful AI servers are huge data centers with massive counters. But this paper is about Edge Devices—like your MacBook, iPhone, or a small laptop.

  • Privacy: Your data stays on your device. No one else sees your recipe books.
  • Cost: You don't need to pay a cloud company to run your AI.
  • Speed: On a standard laptop, switching between 10 different AI conversations used to be agonizingly slow. With this system, it feels instant.

The "Magic" Analogy: The Library vs. The Bookshelf

  • Without this system: Every time you switch topics, you have to walk to the library, find the book, read the first 100 pages to remember the plot, and then continue writing.
  • With this system: You keep a bookmark in your pocket. When you switch topics, you just open the book to the bookmark. It's instant.

What Did They Actually Do? (The Technical Bits Simplified)

  1. The "Block Pool": They built a smart filing cabinet that organizes these compressed recipe books by "Agent ID." It keeps them separate so Chef A's notes don't get mixed with Chef B's.
  2. The "Batched" Kitchen: They figured out how to let multiple chefs cook at the same time using the same stove, even though the stove (the computer chip) is small.
  3. The "Cross-Phase" Memory: If a conversation has different "phases" (e.g., Phase 1: Planning, Phase 2: Execution), the system remembers the planning phase without making you re-read it. It just adds the new "Execution" notes to the existing file.
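The "Block Pool" and "Cross-Phase" ideas can be sketched with a toy data structure. Everything here is hypothetical (the class name, the block size, and plain token lists standing in for compressed KV blocks); the point is just the two behaviors the paper describes: blocks stay grouped per agent ID, and a new phase appends to the existing cache rather than rebuilding it.

```python
from collections import defaultdict

class BlockPool:
    """Toy per-agent filing cabinet for cached context blocks."""

    def __init__(self, block_tokens: int = 16):
        self.block_tokens = block_tokens
        self.blocks = defaultdict(list)   # agent_id -> list of cache blocks

    def append(self, agent_id: str, tokens: list):
        """Add new tokens (e.g. an 'Execution' phase) after the agent's
        existing cached context, split into fixed-size blocks."""
        for i in range(0, len(tokens), self.block_tokens):
            self.blocks[agent_id].append(tokens[i:i + self.block_tokens])

    def restore(self, agent_id: str) -> list:
        """Return the agent's full cached context without re-prefilling it."""
        return [t for block in self.blocks[agent_id] for t in block]
```

One agent's blocks never mix with another's because the pool is keyed by agent ID, and a second `append` call (the next phase) simply extends the same list.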

The Results: How Much Faster?

The researchers tested this on three different types of AI models (Gemma, DeepSeek, and Llama) on an Apple M4 Pro chip.

  • The "Cold Start" (No cache): about 15 seconds to start a conversation.
  • The "Warm Start" (With this system): about 0.5 seconds.
  • The Speedup: In some cases, it was 136 times faster.
  • Capacity: They could fit 4 times more active conversations in the same amount of memory.
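The headline numbers are easy to sanity-check from the figures in this section. Note that the 15 s vs. 0.5 s example works out to 30x; the 136x figure is the paper's reported best case, and the 4x capacity gain follows directly from storing 4-bit values where 16-bit ones used to go.

```python
cold_ttft_s = 15.0   # re-run prefill over the whole conversation (example figure)
warm_ttft_s = 0.5    # restore the compressed cache from the SSD instead

print(f"speedup for this example: {cold_ttft_s / warm_ttft_s:.0f}x")

# Capacity: 4-bit entries in the RAM budget that used to hold 16-bit entries
print(f"capacity gain: {16 // 4}x")
```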

The Trade-off: Is the Food Still Good?

When you shrink a book (quantization), does the story change?

  • The Test: They checked if the AI made mistakes or sounded "dumb" after using the compressed books.
  • The Verdict: Almost perfect. The quality dropped by less than 3% (which is barely noticeable to humans). The AI still sounds smart, but it's much faster and fits on your laptop.
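As a back-of-the-envelope illustration (not the paper's evaluation), you can measure how much a 4-bit round trip perturbs individual values. A caveat: the per-value error this prints is much larger than 3%, and that is expected; the paper's sub-3% figure is about end-task quality, which tolerates small per-value noise, not about raw numerical error.

```python
import numpy as np

# Toy check: quantize Gaussian-looking values to 16 levels and measure the damage.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

lo, hi = x.min(), x.max()
scale = (hi - lo) / 15.0
q = np.round((x - lo) / scale)     # 4-bit codes, 0..15
x_hat = q * scale + lo             # reconstructed values

rel_err = np.abs(x_hat - x).mean() / np.abs(x).mean()
print(f"mean relative error per value: {rel_err:.1%}")
```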

Summary

This paper is about teaching your computer to remember things efficiently without needing a supercomputer.

By saving AI "memories" to the hard drive in a compressed format, they turned a slow, painful process of re-learning into a fast, instant retrieval. It's like upgrading from a library where you have to re-read every book from page one, to a library where you just pull a bookmark off the shelf and keep going.

The Bottom Line: You can now run complex, multi-agent AI workflows on your personal laptop with the speed of a data center, keeping your data private and your wallet happy.
