AMV-L: Lifecycle-Managed Agent Memory for Tail-Latency Control in Long-Running LLM Systems

This paper introduces AMV-L, a value-driven memory lifecycle framework for long-running LLM agents. It replaces age-based retention with utility-based tiering to bound retrieval workloads, achieving significantly better tail-latency control and throughput than traditional TTL and LRU policies.

Emmanuel Bamidele

Published 2026-03-06

Here is an explanation of the AMV-L paper in simple language, using everyday analogies.

The Problem: The "Cluttered Garage" Effect

Imagine you have a personal assistant (an AI agent) who helps you with your life. Over time, this assistant collects a massive amount of information: your favorite coffee order, the code for a project you worked on last year, a recipe you tried once, and a thousand random facts you mentioned in passing.

The Current Way (TTL):
Most AI systems today manage this memory like a garage with a strict "expiration date" rule. If you put a box in the garage, it stays there for exactly 30 days. After 30 days, it gets thrown out, no questions asked.

  • The Flaw: This keeps the garage from overflowing, but it doesn't stop the search from getting slow. When you ask, "What's my coffee order?", the assistant has to dig through every single box in the garage that hasn't expired yet to find the right one. If you have 10,000 boxes, that search takes forever. Sometimes, the search is fast; other times, the assistant gets stuck digging through a mountain of irrelevant boxes, causing a massive delay (a "tail latency" spike).

The Result: The assistant is reliable for simple tasks but gets overwhelmed and slow when you ask complex questions after months of use.
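The flaw above can be made concrete with a toy sketch. This is not the paper's implementation; the class and field names are illustrative assumptions. The point it demonstrates is that a TTL bounds how *old* entries can get, but not how *many* live entries a retrieval must scan:

```python
class TTLMemory:
    """Toy TTL store (illustrative, not from the paper): entries expire
    after a fixed age, but every still-live entry is scanned on retrieval."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = []  # list of (timestamp, key, value)

    def add(self, key, value, now):
        self.entries.append((now, key, value))

    def search(self, key, now):
        # Drop expired entries, then linearly scan everything that is left.
        self.entries = [(t, k, v) for (t, k, v) in self.entries
                        if now - t < self.ttl]
        scanned, hit = 0, None
        for t, k, v in self.entries:
            scanned += 1
            if k == key:
                hit = v
        # `scanned` grows with the number of live entries -- this is the
        # "digging through every box" cost that causes tail-latency spikes.
        return hit, scanned

memory = TTLMemory(ttl_seconds=3600)
for i in range(10_000):
    memory.add(f"fact-{i}", i, now=0.0)
memory.add("coffee-order", "flat white", now=0.0)

value, scanned = memory.search("coffee-order", now=1.0)
# All 10,001 live entries were scanned to find one useful memory.
```

Nothing has expired yet at `now=1.0`, so the search touches every box in the garage even though only one matters.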


The Solution: AMV-L (The "Smart Librarian")

The paper introduces AMV-L, a new way to manage memory. Instead of just looking at how old a memory is, AMV-L looks at how useful it is.

Think of AMV-L as a super-intelligent librarian who organizes a library not by the date the book was published, but by how often people actually read it and how much they love it.

How It Works: The Three Shelves

The librarian divides the library into three specific zones (Tiers):

  1. The "Hot" Shelf (Front Desk):

    • This is where the most useful, frequently used items live.
    • When you ask a question, the librarian only looks here first.
    • Why it helps: The search area is tiny and fast. You get an answer instantly.
  2. The "Warm" Shelf (Back Room):

    • These are items that are useful but not needed every day. They are kept safe but aren't on the front desk.
    • The librarian only pulls a few of these out if the "Hot" shelf doesn't have the answer.
  3. The "Cold" Shelf (The Basement):

    • These are old, rarely used items. They are stored away so they don't clutter the main search area.
    • If an item stays in the basement too long without being used, it gets thrown away to save space.
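The three-shelf lookup can be sketched in a few lines. The tier names and the simple hot-then-warm fallback rule are assumptions for illustration, not the paper's exact algorithm; the key property shown is that the default search space is the small Hot tier, and the Cold tier is never searched at all:

```python
def tiered_search(query, hot, warm):
    """Look in the small Hot tier first; fall back to Warm only on a miss.
    The Cold tier is storage only and is never part of the search path."""
    if query in hot:
        return hot[query], "hot"
    if query in warm:
        return warm[query], "warm"
    return None, "miss"

# Illustrative contents (invented examples, not from the paper).
hot = {"coffee-order": "flat white"}
warm = {"last-year-project": "repo: billing-service"}

print(tiered_search("coffee-order", hot, warm))       # fast path: Hot only
print(tiered_search("last-year-project", hot, warm))  # slower fallback: Warm
```

Because the librarian looks at the front desk first, the common-case search cost is bounded by the Hot tier's size, no matter how large the library grows.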

The Magic: "Value" vs. "Age"

In the old system (TTL), a memory is only kept if it's "young."
In the new system (AMV-L), a memory is kept if it has Value.

  • Scenario A: You mention your coffee order every day. The "Value" score goes up. The item stays on the Hot Shelf, even if it's been there for a year.
  • Scenario B: You mention a random fact once, and never again. Its "Value" score slowly drops. It moves from the Hot Shelf to the Warm Shelf, then to the Cold Shelf, and eventually gets deleted.
  • The Benefit: The assistant never wastes time searching through the basement (Cold Shelf) or the back room (Warm Shelf) unless absolutely necessary. It focuses its energy only on the "Hot" items.
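The two scenarios above can be sketched with a simple decay-plus-boost score. The specific formula, decay rate, and tier thresholds here are illustrative assumptions, not the paper's scoring function; they just show how frequent use keeps a memory hot while a one-off mention drifts toward cold:

```python
def update_value(score, accessed, decay=0.9, boost=1.0):
    """Each period, decay the score; add a boost if the memory was used.
    (Assumed exponential-decay scheme, for illustration only.)"""
    return score * decay + (boost if accessed else 0.0)

def assign_tier(score, hot_min=2.0, warm_min=0.5):
    """Map a value score to a shelf (thresholds are invented examples)."""
    if score >= hot_min:
        return "hot"
    if score >= warm_min:
        return "warm"
    return "cold"  # candidates for eventual deletion

# Scenario A: mentioned every day -> score stays high, item stays hot.
score_a = 0.0
for _ in range(30):
    score_a = update_value(score_a, accessed=True)

# Scenario B: mentioned once, then never again -> score decays toward cold.
score_b = update_value(0.0, accessed=True)
for _ in range(30):
    score_b = update_value(score_b, accessed=False)

print(assign_tier(score_a), assign_tier(score_b))  # daily item hot, one-off cold
```

Note how age never appears in the tier decision: a year-old coffee order with daily accesses keeps a high score, while a week-old one-off fact has already decayed.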

The Results: Speed and Stability

The researchers tested this new system against the old "Garage" system and a middle-ground system (LRU, which just keeps the most recently used items).

  1. Speed: The new system was 3 times faster at handling requests than the old system.
  2. No More "Freezing": The old system had moments where it would freeze for 2+ seconds because it was searching through too much junk. The new system almost eliminated these freezes (dropping from 13% of requests being slow to 0.007%).
  3. Smarter Answers: Because the system keeps high-value information (like your coffee order) even if it's old, it doesn't forget important things just because they aren't "fresh."

The Big Takeaway

The paper argues that for AI agents to be truly reliable, we can't just treat their memory like a storage closet where things rot after a set time. We need to treat memory like a resource that needs active management.

By separating what is stored (the whole library) from what is searched (only the Hot Shelf), AMV-L ensures that the AI stays fast and responsive, no matter how long you've been using it. It trades a tiny bit of speed on average requests to completely eliminate the "nightmare" slow requests that ruin the user experience.

In short: AMV-L stops the AI from digging through the whole attic to find a single screw; it keeps the screw right on the workbench where it belongs.