Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

This paper introduces a structured distillation method that compresses personalized agent conversation history into compact, four-field summaries averaging 38 tokens per exchange, achieving an 11x reduction in token usage while preserving retrieval quality comparable to or exceeding verbatim baselines across thousands of software engineering conversations.

Sydney Lewis

Published 2026-03-16

Imagine you are talking to a very smart, but very forgetful, digital assistant. You've been working together for months, solving complex coding problems, debugging errors, and planning projects. You remember the big picture: "Oh right, we fixed that weird database timeout issue three weeks ago."

But your assistant? It has amnesia. Every time you start a new chat, it's like meeting for the first time. It doesn't know about the database issue unless you tell it again.

The problem is, if you try to feed your assistant all your past conversations to remind it of what happened, you run out of space. It's like trying to carry a 500-page encyclopedia in your pocket just to remember one phone number. It's too heavy, too expensive, and too slow.

This paper proposes a clever solution: Structured Distillation. Think of it as turning your massive, messy history into a set of highly organized, tiny index cards.

The Problem: The "Encyclopedia" vs. The "Index Card"

In the old way, to remember a past conversation, you'd have to load the entire transcript (the "encyclopedia") into the assistant's brain.

  • The Cost: A single conversation exchange might be 371 words long. If you have 1,000 exchanges, that's nearly 400,000 words. That's too much for the assistant to hold at once.
  • The Old Fix (Summarization): Usually, people try to summarize these chats. But generic summarization is like a bad photocopier that keeps making copies of copies. The more you copy, the blurrier the image gets. Important details (like specific error codes or file names) get lost in the blur.

The Solution: The "Memory Palace"

The authors created a system that doesn't just summarize; it distills. They take every conversation and break it down into a structured "compound object" with four specific parts, like a filing system:

  1. The Core (The "What"): A 1-2 sentence summary of what was actually accomplished. Analogy: The title of a book chapter.
  2. The Context (The "Details"): The specific technical details, like error messages or parameter names. Analogy: The specific ingredients in a recipe.
  3. The Rooms (The "Where"): Thematic tags that sort the conversation into "rooms" (e.g., "Database," "Security," "Deployment"). Analogy: Walking into a specific room in a house to find what you need.
  4. The Files (The "Evidence"): A list of actual file paths touched during the chat. Analogy: The specific pages in a manual you opened.
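The four-part "compound object" above can be sketched as a simple data structure. This is an illustrative sketch only; the field names and example values are ours, not the paper's exact schema:

```python
from dataclasses import dataclass

@dataclass
class DistilledMemory:
    """One conversation, distilled into the four fields described above."""
    core: str            # the "What": 1-2 sentence summary of what was accomplished
    context: str         # the "Details": error messages, parameter names, etc.
    rooms: list[str]     # the "Where": thematic tags like "Database" or "Security"
    files: list[str]     # the "Evidence": file paths touched during the chat

# A hypothetical distilled exchange (values invented for illustration):
memory = DistilledMemory(
    core="Fixed a database timeout by raising the connection pool limit.",
    context="TimeoutError after 30s; pool_size raised from 5 to 20.",
    rooms=["Database", "Performance"],
    files=["app/db/pool.py", "config/settings.yaml"],
)
```

The whole object is a few dozen tokens, yet each field serves a different lookup path: tags for browsing, details for matching, file paths for grounding.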

The Magic: They compress a typical 371-word exchange down to roughly 38 words, about an 11x reduction in tokens. It's like shrinking a whole novel down to a single sticky note, but keeping the sticky note organized so you can find the right one instantly.

The Test: Does the "Sticky Note" Work?

The big question was: If we throw away 90% of the text, will the assistant still find the right memory when you ask a question?

They tested this with 201 questions (like "How did we fix the login bug?") against thousands of conversations. They used two ways to search:

  1. Keyword Search (BM25): Looking for exact words.
  2. Semantic Search (Vector Search): Looking for meaning and concepts, even if the exact words are different.
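To make the keyword side concrete, here is a minimal, self-contained BM25 scorer. This is a toy illustration of the idea, not the paper's retrieval stack (real systems use tuned libraries over large corpora):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Minimal BM25: rank documents by exact-token overlap with the query.

    If a query term never appears in a document, that document gets no credit
    for it -- which is exactly why keyword search struggles when distillation
    strips out the original jargon.
    """
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)               # term frequency in this doc
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue                 # no exact match, no score
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

# Hypothetical mini-corpus: a full transcript snippet vs. an unrelated one.
docs = ["connection timeout error in the db pool",
        "changed the ui button color"]
scores = bm25_scores("timeout", docs)    # only the first doc contains "timeout"
```

Semantic (vector) search replaces this exact-token matching with similarity between embedding vectors, so paraphrases still match.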

The Results:

  • Keyword Search: When they tried to search the tiny "sticky notes" using exact keywords, it failed. Why? Because the distillation process removed the specific jargon needed for keyword matching. It's like trying to find a book by searching for a word that was removed from the index card.
  • Semantic Search: When they used "meaning-based" search, the tiny sticky notes worked almost perfectly (96% as good as the full text). The assistant could understand the concept of the conversation even without the full text.
  • The Hybrid Winner: The absolute best result came from a "two-layer" approach. You run semantic search over the tiny sticky notes to quickly find the right conversation, and then you pull up the full, original transcript to read the details.
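The two-layer approach above can be sketched in a few lines. Note the hedges: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the data is invented for illustration:

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def two_layer_retrieve(query: str, summaries: dict, archive: dict):
    """Layer 1: rank the compact summaries. Layer 2: return the full transcript."""
    q = toy_embed(query)
    best_id = max(summaries, key=lambda cid: cosine(q, toy_embed(summaries[cid])))
    return best_id, archive[best_id]

# Hypothetical memory store: tiny summaries searched, full transcripts archived.
summaries = {
    "c1": "Fixed database timeout by raising pool size",
    "c2": "Refactored the login form validation",
}
archive = {
    "c1": "FULL TRANSCRIPT 1: ...hundreds of words of debugging detail...",
    "c2": "FULL TRANSCRIPT 2: ...hundreds of words about the login form...",
}
best_id, full_text = two_layer_retrieve(
    "how did we fix the database timeout", summaries, archive)
```

The key design choice: the expensive, token-heavy transcripts never enter the search index; they sit in the archive and are fetched only after the cheap summary layer has pointed the way.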

The Real-World Impact

This isn't just about saving space; it's about personalized memory.

  • Before: Your assistant starts every day with a blank slate.
  • After: Your assistant carries a compact "memory palace" in its pocket. It can instantly recall, "Ah, in the 'Database' room, we discussed the timeout issue." It then pulls up the full, original conversation for you to read.

The Takeaway

You don't need to carry the whole library to find a specific book. You just need a really good, organized card catalog.

This paper proves that by turning messy conversations into structured, tiny summaries, we can give AI agents a long-term memory that fits in their pocket, without losing the ability to find exactly what they need. The original text isn't deleted; it's just hidden away in the "archive," ready to be pulled out when the tiny index card points the way.
