Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

This paper introduces a structured distillation method that compresses personalized agent conversation history into compact, four-field summaries averaging 38 tokens per exchange, achieving an 11x reduction in token usage while preserving retrieval quality comparable to or exceeding verbatim baselines across thousands of software engineering conversations.

Sydney Lewis

Published 2026-03-16

Imagine you are talking to a very smart, but very forgetful, digital assistant. You've been working together for months, solving complex coding problems, debugging errors, and planning projects. You remember the big picture: "Oh right, we fixed that weird database timeout issue three weeks ago."

But your assistant? It has amnesia. Every time you start a new chat, it's like meeting for the first time. It doesn't know about the database issue unless you tell it again.

The problem is, if you try to feed your assistant all your past conversations to remind it of what happened, you run out of space. It's like trying to carry a 500-page encyclopedia in your pocket just to remember one phone number. It's too heavy, too expensive, and too slow.

This paper proposes a clever solution: Structured Distillation. Think of it as turning your massive, messy history into a set of highly organized, tiny index cards.

The Problem: The "Encyclopedia" vs. The "Index Card"

In the old way, to remember a past conversation, you'd have to load the entire transcript (the "encyclopedia") into the assistant's brain.

  • The Cost: A single conversation exchange might be 371 words long. If you have 1,000 exchanges, that's nearly 400,000 words. That's too much for the assistant to hold at once.
  • The Old Fix (Summarization): Usually, people try to summarize these chats. But generic summarization is like a bad photocopier that keeps making copies of copies. The more you copy, the blurrier the image gets. Important details (like specific error codes or file names) get lost in the blur.

The Solution: The "Memory Palace"

The authors created a system that doesn't just summarize; it distills. They take every conversation and break it down into a structured "compound object" with four specific parts, like a filing system:

  1. The Core (The "What"): A 1-2 sentence summary of what was actually accomplished. Analogy: The title of a book chapter.
  2. The Context (The "Details"): The specific technical details, like error messages or parameter names. Analogy: The specific ingredients in a recipe.
  3. The Rooms (The "Where"): Thematic tags that sort the conversation into "rooms" (e.g., "Database," "Security," "Deployment"). Analogy: Walking into a specific room in a house to find what you need.
  4. The Files (The "Evidence"): A list of actual file paths touched during the chat. Analogy: The specific pages in a manual you opened.
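The four-part "compound object" above can be sketched as a simple data structure. This is an illustrative sketch only; the field names and example values are ours, not the paper's exact schema:

```python
from dataclasses import dataclass

@dataclass
class DistilledMemory:
    """One conversation, distilled into the four fields described above."""
    core: str            # the "What": 1-2 sentence summary of what was accomplished
    context: str         # the "Details": error messages, parameter names, etc.
    rooms: list[str]     # the "Where": thematic tags like "Database" or "Security"
    files: list[str]     # the "Evidence": file paths touched during the chat

# A hypothetical distilled exchange (values invented for illustration):
memory = DistilledMemory(
    core="Fixed a database timeout by raising the connection pool limit.",
    context="TimeoutError after 30s; pool_size raised from 5 to 20.",
    rooms=["Database", "Performance"],
    files=["app/db/pool.py", "config/settings.yaml"],
)
```

The whole object is a few dozen tokens, yet each field serves a different lookup path: tags for browsing, details for matching, file paths for grounding.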

The Magic: They compress a typical 371-word exchange down to roughly 38 words, about an 11x reduction in tokens. It's like shrinking a whole novel down to a single sticky note, but keeping the sticky note organized so you can find the right one instantly.

The Test: Does the "Sticky Note" Work?

The big question was: If we throw away 90% of the text, will the assistant still find the right memory when you ask a question?

They tested this with 201 questions (like "How did we fix the login bug?") against thousands of conversations. They used two ways to search:

  1. Keyword Search (BM25): Looking for exact words.
  2. Semantic Search (Vector Search): Looking for meaning and concepts, even if the exact words are different.
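To make the keyword side concrete, here is a minimal, self-contained BM25 scorer. This is a toy illustration of the idea, not the paper's retrieval stack (real systems use tuned libraries over large corpora):

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Minimal BM25: rank documents by exact-token overlap with the query.

    If a query term never appears in a document, that document gets no credit
    for it -- which is exactly why keyword search struggles when distillation
    strips out the original jargon.
    """
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)               # term frequency in this doc
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue                 # no exact match, no score
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

# Hypothetical mini-corpus: a full transcript snippet vs. an unrelated one.
docs = ["connection timeout error in the db pool",
        "changed the ui button color"]
scores = bm25_scores("timeout", docs)    # only the first doc contains "timeout"
```

Semantic (vector) search replaces this exact-token matching with similarity between embedding vectors, so paraphrases still match.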

The Results:

  • Keyword Search: When they tried to search the tiny "sticky notes" using exact keywords, it failed. Why? Because the distillation process removed the specific jargon needed for keyword matching. It's like trying to find a book by searching for a word that was removed from the index card.
  • Semantic Search: When they used "meaning-based" search, the tiny sticky notes worked almost perfectly (96% as good as the full text). The assistant could understand the concept of the conversation even without the full text.
  • The Hybrid Winner: The absolute best result came from a "two-layer" approach. You run semantic search over the tiny sticky notes to quickly find the right conversation, and then you pull up the full, original transcript to read the details.
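The two-layer approach above can be sketched in a few lines. Note the hedges: `toy_embed` is a bag-of-words stand-in for a real embedding model, and the data is invented for illustration:

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def two_layer_retrieve(query: str, summaries: dict, archive: dict):
    """Layer 1: rank the compact summaries. Layer 2: return the full transcript."""
    q = toy_embed(query)
    best_id = max(summaries, key=lambda cid: cosine(q, toy_embed(summaries[cid])))
    return best_id, archive[best_id]

# Hypothetical memory store: tiny summaries searched, full transcripts archived.
summaries = {
    "c1": "Fixed database timeout by raising pool size",
    "c2": "Refactored the login form validation",
}
archive = {
    "c1": "FULL TRANSCRIPT 1: ...hundreds of words of debugging detail...",
    "c2": "FULL TRANSCRIPT 2: ...hundreds of words about the login form...",
}
best_id, full_text = two_layer_retrieve(
    "how did we fix the database timeout", summaries, archive)
```

The key design choice: the expensive, token-heavy transcripts never enter the search index; they sit in the archive and are fetched only after the cheap summary layer has pointed the way.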

The Real-World Impact

This isn't just about saving space; it's about personalized memory.

  • Before: Your assistant starts every day with a blank slate.
  • After: Your assistant carries a compact "memory palace" in its pocket. It can instantly recall, "Ah, in the 'Database' room, we discussed the timeout issue." It then pulls up the full, original conversation for you to read.

The Takeaway

You don't need to carry the whole library to find a specific book. You just need a really good, organized card catalog.

This paper proves that by turning messy conversations into structured, tiny summaries, we can give AI agents a long-term memory that fits in their pocket, without losing the ability to find exactly what they need. The original text isn't deleted; it's just hidden away in the "archive," ready to be pulled out when the tiny index card points the way.
