Key-Value Means

The paper introduces Key-Value Means (KVM), a block-recurrence mechanism for attention that aims to combine the strengths of transformers and linear RNNs. It supports efficient, chunk-parallelizable training, flexible state growth, and subquadratic prefill time, while using only standard operations and minimal additional parameters.

Original authors: Daniel Goldstein, Eugene Cheah

Published 2026-05-12 (author reviewed)

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

Imagine you are trying to read a very long book, but your brain has a limited amount of "working memory" to hold the story in your head while you read.

The Problem with Current AI
Current AI models (Transformers) act like a student who tries to remember every single word they have ever read in the book.

  • The Good: They are incredibly accurate because they have the whole story in front of them.
  • The Bad: As the book gets longer, their "working memory" grows huge, and the effort grows even faster. Reading a 100-page book takes little effort, but a 1,000-page book is ten times longer and takes roughly a hundred times the work, because every new word is compared with every word before it. It's like trying to carry a backpack that gets heavier with every step you take.
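To put rough numbers on the backpack: in standard causal self-attention, each token is compared with itself and every earlier token, so the number of comparisons grows with the square of the length. The sketch below only counts those comparisons; the function name is made up for illustration and the token counts are arbitrary.

```python
def causal_attention_pairs(num_tokens: int) -> int:
    """Count query-key comparisons when each token attends to itself and all earlier tokens."""
    return num_tokens * (num_tokens + 1) // 2


short_pairs = causal_attention_pairs(1_000)
long_pairs = causal_attention_pairs(10_000)

# Ten times more tokens costs close to a hundred times more comparisons.
print(short_pairs, long_pairs, long_pairs / short_pairs)
```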

The Problem with Recurrent (RNN-style) Models
RNN-style models take a different approach: they keep a small, fixed-size summary of what they have read so far and update it as they go.

  • The Good: They are super fast and light. Their backpack never gets heavier, no matter how long the book is.
  • The Bad: They have a hard time remembering the beginning of the story. If you ask them about a plot point from page 10, they might not remember it, because everything they have read is squeezed into the same small summary and older details tend to get crowded out by newer ones.

The New Solution: Key-Value Means (KVM)
The authors of this paper introduce a new method called Key-Value Means (KVM). Think of KVM as a smart, magical notebook that combines the best of both worlds.

Here is how it works using a simple analogy:

1. The "Sliding Window" (The Immediate Context)

Imagine you are reading a book, and you have a magnifying glass that only lets you see the last few pages clearly. This is the "Sliding Window." KVM pays perfect attention to the most recent words, just like a standard AI does. This ensures it doesn't miss the immediate context.
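As a rough illustration of the magnifying glass, the sketch below runs ordinary softmax attention but only over the last `window` key/value pairs. It is a generic sliding-window attention in plain Python, not the paper's implementation; the function name, the toy vectors, and the window size are all assumptions made for this example.

```python
import math


def sliding_window_attention(query, keys, values, window):
    """Softmax attention over only the most recent `window` key/value pairs."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent_keys = keys[-window:]
    recent_values = values[-window:]

    # Scaled dot-product scores against the recent keys only.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in recent_keys]

    # Numerically stable softmax weights.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)

    # Weighted average of the recent values.
    dim = len(recent_values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, recent_values)) / total
        for d in range(dim)
    ]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]

# With window=2 the oldest entry (value 10.0) is invisible to the query.
output = sliding_window_attention([0.0, 1.0], keys, values, window=2)
print(output)
```

Whatever falls outside the window simply does not take part in this computation, which is why a separate long-term memory is needed for older pages.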

2. The "Compressed Summary" (The Long-Term Memory)

As you read past those few pages, the old pages slide out of your magnifying glass. Instead of letting their details fade away (like the RNN-style models) or trying to carry the whole book (like Transformers), KVM does something clever:

  • It looks at the pages that just slid out.
  • It asks: "Which of these pages are the most important or unique?"
  • It writes a short, compressed summary of those important pages into a special notebook.
  • If a new page comes along that is very similar to what's already in the notebook, it just updates the existing note. If it's something totally new and surprising, it adds a fresh line to the notebook.
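The "update an existing note or add a fresh line" behavior can be sketched as follows. This is a simplified illustration rather than the paper's exact rule: the use of cosine similarity, the `threshold` value of 0.9, and the dictionary layout of a notebook entry are all assumptions chosen for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_notebook(notebook, key, value, threshold=0.9):
    """Merge (key, value) into the most similar entry, or append a new entry."""
    best_index, best_sim = -1, -1.0
    for i, entry in enumerate(notebook):
        sim = cosine(entry["key"], key)
        if sim > best_sim:
            best_index, best_sim = i, sim

    if best_index >= 0 and best_sim >= threshold:
        # Similar enough: fold the new page into the existing note as a running mean.
        entry = notebook[best_index]
        n = entry["count"]
        entry["key"] = [(old * n + new) / (n + 1) for old, new in zip(entry["key"], key)]
        entry["value"] = [(old * n + new) / (n + 1) for old, new in zip(entry["value"], value)]
        entry["count"] = n + 1
    else:
        # Nothing similar in the notebook yet: add a fresh line.
        notebook.append({"key": list(key), "value": list(value), "count": 1})
    return notebook


notebook = []
update_notebook(notebook, [1.0, 0.0], [2.0])    # first page: new line
update_notebook(notebook, [0.99, 0.05], [4.0])  # very similar: merged into line 0
update_notebook(notebook, [0.0, 1.0], [9.0])    # unrelated: new line
print(len(notebook), notebook[0]["count"], notebook[0]["value"])
```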

3. The "Smart Merging" (The Magic Trick)

The paper describes a specific way of merging information called a "Winner-Take-All" rule.

  • Imagine you have a bucket of water (the new information) and a sponge (the notebook).
  • Instead of just dumping the water in, KVM finds the exact spot in the sponge that matches the water best and absorbs it there.
  • It also uses a "Just-in-Time" normalization. Instead of recalculating the average every time a new drop of water is added, KVM keeps unnormalized running totals (raw sums and counts) while it writes into the notebook, and divides through to get the proper average only when the notebook is actually read.
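A minimal sketch of these two ideas together, under stated assumptions: each slot stores raw sums and a count, a write updates only the single best-matching slot, and the division happens only in `read`. The class name `LazyMeanSlots`, the seeding of each slot with one entry, and the choice of cosine similarity for matching are hypothetical details for this example, not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LazyMeanSlots:
    """Slots that store raw sums and counts; means are computed only on read."""

    def __init__(self, seed_keys, seed_values):
        # Assumption for this sketch: every slot starts from one seed entry.
        self.key_sums = [list(k) for k in seed_keys]
        self.value_sums = [list(v) for v in seed_values]
        self.counts = [1] * len(seed_keys)

    def write(self, key, value):
        # Winner-take-all: only the best-matching slot is updated.
        # Cosine similarity ignores scale, so raw sums can be compared directly.
        scores = [cosine(key_sum, key) for key_sum in self.key_sums]
        winner = scores.index(max(scores))
        self.key_sums[winner] = [s + x for s, x in zip(self.key_sums[winner], key)]
        self.value_sums[winner] = [s + x for s, x in zip(self.value_sums[winner], value)]
        self.counts[winner] += 1
        return winner

    def read(self):
        # Just-in-time normalization: divide sums by counts only at read time.
        keys = [[s / n for s in ks] for ks, n in zip(self.key_sums, self.counts)]
        values = [[s / n for s in vs] for vs, n in zip(self.value_sums, self.counts)]
        return keys, values


slots = LazyMeanSlots(seed_keys=[[1.0, 0.0], [0.0, 1.0]], seed_values=[[10.0], [20.0]])
first = slots.write([0.9, 0.1], [14.0])   # best match is slot 0
second = slots.write([0.2, 0.8], [26.0])  # best match is slot 1
mean_keys, mean_values = slots.read()
print(first, second, mean_values)
```

Each write here is a handful of additions, and no slot is renormalized until something actually reads the notebook.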

Why This Matters

  • Flexible Size: You can tell KVM to keep a tiny notebook (fixed size) for speed, or let the notebook grow as the book gets longer (expandable size).
  • Speed vs. Memory: It allows you to choose a middle ground. You don't have to choose between "super fast but forgetful" or "super smart but slow." You can tune it to be fast enough for real-time use but smart enough to remember the whole story.
  • No Custom Hardware: Unlike some other new methods that require special, expensive computer chips to run, KVM can run on standard computers using normal software operations.
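To make the size trade-off concrete, the sketch below counts how many entries each approach would keep in memory after reading a given number of tokens. The window size, the notebook size, and especially the square-root growth rule for the expandable notebook are arbitrary choices for illustration; the article does not state the paper's actual growth schedule.

```python
import math


def full_attention_entries(num_tokens):
    """Full attention keeps one entry per token read."""
    return num_tokens


def fixed_notebook_entries(num_tokens, window, notebook_size):
    """Sliding window plus a notebook capped at `notebook_size` entries."""
    return min(num_tokens, window + notebook_size)


def growing_notebook_entries(num_tokens, window):
    """Sliding window plus a notebook that grows with the evicted tokens.

    Assumption: the notebook holds about sqrt(evicted tokens) entries.
    This rule is hypothetical and chosen only to show slow growth.
    """
    evicted = max(0, num_tokens - window)
    return min(num_tokens, window) + math.isqrt(evicted)


for n in (10_000, 1_000_000):
    print(
        n,
        full_attention_entries(n),
        fixed_notebook_entries(n, window=256, notebook_size=64),
        growing_notebook_entries(n, window=256),
    )
```

Under these assumptions the fixed notebook stays the same size no matter how long the input is, the growing notebook gets larger but far more slowly than the input, and full attention grows in step with it.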

The Results

The authors tested this on language models (AI that reads and writes text).

  • Short contexts: KVM matched the performance of the best standard models.
  • Long contexts: When the input grew to thousands of tokens, the expandable variant of KVM remembered details far better than RNN-style fixed-memory models while being much faster than full-attention transformers.
  • Needle-in-a-haystack retrieval: The expandable variant could correctly find a specific fact buried deep in a very long input, showing the compressed notebook genuinely preserves earlier information.

In short, KVM is a new way for AI to read long books without getting tired, without forgetting the beginning, and without needing a backpack that gets infinitely heavy. It does this by keeping a clear view of the present while maintaining a smart, compressed summary of the past.
