Imagine a large language model (LLM) trying to write a story, sentence by sentence. To do this, it has to remember everything it has written so far.
In the standard way these models work, every time it remembers a word, it writes down a giant, detailed dossier about that word. This dossier includes:
- Who the word is (its identity).
- What it's doing (its grammar).
- Where it fits (its position).
- How it relates to other words (its context).
The problem? These dossiers are huge. If you have a long story (a long "context"), the memory required to store all of them becomes so big that it slows the computer down or runs out of space. This store of dossiers is called the KV Cache (Key-Value Cache).
The Big Idea: "Thin Keys, Full Values"
The authors of this paper realized that the standard way of writing these dossiers is wasteful. They noticed that the model actually does two very different jobs when it looks back at its memory:
- The "Search" Job (Selection): "Which word from the past is relevant to the current sentence?"
- The "Copy" Job (Value Transfer): "Okay, I found the right word. Now, give me its full, rich details so I can use them."
The paper argues that searching is simple, but copying details is complex.
The Creative Analogy: The Library Card vs. The Book
Imagine a massive library (the model's memory).
- The Standard Way: Every time you want to find a book, you have to carry the entire book with you just to check its title. You carry a 500-page novel just to see if the title says "Harry Potter." This is incredibly heavy and slow.
- The Paper's New Way:
- The Key (The Search): You only need a tiny library card (a "Thin Key") to find the book. The card just has a few numbers or a short code that says "This is the Harry Potter book." You don't need the whole book to find it; you just need enough info to distinguish it from the other 10,000 books.
- The Value (The Content): Once you find the book using the tiny card, then you pull out the full, heavy book (the "Full Value") to read the actual story.
The Insight: You don't need a 500-page dossier to decide which book to pick. You only need a small index card. But once you pick it, you absolutely need the full book to understand the story.
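This "small index card" intuition can be sanity-checked numerically. The sketch below is an illustration under assumed dimensions (not the paper's setup): it randomly projects 4096-dimensional keys down to just 64 dimensions and checks whether a query still finds the same nearest match.

```python
import numpy as np

# Illustrative demo: a low-dimensional "library card" usually preserves
# which stored key a query matches best. All dimensions are assumptions.
rng = np.random.default_rng(1)
d_full, d_thin, n_keys = 4096, 64, 100

keys = rng.standard_normal((n_keys, d_full))
query = keys[42] + 0.1 * rng.standard_normal(d_full)  # query is near key #42

# Random projection to a much thinner space
P = rng.standard_normal((d_full, d_thin)) / np.sqrt(d_thin)

best_full = np.argmax(keys @ query)            # search with full keys
best_thin = np.argmax((keys @ P) @ (query @ P))  # search with thin keys
print(best_full, best_thin)  # both should be 42 with overwhelming probability
```

The match survives because telling 100 candidates apart needs far fewer dimensions than describing any one of them in full, which is exactly the asymmetry the paper exploits.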
How They Did It
The researchers changed the architecture of the AI so that:
- Keys and Queries (The Search): They made these "thin." Instead of using a massive 4096-dimensional vector (a huge list of numbers), they shrank them down to just 1024 dimensions (or even less). This is like shrinking the library card from a thick booklet to a small sticky note.
- Values (The Content): They kept these "full." The actual information the model reads remains huge and detailed.
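The asymmetric shapes can be sketched in a few lines of NumPy. This is a minimal single-head illustration with assumed dimensions, not the paper's implementation: keys and queries live in a 1024-dimensional space while values keep the full 4096 dimensions.

```python
import numpy as np

d_model = 4096  # full hidden size: values keep this width
d_key = 1024    # "thin" width shared by keys and queries
seq_len = 8

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))

# Projection matrices (randomly initialized, for illustration only)
W_q = rng.standard_normal((d_model, d_key)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_key)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v  # K is 4x thinner than V

scores = Q @ K.T / np.sqrt(d_key)    # the "search" job uses only thin keys
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                    # the "copy" job reads full-width values

print(K.shape, V.shape, out.shape)   # (8, 1024) (8, 4096) (8, 4096)
```

Note that only `K` and `V` need to be cached per token as the story grows, so shrinking `K` directly shrinks the cache.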
Why This Matters (The Magic Results)
The paper tested this idea on everything from tiny toy models to massive 7-billion-parameter models (like Mistral-7B). Here is what happened:
- The "Search" didn't break: Even with the tiny "sticky note" keys, the model could still find the right words perfectly. It turns out you only need a few dimensions to distinguish between different patterns (like "subject," "verb," or "topic").
- Memory Savings: Because the "Keys" are what get stored in the computer's memory (the cache) as the story gets longer, shrinking them saved a massive amount of space.
- For a 7-billion-parameter model reading a long document, this saved 25 GB of memory per user.
- Real-world impact: This means a single computer server could handle 60% more people talking to the AI at the same time without crashing.
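A rough back-of-the-envelope calculation shows how savings on that scale can arise. The layer count, context length, dimensions, and fp16 precision below are assumptions for illustration, not the paper's exact accounting:

```python
# Illustrative KV-cache sizing: one key vector + one value vector is
# cached per token, per layer. All parameters here are assumptions.
def kv_cache_gb(layers, seq_len, key_dim, value_dim, bytes_per_number=2):
    return layers * seq_len * (key_dim + value_dim) * bytes_per_number / 1e9

# Assumed 32-layer model, 128k-token context, fp16 (2 bytes per number)
full = kv_cache_gb(layers=32, seq_len=128_000, key_dim=4096, value_dim=4096)
thin = kv_cache_gb(layers=32, seq_len=128_000, key_dim=1024, value_dim=4096)
print(round(full, 1), round(thin, 1), round(full - thin, 1))  # → 67.1 41.9 25.2
```

Shrinking only the keys from 4096 to 1024 dimensions already removes roughly 25 GB per long-context user in this toy accounting, which is the same order as the savings the paper reports.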
The "Retrofit" Trick
What if you already have a trained model (like GPT-2 or Mistral) and can't retrain it from scratch? The authors found a clever math trick, SVD (singular value decomposition) compression:
- They took the existing "Keys" and mathematically compressed them into a smaller size.
- They did a tiny bit of "fine-tuning" (like a quick refresher course) just on the search mechanism.
- Result: The model kept almost all of its intelligence but lost 75% of its memory footprint.
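The retrofit step can be sketched with a truncated SVD. This is a schematic of the idea under assumed shapes and rank, not the paper's exact recipe: the pretrained key projection is factored into two smaller matrices so that only low-rank keys ever enter the cache.

```python
import numpy as np

# Sketch: factor a pretrained key projection W_k into A @ B so that
# cached keys have `rank` dimensions instead of `d_model`. Shapes,
# rank, and the random "pretrained" weights are illustrative.
rng = np.random.default_rng(0)
d_model, rank, n_tokens = 512, 128, 4

W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :rank] * S[:rank]  # (d_model, rank): the new, thinner key projection
B = Vt[:rank, :]            # (rank, d_model): can be folded into the query side

x = rng.standard_normal((n_tokens, d_model))
q = rng.standard_normal((1, d_model)) @ W_q

thin_keys = x @ A                 # cached: `rank` numbers per token, not d_model
scores = (q @ B.T) @ thin_keys.T  # same algebra as q @ (x @ W_k).T, at low rank
print(thin_keys.shape, scores.shape)  # (4, 128) (1, 4)
```

A random matrix like this toy `W_k` is not actually low-rank, so truncation alone would distort the scores; real pretrained projections are much closer to low-rank, and the brief fine-tuning pass described above recovers the remaining gap.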
Summary in One Sentence
The paper proves that finding the right information in an AI's memory only requires a tiny, low-resolution map, while using that information requires the full, high-resolution picture; by shrinking the map but keeping the picture, we can make AI much faster and cheaper to run without losing its smarts.