Imagine you are running a massive, high-speed library where books (data) are constantly being read, analyzed, and rewritten. This library is run by a team of super-intelligent librarians called Transformers.
For years, the standard way these librarians worked involved a three-step process for every single book they touched:
- The Query (Q): A librarian asks, "What am I looking for?" (They create a search query).
- The Key (K): They look at the book's spine to see if it matches the query.
- The Value (V): If it matches, they pull the book off the shelf and read the content.
In the world of AI, these three steps are handled by three giant mathematical "weight" matrices (think of them as complex instruction manuals). The paper argues that the Query instruction manual is actually redundant. You can throw it away, replace it with a simple "Yes, I see it" note (an Identity Matrix), and the library will still run perfectly fine.
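The three "instruction manuals" are literally three weight matrices. A minimal numpy sketch of standard attention versus the identity-query variant (shapes and names are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                          # model width (illustrative)
X = rng.normal(size=(5, d))    # 5 tokens, each a d-dimensional vector

# The three "instruction manuals": weight matrices.
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d)) @ V

out_standard = attention(X, W_Q, W_K, W_V)

# The paper's proposal: replace the Query manual with a blank
# "Yes, I see it" note -- the identity matrix.
out_identity = attention(X, np.eye(d), W_K, W_V)
```

With the identity, each token simply uses itself as the query; the rest of the mechanism is unchanged.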
Here is the breakdown of their discovery, using simple analogies:
1. The "Telescoping" Trick
The authors realized that the "Query" step is just a way of translating the book's language into a specific dialect before checking the Key. But here's the secret: The next librarian in the line can just speak that dialect directly.
Imagine a relay race.
- Old Way: Runner A runs a mile, stops to translate the baton's message into a different language, hands the baton to Runner B, who translates it back, then Runner C does the same.
- New Way: Runner A just runs the mile. Runner B speaks the language Runner A is already using. Runner C does the same.
By removing the "Query" translator, the team found they could save 25% of the memory and computing power needed for the attention mechanism (the part of the AI that decides what to focus on). It's like realizing you don't need a dictionary to read a book if the author just writes in the language you already know.
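The "telescoping" is just matrix associativity: the query and key matrices only ever appear together as the product W_Q·W_Kᵀ, so W_Q can be folded into the key matrix exactly. A numpy check (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
X = rng.normal(size=(5, d))
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

# Standard attention scores: (X W_Q)(X W_K)^T = X (W_Q W_K^T) X^T
scores = (X @ W_Q) @ (X @ W_K).T

# "Telescoping": fold W_Q into the key manual, then throw W_Q away.
W_K_folded = W_K @ W_Q.T
scores_folded = X @ (X @ W_K_folded).T   # query is now just X itself
```

The two score matrices are identical, so nothing is lost; with Q, K, V and the output projection each being one matrix, deleting one of four is the quoted 25% of the attention weights.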
2. The "Free Lunch" (Single Layer)
The paper proves that if you have just one layer of these librarians, you can instantly delete the Query manual and rewrite the Key and Value manuals to compensate. It's a "Free Lunch" because you get the same result with fewer ingredients.
However, in a deep library with many layers (like a 12-story building), it gets tricky. If you remove the Query manual on the 1st floor, the 2nd floor might get confused because the "language" of the data has changed.
- The Solution: The authors found two ways to fix this:
- Option A: Only put "skip connections" (elevators) around the Query/Key/Value part, not the whole room. This allows the "language" to change smoothly between floors.
- Option B: Make every floor use the exact same instruction manual (Weight Sharing). If everyone speaks the same language, you don't need a translator on any floor.
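Option B can be sketched as a toy residual stack in which every "floor" reuses the same key/value matrices (a minimal illustration assuming no LayerNorm or MLP, which the real model would have):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_layers = 8, 4
X = rng.normal(size=(5, d))

# Option B: one shared set of manuals for every floor.
W_K_shared = rng.normal(size=(d, d)) * 0.1
W_V_shared = rng.normal(size=(d, d)) * 0.1

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

h = X
for _ in range(n_layers):
    # Identity query: each token attends using itself as the query.
    scores = h @ (h @ W_K_shared).T / np.sqrt(d)
    # Skip connection around the attention sublayer (Option A's placement).
    h = h + softmax(scores) @ (h @ W_V_shared)
```

Because every layer speaks the same "dialect" (same W_K, W_V), no per-layer query translation is needed.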
3. The Real-World Test (The "GPT" Experiment)
Theory is great, but does it work in practice? The authors built a small AI model (a "GPT-style" model) from scratch.
- The Control Group: A standard model with all three manuals (Query, Key, Value).
- The Experimental Group: A model where the Query manual was deleted and replaced with a blank "Identity" note.
- The Result: The experimental model performed just as well as the standard one, even though it had 8% fewer total parameters.
- The Bonus: When they took the "saved" space from deleting the Query manual and gave it to the "MLP" (the part of the AI that does the heavy thinking/creative writing), the model actually got better than the standard one.
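The headline numbers are consistent with a back-of-the-envelope count. Assuming standard GPT proportions per layer (four d×d attention matrices and roughly 8·d² MLP parameters; these are conventional figures, not ones taken from the paper), deleting W_Q saves:

```python
d = 768                      # model width (illustrative; GPT-2-small sized)
attn_params = 4 * d * d      # W_Q, W_K, W_V and the output projection
mlp_params = 8 * d * d       # two d-by-4d projections
layer_params = attn_params + mlp_params

saved = d * d                # the deleted Query manual
print(saved / attn_params)   # 0.25 -> 25% of the attention mechanism
print(saved / layer_params)  # ~0.083 -> roughly 8% of the layer's weights
```

That is where the "25% of attention" and "8% of total parameters" figures line up.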
4. Why This Matters: The "Regularization" Effect
The paper discovered something surprising about why this works.
- Old Way: The AI had to learn a complex, quadratic relationship (a fancy curve) to figure out what to look for. This is hard to learn and easy to overfit (memorize the training data instead of learning the rules).
- New Way: By removing the Query, the relationship becomes linear (a straight line).
- The Analogy: Imagine trying to balance a broom on your finger.
- With the Query: You are balancing the broom on a wobbly, spinning platform. It's unstable.
- Without the Query: You are balancing it on a flat, steady table.
- Because it's more stable, the AI can learn with much less "braking" (weight decay). It's like driving a car that naturally stays in its lane, so you don't need to constantly jerk the steering wheel to keep it straight.
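The "quadratic vs. linear" claim is about how the attention scores depend on the learnable weights, and it can be checked directly: with the query fixed to the identity, the score map is linear in W_K (additivity holds exactly), while with two learned matrices the scores are a product of weights, so additivity fails. A numpy sketch (illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
X = rng.normal(size=(4, d))
A, B, C, D = (rng.normal(size=(d, d)) for _ in range(4))

def scores_identity_query(W_K):
    return X @ (X @ W_K).T            # linear in the single weight W_K

def scores_learned_query(W_Q, W_K):
    return (X @ W_Q) @ (X @ W_K).T    # product of weights: "quadratic"

# Linear map: f(A + B) == f(A) + f(B) holds exactly.
lin_lhs = scores_identity_query(A + B)
lin_rhs = scores_identity_query(A) + scores_identity_query(B)
assert np.allclose(lin_lhs, lin_rhs)

# Quadratic map: additivity fails -- the cross terms don't cancel.
quad_lhs = scores_learned_query(A + C, B + D)
quad_rhs = scores_learned_query(A, B) + scores_learned_query(C, D)
assert not np.allclose(quad_lhs, quad_rhs)
```

The flat, linear dependence is the "steady table" in the analogy: its optimization landscape is better behaved, so less weight decay is needed.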
The Big Picture
This paper suggests that for decades, we have been carrying around a heavy backpack (the Query weights) that we didn't actually need.
- Efficiency: We can build faster, cheaper AI models.
- Simplicity: The math becomes cleaner and easier to understand.
- Stability: The models are less likely to crash or get confused during training.
The authors conclude that the "Query-Key-Value" trio might be a historical artifact of how we designed these models, rather than a fundamental necessity. By dropping the Query, we aren't losing intelligence; we are just stopping the AI from doing unnecessary paperwork.