The Big Problem: The "Memory Hoarder" AI
Imagine you are trying to teach a giant robot (a Large Language Model, or LLM) to write stories. To do this, the robot has to read millions of words at once.
In the robot's brain, there is a special section called Attention. This is where the robot decides which words are important and how they relate to each other. To do this, it creates three lists of data for every single word it reads: Q (Query), K (Key), and V (Value).
The Bottleneck:
Think of these lists as massive filing cabinets. As the robot reads more words (longer sentences) and processes more examples at once (larger batches), these filing cabinets get huge.
- The robot needs to keep these cabinets open in its "working memory" (RAM) while it learns.
- Unfortunately, these cabinets take up so much space that they often fill up the robot's entire memory, forcing the robot to stop working or run very slowly.
- Current solutions try to make the filing process faster, but they don't shrink the cabinets themselves.
The Solution: PAMM (The "Smart Summarizer")
The authors of this paper propose a new trick called PAMM (Point-Approximate Matrix Multiplication).
Here is how PAMM works, using a few analogies:
1. The "Group Photo" Analogy
Imagine you are taking a photo of a crowd of 10,000 people (the data tokens).
- Old Way: You take a high-resolution photo of every single person. You need massive storage space to save 10,000 individual faces.
- PAMM Way: You realize that most people in the crowd look very similar. Maybe there are 50 distinct "types" of people (e.g., the "businessman," the "student," the "tourist").
- Instead of saving 10,000 photos, you take 50 high-quality photos of these representative types (these are the Generators).
- Then, you just write down a simple note for the other 9,950 people: "Person #42 looks like the 'Student' photo, just slightly brighter."
You have now stored the entire crowd using 50 photos + a list of notes, instead of 10,000 photos. You saved 99% of the space!
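The "group photo" trick can be sketched in a few lines of toy code. This is a simplified illustration of the idea, not the paper's actual algorithm: the way generators are picked (random sampling) and the single-scale "note" per token are assumptions made for clarity.

```python
import numpy as np

def compress(tokens, num_generators=50, seed=0):
    """Toy generator-style compression: keep a few representative rows
    (the "photos") and, for every other row, a note saying which
    generator it resembles and by what scale factor."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(tokens), size=num_generators, replace=False)
    generators = tokens[idx]                       # the 50 kept "photos"
    # Unit-normalize generators so a dot product gives the best scale.
    units = generators / np.linalg.norm(generators, axis=1, keepdims=True)
    sims = tokens @ units.T                        # (n_tokens, n_generators)
    which = np.abs(sims).argmax(axis=1)            # the "looks like..." note
    scale = sims[np.arange(len(tokens)), which]    # the "slightly brighter" note
    return generators, which, scale

def reconstruct(generators, which, scale):
    """Rebuild an approximate crowd from the photos plus the notes."""
    units = generators / np.linalg.norm(generators, axis=1, keepdims=True)
    return scale[:, None] * units[which]

tokens = np.random.default_rng(1).normal(size=(10_000, 64))
gens, which, scale = compress(tokens)
approx = reconstruct(gens, which, scale)
# Stored: 50 full vectors plus one (index, scale) pair per token,
# instead of 10,000 full vectors.
```

With 64-dimensional tokens, that's roughly 3,200 stored numbers for the generators plus 20,000 tiny notes, versus 640,000 numbers for the raw data.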
2. The "Recipe" Analogy
In the robot's brain, the data is like a giant pot of soup.
- The Old Way: To remember the soup, you have to keep the whole pot on the stove. It's heavy and takes up a lot of counter space.
- The PAMM Way: You realize the soup is mostly water with a few key ingredients (spices, veggies).
- You save a tiny jar of the key ingredients (the Generators).
- You write a recipe card that says: "To make the soup for this batch, take 10% of the 'Spice Jar' and 5% of the 'Veggie Jar'."
- When the robot needs to remember the soup, it doesn't need the whole pot; it just needs the jar and the recipe card.
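In math terms, the recipe card is just a short list of coefficients: the soup is rebuilt as a weighted sum of the jars. A tiny illustration (the jars and percentages here are made-up values, not anything from the paper):

```python
import numpy as np

# Hypothetical "jars" (generators) and a recipe card of coefficients.
spice_jar = np.array([1.0, 0.0, 2.0])
veggie_jar = np.array([0.0, 3.0, 1.0])
recipe = {"spice": 0.10, "veggie": 0.05}

# Rebuilding the soup is just a weighted sum of the jars.
soup = recipe["spice"] * spice_jar + recipe["veggie"] * veggie_jar
```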
How It Works in the Robot's Brain
The paper focuses specifically on the Q, K, and V projections. These are the steps where the robot turns raw words into those "filing cabinet" lists.
- Compression (The Forward Pass): As the robot reads words, instead of saving every single word's data, PAMM picks a few "representative" words (Generators). It then says, "This new word is just a scaled-up version of that representative word." It saves the representative and the scaling factor, throwing away the rest.
- Approximation (The Backward Pass): When the robot needs to learn from its mistakes (backpropagation), it usually needs to look at all the original data again. PAMM says, "No need to look at the original 10,000 words. Just look at our 50 representatives and the recipe cards. We can calculate the learning step almost perfectly using just those."
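Putting the two passes together, here is a rough sketch for a plain linear projection (the kind used for Q, K, and V). The compression scheme inside is the same simplified placeholder as above, not the paper's exact method: the point is only that the forward pass saves a tiny summary instead of the full input, and the backward pass computes the weight gradient from that summary.

```python
import numpy as np

def forward(x, w, num_generators=8, seed=0):
    """Project tokens (y = x @ w), but save only a compressed summary
    of x for the backward pass instead of x itself."""
    y = x @ w
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=num_generators, replace=False)
    gens = x[idx]
    units = gens / np.linalg.norm(gens, axis=1, keepdims=True)
    sims = x @ units.T
    which = np.abs(sims).argmax(axis=1)
    scale = sims[np.arange(len(x)), which]
    saved = (units, which, scale)        # tiny, versus storing all of x
    return y, saved

def backward(saved, grad_y):
    """Weight gradient is normally x.T @ grad_y; here we rebuild an
    approximate x from the generators and recipe notes instead."""
    units, which, scale = saved
    x_approx = scale[:, None] * units[which]
    return x_approx.T @ grad_y           # approximate gradient of w

rng = np.random.default_rng(1)
x = rng.normal(size=(128, 16))
w = rng.normal(size=(16, 4))
y, saved = forward(x, w)
grad_w = backward(saved, np.ones_like(y))
```

The memory win comes from `saved` being a handful of vectors and per-token notes rather than the full activation matrix; the backward pass never needs the original `x` again.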
The Results: Magic Numbers
The authors tested this on various AI models (from small to huge, like LLaMA). Here is what happened:
- Memory Savings: They reduced the memory needed for these specific parts by 512 times. That's like shrinking a 512GB hard drive down to a 1GB USB stick.
- Performance: Surprisingly, the robot didn't get "dumber." In fact, in some cases, it got slightly better. The authors suggest that the "extra" data the robot was trying to remember was actually just noise or repetition that confused it. By removing the redundancy, the robot learned faster.
- Speed: It didn't slow the robot down much. The extra math required to do the "summarizing" was negligible compared to the time saved by not moving massive amounts of data around.
Why This Matters
Think of training an AI like driving a car with a very heavy trunk full of bricks.
- Current methods try to drive faster or take a shortcut.
- PAMM realizes the bricks are mostly air-filled balloons. It pops the balloons, throws away the air, and keeps only the rubber skins. Suddenly, the car is light, fast, and can drive much further without running out of gas (memory).
In short: PAMM is a clever way to compress the "working memory" of AI models by realizing that most of the data they process is repetitive. By keeping only the "essence" and a few notes on how to recreate the rest, we can train massive AI models on much smaller, cheaper computers without losing any intelligence.