The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

This paper shows that although massive activations and attention sinks frequently co-occur in Transformers because of the pre-norm architecture, they serve distinct functions: the former act globally, as implicit parameters, while the latter act locally, modulating short-range attention dependencies.

Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu

Published 2026-03-06

Imagine a massive library (a Large Language Model) where thousands of librarians (the AI's internal "neurons") are working together to write the next sentence of a story. For a long time, researchers noticed two strange, recurring habits in how these librarians behaved:

  1. The "Spike" (Massive Activations): Occasionally, a few specific librarians would suddenly shout so loudly that their voices drowned out everyone else. These weren't just loud whispers; they were extreme mathematical outliers, orders of magnitude larger than normal values.
  2. The "Sink" (Attention Sinks): At the same time, the librarians would almost always ignore the interesting parts of the story and instead stare blankly at the very first word of the sentence, giving it all their attention, even if that word was just "The" or "Once."

For years, people thought these two habits were deeply connected—that the shouting caused the staring. But this new paper, "The Spike, the Sparse and the Sink," reveals that they are actually two different things that just happen to live in the same house because of how the house was built.

Here is the breakdown using simple analogies:

1. The "Spike": The Over-enthusiastic Amplifier

Think of the AI's brain as a series of relay stations.

  • The Problem: In the early stations, a specific type of switch (called a Feed-Forward Block) acts like a quadratic amplifier. Imagine a microphone that doesn't just make a voice louder, but squares the volume. If you whisper "hello," it becomes a roar.
  • Who gets amplified? Only a tiny group of "special" tokens (usually the very first word of a sentence or a punctuation mark like a period).
  • The Result: These tokens get turned up to 11, creating "Massive Activations." They travel through the middle of the library, shouting loudly, until a late station (a "step-down" block) finally turns the volume back down to normal before the final output.
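The "quadratic amplifier" idea can be sketched with a toy GLU-style feed-forward block, the kind used in many modern Transformers. Because a GLU multiplies two linear projections of the same input elementwise, scaling the input by 10× scales the output by roughly 100×. The weights below are random toys, not values from any real model:

```python
import numpy as np

# Toy GLU-style feed-forward block: the elementwise product of two linear
# projections of the same input makes the output grow with the SQUARE of
# the input magnitude. Random toy weights, not from any real model.
rng = np.random.default_rng(0)
d, h = 8, 16
W_gate = rng.normal(size=(d, h)) / np.sqrt(d)
W_up = rng.normal(size=(d, h)) / np.sqrt(d)

def glu_ffn(x):
    # product of two projections -> output is quadratic in |x|
    return (x @ W_gate) * (x @ W_up)

x = rng.normal(size=d)
quiet = np.abs(glu_ffn(x)).max()
loud = np.abs(glu_ffn(10 * x)).max()  # scale the input by 10x
print(loud / quiet)                   # ≈ 100x: the "quadratic amplifier"
```

This is why a token that arrives at the block only slightly louder than its neighbors can leave it dramatically louder.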

2. The "Sink": The Lazy Librarian

Now, why do the librarians stare at the first word?

  • The Glitch: The library uses a specific rule called Pre-Norm. Before the librarians speak, they have to normalize their voices (make sure they are all at a standard volume).
  • The Trick: Because the "Spike" tokens are shouting so incredibly loud, the normalization rule has to turn their volume way down to fit the standard. But here's the catch: when you turn a massive, chaotic shout down to a whisper, it loses all its unique shape. It becomes a flat, boring, identical sound for every single "Spike" token.
  • The Consequence: To the librarians, the first word (and other special tokens) no longer looks like "The" or "Once." It looks like a constant, boring, safe anchor. Because it's so predictable and stable, the attention mechanism (the librarians' eyes) latches onto it as a "default" place to look. It's like a safety net; the AI uses the first word as a place to dump extra attention so it doesn't have to work as hard on the complex middle parts of the sentence.
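The "loses all its unique shape" step can be seen in a minimal RMSNorm sketch (pre-norm Transformers typically use RMSNorm or LayerNorm). Because normalization divides by the vector's overall magnitude, a huge outlier dominates the normalized direction, and two tokens with completely different content but the same spike come out looking nearly identical. Toy vectors only:

```python
import numpy as np

# Why a pre-norm "spike" token looks constant to attention: RMSNorm divides
# by the vector's magnitude, so a massive outlier dominates the normalized
# direction regardless of the token's actual content. Toy vectors only.
def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

rng = np.random.default_rng(1)
d = 64
content_a = rng.normal(size=d)   # two tokens with different content...
content_b = rng.normal(size=d)
spike = np.zeros(d)
spike[0] = 1000.0                # ...but the same massive activation

a = rms_norm(content_a + spike)
b = rms_norm(content_b + spike)
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)  # ≈ 1.0: after normalization the two tokens are nearly identical
```

That near-constant normalized vector is exactly the "boring, safe anchor" the attention mechanism latches onto.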

3. The Big Revelation: They Are Roommates, Not Twins

The paper's biggest discovery is that the Shouting (Spike) and the Staring (Sink) are not the same thing. They just happen to coexist because of the building's architecture (the Pre-Norm design).

The researchers proved this by renovating the library:

  • Fixing the Shouting: They changed the normalization rules (like adding a "Sandwich" layer of soundproofing). This stopped the "Spike" tokens from getting so loud. Result: The shouting stopped, but the librarians still stared at the first word.
  • Fixing the Staring: They changed how the librarians decide what to look at (using "Gated Attention," like giving them a dynamic filter). Result: The staring stopped, but the "Spike" tokens were still shouting.
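The gated-attention fix can be illustrated with a toy sketch of the underlying problem: softmax attention weights must sum to 1, so even when no token is relevant, the head is forced to put its weight somewhere (often the first token). A learned sigmoid output gate lets the head shut itself off instead. The gate logit below is a hand-set stand-in for a learned parameter, and this is a simplification of gated attention, not the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Plain attention: weights must sum to 1, so even when nothing is relevant
# the head must put its weight SOMEWHERE -- often the first token (the sink).
scores = np.array([0.0, 0.1, -0.1, 0.05])
values = np.random.default_rng(2).normal(size=(4, 8))
plain_out = softmax(scores) @ values  # forced, non-zero output

# Gated attention: a sigmoid gate on the head's output can drive it to ~0,
# so no sink is needed. The gate logit here is a toy stand-in for a learned one.
gate_logit = -10.0                    # "this head has nothing to say"
gated_out = (1 / (1 + np.exp(-gate_logit))) * plain_out

print(np.abs(plain_out).max(), np.abs(gated_out).max())
```

With the gate nearly closed, the head no longer needs a "default" token to dump attention on, which is why the staring stops even though the spikes remain.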

The Analogy: It's like a car with a loud engine and a sticky steering wheel. For a long time, people thought the loud engine caused the steering wheel to stick. But this paper shows that if you fix the engine, the wheel still sticks. If you fix the wheel, the engine is still loud. They are just two separate quirks of the same car model.

Why Does This Matter?

Understanding this separation is a game-changer for AI efficiency:

  • Quantization (Compression): If you want to shrink the AI to run on a phone, you usually have to deal with the "Spike" (the loud outliers) because they break the math. Now we know we can fix the "Spike" without breaking the "Sink" (which helps the AI understand short sentences).
  • Long-Context Reading: The "Sink" happens because the AI is trained on short stories. It uses the first word as a crutch. If we train the AI on long books, it stops needing the crutch, and the "Sink" disappears naturally.
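The quantization point can be made concrete with a toy absmax int8 example: the quantization scale is set by the largest value, so one massive outlier squeezes every normal value into a handful of integer levels. A minimal sketch, not any particular library's quantizer:

```python
import numpy as np

# Why massive-activation outliers break quantization: absmax int8 scales by
# the largest value, so one 1000x outlier wastes most of the integer range
# and crushes the resolution available to the "normal" values.
def quantize_int8(x):
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(3)
normal = rng.normal(size=127)
with_spike = np.append(normal, 1000.0)    # one massive activation

q, scale = quantize_int8(with_spike)
recon = q.astype(np.float32) * scale
err = np.abs(recon[:-1] - normal).mean()  # error on the normal values only
print(err)  # large: most of the int8 range is wasted on the outlier
```

Remove the spike (or handle it separately) and the same quantizer reconstructs the normal values far more accurately, which is exactly why it matters that the spike can be fixed independently of the sink.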

In a Nutshell:
The "Spike" is a mathematical side-effect of how the AI amplifies signals. The "Sink" is a learned habit where the AI uses the first word as a lazy anchor. They look like they are best friends, but they are actually just neighbors who happen to live in the same weirdly designed apartment building. By redesigning the building, we can fix one problem without breaking the other.