The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

This paper shows that although massive activations and attention sinks frequently co-occur in Transformers because of the pre-norm architecture, they serve distinct functions: the former act globally, as implicit parameters, while the latter act locally, modulating short-range attention dependencies.

Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu

Published 2026-03-06

Imagine a massive library (a Large Language Model) where thousands of librarians (the AI's internal "neurons") are working together to write the next sentence of a story. For a long time, researchers noticed two strange, recurring habits in how these librarians behaved:

  1. The "Spike" (Massive Activations): Occasionally, a few specific librarians would suddenly shout so loudly that their voices drowned out everyone else. These weren't just loud whispers; they were extreme mathematical outliers, orders of magnitude larger than normal values.
  2. The "Sink" (Attention Sinks): At the same time, the librarians would almost always ignore the interesting parts of the story and instead stare blankly at the very first word of the sentence, giving it all their attention, even if that word was just "The" or "Once."

For years, people thought these two habits were deeply connected—that the shouting caused the staring. But this new paper, "The Spike, the Sparse and the Sink," reveals that they are actually two different things that just happen to live in the same house because of how the house was built.

Here is the breakdown using simple analogies:

1. The "Spike": The Over-enthusiastic Amplifier

Think of the AI's brain as a series of relay stations.

  • The Problem: In the early stations, a specific type of switch (called a Feed-Forward Block) acts like a quadratic amplifier. Imagine a microphone that doesn't just make a voice louder, but squares the volume. If you whisper "hello," it becomes a roar.
  • Who gets amplified? Only a tiny group of "special" tokens (usually the very first word of a sentence or a punctuation mark like a period).
  • The Result: These tokens get turned up to 11, creating "Massive Activations." They travel through the middle of the library, shouting loudly, until a late station (a "step-down" block) finally turns the volume back down to normal before the final output.
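The "quadratic amplifier" idea can be sketched with a toy GLU-style feed-forward block, the kind used in many modern Transformers. Because a GLU multiplies two linear projections of the same input elementwise, scaling the input by 10× scales the output by roughly 100×. The weights below are random toys, not values from any real model:

```python
import numpy as np

# Toy GLU-style feed-forward block: the elementwise product of two linear
# projections of the same input makes the output grow with the SQUARE of
# the input magnitude. Random toy weights, not from any real model.
rng = np.random.default_rng(0)
d, h = 8, 16
W_gate = rng.normal(size=(d, h)) / np.sqrt(d)
W_up = rng.normal(size=(d, h)) / np.sqrt(d)

def glu_ffn(x):
    # product of two projections -> output is quadratic in |x|
    return (x @ W_gate) * (x @ W_up)

x = rng.normal(size=d)
quiet = np.abs(glu_ffn(x)).max()
loud = np.abs(glu_ffn(10 * x)).max()  # scale the input by 10x
print(loud / quiet)                   # ≈ 100x: the "quadratic amplifier"
```

This is why a token that arrives at the block only slightly louder than its neighbors can leave it dramatically louder.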

2. The "Sink": The Lazy Librarian

Now, why do the librarians stare at the first word?

  • The Glitch: The library uses a specific rule called Pre-Norm. Before the librarians speak, they have to normalize their voices (make sure they are all at a standard volume).
  • The Trick: Because the "Spike" tokens are shouting so incredibly loud, the normalization rule has to turn their volume way down to fit the standard. But here's the catch: when you turn a massive, chaotic shout down to a whisper, it loses all its unique shape. It becomes a flat, boring, identical sound for every single "Spike" token.
  • The Consequence: To the librarians, the first word (and other special tokens) no longer looks like "The" or "Once." It looks like a constant, boring, safe anchor. Because it's so predictable and stable, the attention mechanism (the librarians' eyes) latches onto it as a "default" place to look. It's like a safety net; the AI uses the first word as a place to dump extra attention so it doesn't have to work as hard on the complex middle parts of the sentence.
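The "loses all its unique shape" step can be seen in a minimal RMSNorm sketch (pre-norm Transformers typically use RMSNorm or LayerNorm). Because normalization divides by the vector's overall magnitude, a huge outlier dominates the normalized direction, and two tokens with completely different content but the same spike come out looking nearly identical. Toy vectors only:

```python
import numpy as np

# Why a pre-norm "spike" token looks constant to attention: RMSNorm divides
# by the vector's magnitude, so a massive outlier dominates the normalized
# direction regardless of the token's actual content. Toy vectors only.
def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

rng = np.random.default_rng(1)
d = 64
content_a = rng.normal(size=d)   # two tokens with different content...
content_b = rng.normal(size=d)
spike = np.zeros(d)
spike[0] = 1000.0                # ...but the same massive activation

a = rms_norm(content_a + spike)
b = rms_norm(content_b + spike)
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos)  # ≈ 1.0: after normalization the two tokens are nearly identical
```

That near-constant normalized vector is exactly the "boring, safe anchor" the attention mechanism latches onto.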

3. The Big Revelation: They Are Roommates, Not Twins

The paper's biggest discovery is that the Shouting (Spike) and the Staring (Sink) are not the same thing. They just happen to coexist because of the building's architecture (the Pre-Norm design).

The researchers proved this by renovating the library:

  • Fixing the Shouting: They changed the normalization rules (like adding a "Sandwich" layer of soundproofing). This stopped the "Spike" tokens from getting so loud. Result: The shouting stopped, but the librarians still stared at the first word.
  • Fixing the Staring: They changed how the librarians decide what to look at (using "Gated Attention," like giving them a dynamic filter). Result: The staring stopped, but the "Spike" tokens were still shouting.
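The gated-attention fix can be illustrated with a toy sketch of the underlying problem: softmax attention weights must sum to 1, so even when no token is relevant, the head is forced to put its weight somewhere (often the first token). A learned sigmoid output gate lets the head shut itself off instead. The gate logit below is a hand-set stand-in for a learned parameter, and this is a simplification of gated attention, not the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Plain attention: weights must sum to 1, so even when nothing is relevant
# the head must put its weight SOMEWHERE -- often the first token (the sink).
scores = np.array([0.0, 0.1, -0.1, 0.05])
values = np.random.default_rng(2).normal(size=(4, 8))
plain_out = softmax(scores) @ values  # forced, non-zero output

# Gated attention: a sigmoid gate on the head's output can drive it to ~0,
# so no sink is needed. The gate logit here is a toy stand-in for a learned one.
gate_logit = -10.0                    # "this head has nothing to say"
gated_out = (1 / (1 + np.exp(-gate_logit))) * plain_out

print(np.abs(plain_out).max(), np.abs(gated_out).max())
```

With the gate nearly closed, the head no longer needs a "default" token to dump attention on, which is why the staring stops even though the spikes remain.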

The Analogy: It's like a car with a loud engine and a sticky steering wheel. For a long time, people thought the loud engine caused the steering wheel to stick. But this paper shows that if you fix the engine, the wheel still sticks. If you fix the wheel, the engine is still loud. They are just two separate quirks of the same car model.

Why Does This Matter?

Understanding this separation is a game-changer for AI efficiency:

  • Quantization (Compression): If you want to shrink the AI to run on a phone, you usually have to deal with the "Spike" (the loud outliers) because they break the math. Now we know we can fix the "Spike" without breaking the "Sink" (which helps the AI understand short sentences).
  • Long-Context Reading: The "Sink" happens because the AI is trained on short stories. It uses the first word as a crutch. If we train the AI on long books, it stops needing the crutch, and the "Sink" disappears naturally.
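The quantization point can be made concrete with a toy absmax int8 example: the quantization scale is set by the largest value, so one massive outlier squeezes every normal value into a handful of integer levels. A minimal sketch, not any particular library's quantizer:

```python
import numpy as np

# Why massive-activation outliers break quantization: absmax int8 scales by
# the largest value, so one 1000x outlier wastes most of the integer range
# and crushes the resolution available to the "normal" values.
def quantize_int8(x):
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

rng = np.random.default_rng(3)
normal = rng.normal(size=127)
with_spike = np.append(normal, 1000.0)    # one massive activation

q, scale = quantize_int8(with_spike)
recon = q.astype(np.float32) * scale
err = np.abs(recon[:-1] - normal).mean()  # error on the normal values only
print(err)  # large: most of the int8 range is wasted on the outlier
```

Remove the spike (or handle it separately) and the same quantizer reconstructs the normal values far more accurately, which is exactly why it matters that the spike can be fixed independently of the sink.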

In a Nutshell:
The "Spike" is a mathematical side-effect of how the AI amplifies signals. The "Sink" is a learned habit where the AI uses the first word as a lazy anchor. They look like they are best friends, but they are actually just neighbors who happen to live in the same weirdly designed apartment building. By redesigning the building, we can fix one problem without breaking the other.