How Attention Sinks Emerge in Large Language Models: An Interpretability Perspective

This paper identifies a simple, semantics-free "P0 Sink Circuit" that emerges early in training and explains how Large Language Models develop attention sinks on the first token; the authors suggest the circuit's formation could serve as a signal for tracking pre-training convergence.

Runyu Peng, Ruixiao Li, Mingshu Chen, Yunhua Zhou, Qipeng Guo, Xipeng Qiu

Published 2026-03-10

Here is an explanation of the paper "How Attention Sinks Emerge in Large Language Models," broken down into simple concepts with creative analogies.

The Big Picture: The "First Seat" Phenomenon

Imagine a large classroom of students (the Large Language Model) trying to solve a puzzle together. The teacher gives them a long list of instructions (the input sequence).

In almost every classroom, no matter how smart the students are, they all seem to stare intensely at the very first student in the row. They keep looking back at that first student, even when the conversation has moved on to the 50th word.

In AI terms, this is called an "Attention Sink." The model "sinks" its attention onto the first token (the first word or symbol) disproportionately.

For a long time, scientists thought this was a bug or a quirk caused by a special "Start" button (called the [BOS] token) that models use to know where a sentence begins. They thought, "Oh, the model is just looking at the Start button."

This paper says: "No, that's not it."

The authors discovered that even if you take away the "Start" button, the model still stares at the first word. Why? Because of a clever little trick the model teaches itself, which they call the P0 Sink Circuit.


The Mechanism: The "Spotlight Amplifier"

How does the model know which word is #1 without a special button? It uses a two-step process involving the Causal Mask (the rule that says "you can only look at your own word and the words that came before it").

1. The "Solo" vs. The "Crowd"

Imagine the first student (Position 0) and the second student (Position 1).

  • The Second Student: Can look at the First Student and themselves. Their view is a mix of two things.
  • The First Student: Can only look at themselves. They have no one else to look at.

Because of this rule, the First Student's "view" is pure and unmixed. The other students' views are a messy blend of many different people.
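This "solo view" falls straight out of the causal mask. A minimal NumPy sketch (toy random scores standing in for real query-key products, not values from the paper) shows that after masking and softmax, position 0's attention row is forced to be a one-hot on itself:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions (j > i), then softmax each row."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy scores for a 4-token sequence
rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(4, 4)))

print(weights[0])  # position 0 attends only to itself: [1. 0. 0. 0.]
print(weights[1])  # position 1 blends positions 0 and 1; later keys are zeroed out
```

Position 0's row has exactly one unmasked entry, so the softmax sends it to 1 no matter what the raw score was; every later row is a mixture.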

2. The Amplifier (The MLP Layer)

The model has a part of its brain called an MLP (a type of neural network layer) that acts like a volume knob or a spotlight amplifier.

  • The model notices that the First Student's "view" is unique and consistent (because it's the only one looking at just itself).
  • The model turns up the volume on this specific signal. It makes the First Student's "hidden state" (their internal representation) huge and bright.
  • Mathematically, this increases the ℓ₂ norm (a fancy way of saying "magnitude" or "loudness") of that first token.

The Result: Because the First Student is now so loud and bright, every other student in the class naturally turns their heads to look at them. The attention "sinks" there.
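To see why a "loud" token wins, remember that attention weights are exponential in the query-key dot products, so one amplified key dominates the softmax. A toy example with hand-picked vectors (assumed for illustration, not taken from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

query = np.array([1.0, 0.5, -0.5, 0.25])

# Keys for four tokens. Token 0's key points the same way as token 3's,
# but the "amplifier" has blown up its l2 norm by 10x.
base = np.array([0.6, 0.4, -0.2, 0.1])
keys = np.stack([
    10.0 * base,                        # token 0: amplified
    np.array([0.3, -0.6, 0.2, 0.5]),    # ordinary token
    np.array([-0.4, 0.2, 0.7, -0.1]),   # ordinary token
    base,                               # same direction as token 0, normal norm
])

weights = softmax(keys @ query)
print(weights[0])  # token 0 soaks up over 99% of the attention mass
```

Note that token 3 has the same direction as token 0 and still gets almost nothing: it is purely the larger norm, not the content, that creates the sink.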

Why Do They Do This? (The "Anchor" Theory)

You might ask, "Why does the model want to stare at the first word? Isn't that distracting?"

Think of the first word as a heavy anchor dropped in the ocean.

  • As the model processes a long sentence, the "currents" of information can get messy.
  • By keeping a super-strong, fixed anchor at the very beginning, the model stabilizes the whole system. It gives the model a consistent reference point so it doesn't get lost in the middle of a long story.
  • It's like a ship captain keeping one eye on the lighthouse at the harbor entrance to make sure they haven't drifted off course, even while navigating a stormy sea.

The Training Journey: How the Model Learns This

The authors didn't just look at finished models; they watched a model being trained from scratch (like watching a baby learn to walk). They found the "Attention Sink" happens in three stages:

  1. The Wandering Phase (Early Training):
    At first, the model is confused. It tries to focus on the first word, but the signal is weak. It might even try to focus on the second word or the third word. It's like a baby trying to stand up but wobbling around.

  2. The Transition Phase:
    The model realizes, "Hey, focusing on the second word is okay, but it's not stable." It starts to shift its focus back toward the beginning.

  3. The Stable Phase (Maturity):
    Eventually, the model builds that "Spotlight Amplifier" circuit in its very first two layers. It locks onto the first word with laser focus. Once built, the circuit persists for the rest of training, becoming a permanent feature of the trained model.
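If you wanted to watch these phases yourself, one simple diagnostic (a hypothetical metric for illustration, not necessarily the paper's exact measurement) is the fraction of post-softmax attention mass landing on position 0, averaged over heads and query positions:

```python
import numpy as np

def sink_score(attn):
    """Average attention mass on position 0.

    attn: array of shape (heads, seq, seq) of post-softmax attention
    weights (each row sums to 1). Averages column 0 over all heads
    and query positions; a score near 1.0 means a strong sink.
    """
    return float(attn[:, :, 0].mean())

# A toy 1-head, 3-token attention map where every query leans on token 0.
attn = np.array([[[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.8, 0.1, 0.1]]])
print(sink_score(attn))  # → 0.9
```

Logged over training checkpoints, a score like this would be flat and noisy in the wandering phase, drift upward during the transition, and saturate once the circuit locks in.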

The "Start" Button Myth Busted

The paper proves that the [BOS] token (the special "Start" symbol) is just a helpful crutch, not the cause.

  • With the [BOS] token: The model uses the crutch to find the start easily.
  • Without the [BOS] token: The model is forced to build its own internal "Spotlight Amplifier" to find the start. It's a bit harder at first, but once it builds the circuit, it works just as well.

Why This Matters

  1. It's a Feature, Not a Bug: We used to think attention sinks were a mistake. Now we know they are a clever, built-in safety mechanism that helps models handle long texts.
  2. Training Monitor: The authors suggest that by watching when this "Spotlight Amplifier" circuit forms during training, we can tell if a model is "growing up" correctly. If the circuit forms early and stays in the first two layers, the model is likely converging (learning) well.
  3. Future Designs: Understanding this helps engineers build better models. Maybe we can design models that don't need to stare at the first word so much, or maybe we can use this "anchor" trick to make models better at reading very long documents.

Summary in One Sentence

Large Language Models naturally learn to turn up the volume on the very first word of a sentence to create a stable "anchor" for their attention, a clever trick they invent themselves to keep from getting lost, regardless of whether they have a special "Start" button or not.