How Attention Sinks Emerge in Large Language Models: An Interpretability Perspective

This paper identifies a simple, semantics-free "P0 Sink Circuit" that emerges early in training and explains how Large Language Models develop attention sinks on the first token; the authors suggest the circuit's formation could serve as a signal for tracking pre-training convergence.

Runyu Peng, Ruixiao Li, Mingshu Chen, Yunhua Zhou, Qipeng Guo, Xipeng Qiu

Published 2026-03-10

Here is an explanation of the paper "How Attention Sinks Emerge in Large Language Models," broken down into simple concepts with creative analogies.

The Big Picture: The "First Seat" Phenomenon

Imagine a large classroom of students (the Large Language Model) trying to solve a puzzle together. The teacher gives them a long list of instructions (the input sequence).

In almost every classroom, no matter how smart the students are, they all seem to stare intensely at the very first student in the row. They keep looking back at that first student, even when the conversation has moved on to the 50th word.

In AI terms, this is called an "Attention Sink." The model "sinks" its attention onto the first token (the first word or symbol) disproportionately.

For a long time, scientists thought this was a bug or a quirk caused by a special "Start" button (called the [BOS] token) that models use to know where a sentence begins. They thought, "Oh, the model is just looking at the Start button."

This paper says: "No, that's not it."

The authors discovered that even if you take away the "Start" button, the model still stares at the first word. Why? Because of a clever little trick the model teaches itself, which they call the P0 Sink Circuit.


The Mechanism: The "Spotlight Amplifier"

How does the model know which word is #1 without a special button? It uses a two-step process involving the Causal Mask (the rule that says "you can only look at your own word and the words that came before it").

1. The "Solo" vs. The "Crowd"

Imagine the first student (Position 0) and the second student (Position 1).

  • The Second Student: Can look at the First Student and themselves. Their view is a mix of two things.
  • The First Student: Can only look at themselves. They have no one else to look at.

Because of this rule, the First Student's "view" is pure and unmixed. The other students' views are a messy blend of many different people.
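This "solo view" falls straight out of the causal mask. A minimal NumPy sketch (toy random scores standing in for real query-key products, not values from the paper) shows that after masking and softmax, position 0's attention row is forced to be a one-hot on itself:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions (j > i), then softmax each row."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy scores for a 4-token sequence
rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(4, 4)))

print(weights[0])  # position 0 attends only to itself: [1. 0. 0. 0.]
print(weights[1])  # position 1 blends positions 0 and 1; later keys are zeroed out
```

Position 0's row has exactly one unmasked entry, so the softmax sends it to 1 no matter what the raw score was; every later row is a mixture.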

2. The Amplifier (The MLP Layer)

The model has a part of its brain called an MLP (a type of neural network layer) that acts like a volume knob or a spotlight amplifier.

  • The model notices that the First Student's "view" is unique and consistent (because it's the only one looking at just itself).
  • The model turns up the volume on this specific signal. It makes the First Student's "hidden state" (their internal representation) huge and bright.
  • Mathematically, this increases the ℓ₂ norm (a fancy way of saying "magnitude" or "loudness") of that first token.

The Result: Because the First Student is now so loud and bright, every other student in the class naturally turns their heads to look at them. The attention "sinks" there.
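To see why a "loud" token wins, remember that attention weights are exponential in the query-key dot products, so one amplified key dominates the softmax. A toy example with hand-picked vectors (assumed for illustration, not taken from the paper):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

query = np.array([1.0, 0.5, -0.5, 0.25])

# Keys for four tokens. Token 0's key points the same way as token 3's,
# but the "amplifier" has blown up its l2 norm by 10x.
base = np.array([0.6, 0.4, -0.2, 0.1])
keys = np.stack([
    10.0 * base,                        # token 0: amplified
    np.array([0.3, -0.6, 0.2, 0.5]),    # ordinary token
    np.array([-0.4, 0.2, 0.7, -0.1]),   # ordinary token
    base,                               # same direction as token 0, normal norm
])

weights = softmax(keys @ query)
print(weights[0])  # token 0 soaks up over 99% of the attention mass
```

Note that token 3 has the same direction as token 0 and still gets almost nothing: it is purely the larger norm, not the content, that creates the sink.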

Why Do They Do This? (The "Anchor" Theory)

You might ask, "Why does the model want to stare at the first word? Isn't that distracting?"

Think of the first word as a heavy anchor dropped in the ocean.

  • As the model processes a long sentence, the "currents" of information can get messy.
  • By keeping a super-strong, fixed anchor at the very beginning, the model stabilizes the whole system. It gives the model a consistent reference point so it doesn't get lost in the middle of a long story.
  • It's like a ship captain keeping one eye on the lighthouse at the harbor entrance to make sure they haven't drifted off course, even while navigating a stormy sea.

The Training Journey: How the Model Learns This

The authors didn't just look at finished models; they watched a model being trained from scratch (like watching a baby learn to walk). They found the "Attention Sink" happens in three stages:

  1. The Wandering Phase (Early Training):
    At first, the model is confused. It tries to focus on the first word, but the signal is weak. It might even try to focus on the second word or the third word. It's like a baby trying to stand up but wobbling around.

  2. The Transition Phase:
    The model realizes, "Hey, focusing on the second word is okay, but it's not stable." It starts to shift its focus back toward the beginning.

  3. The Stable Phase (Maturity):
    Eventually, the model builds that "Spotlight Amplifier" circuit in its very first two layers. It locks onto the first word with laser focus. Once built, the circuit persists for the rest of training, becoming a permanent feature of the trained model.
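If you wanted to watch these phases yourself, one simple diagnostic (a hypothetical metric for illustration, not necessarily the paper's exact measurement) is the fraction of post-softmax attention mass landing on position 0, averaged over heads and query positions:

```python
import numpy as np

def sink_score(attn):
    """Average attention mass on position 0.

    attn: array of shape (heads, seq, seq) of post-softmax attention
    weights (each row sums to 1). Averages column 0 over all heads
    and query positions; a score near 1.0 means a strong sink.
    """
    return float(attn[:, :, 0].mean())

# A toy 1-head, 3-token attention map where every query leans on token 0.
attn = np.array([[[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.8, 0.1, 0.1]]])
print(sink_score(attn))  # → 0.9
```

Logged over training checkpoints, a score like this would be flat and noisy in the wandering phase, drift upward during the transition, and saturate once the circuit locks in.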

The "Start" Button Myth Busted

The paper proves that the [BOS] token (the special "Start" symbol) is just a helpful crutch, not the cause.

  • With the [BOS] token: The model uses the crutch to find the start easily.
  • Without the [BOS] token: The model is forced to build its own internal "Spotlight Amplifier" to find the start. It's a bit harder at first, but once it builds the circuit, it works just as well.

Why This Matters

  1. It's a Feature, Not a Bug: We used to think attention sinks were a mistake. Now we know they are a clever, built-in safety mechanism that helps models handle long texts.
  2. Training Monitor: The authors suggest that by watching when this "Spotlight Amplifier" circuit forms during training, we can tell if a model is "growing up" correctly. If the circuit forms early and stays in the first two layers, the model is likely converging (learning) well.
  3. Future Designs: Understanding this helps engineers build better models. Maybe we can design models that don't need to stare at the first word so much, or maybe we can use this "anchor" trick to make models better at reading very long documents.

Summary in One Sentence

Large Language Models naturally learn to turn up the volume on the very first word of a sentence to create a stable "anchor" for their attention, a clever trick they invent themselves to keep from getting lost, regardless of whether they have a special "Start" button or not.