Vision Transformers Need More Than Registers

This paper identifies that artifacts in Vision Transformers stem from a "lazy aggregation" behavior, where the model relies on irrelevant background patches as shortcuts for global semantics. It proposes a solution that selectively integrates patch features into the CLS token, mitigating these artifacts and improving performance across diverse supervision paradigms.

Cheng Shi, Yizhou Yu, Sibei Yang

Published 2026-02-27

Imagine you are trying to teach a student (a Vision Transformer, or ViT) how to recognize a cat in a photo.

The Problem: The "Lazy Student"

In the past, researchers thought the student was just being smart. They noticed that when the student looked at a picture of a cat, it didn't just look at the cat. It also looked at the grass, the sky, and the fence in the background.

Why? Because the student found a shortcut.

Instead of doing the hard work of figuring out exactly where the cat is, the student thought: "Hey, if I just look at the whole picture, I can guess it's a cat because cats usually appear in these kinds of backyards."

The student became lazy. It stopped paying attention to the specific details (the cat's ears, tail, or whiskers) and started relying on the background noise to get the right answer.

  • The Result: The student gets an "A" on the multiple-choice test (Image Classification) because it guesses right. But if you ask it to draw a box around the cat (Object Detection) or color in just the cat (Segmentation), it fails miserably. It draws a box around the whole backyard because that's what it's been "looking at."

This happens whether the student is taught by a strict teacher (Supervised Learning), a textbook (Text-Supervised), or just by looking at pictures alone (Self-Supervised). The "laziness" is a fundamental flaw in how these models are built.

The Old Solution: The "Note-Taker"

Recently, another group of researchers said, "The problem is that the student gets distracted by the background noise. Let's give the student a special 'Note-Taker' token (called a Register) to hold the important global information, so the student doesn't have to look at the background."

Think of this like giving the student a sticky note to write the main idea on, hoping it stops them from staring at the messy desk.
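In code, the register idea is surprisingly small: the transformer's input sequence simply grows by a few extra learnable tokens that correspond to no image patch, giving attention some "scratch space." Here is a minimal numpy sketch of that sequence layout; the sizes and names are illustrative, not the actual implementation from the registers paper.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, dim = 196, 64    # 14x14 patch grid, toy embedding size
num_registers = 4             # the extra "note-taker" tokens

patch_tokens = rng.standard_normal((num_patches, dim))
cls_token = rng.standard_normal((1, dim))
register_tokens = rng.standard_normal((num_registers, dim))  # learnable in a real model

# Plain ViT input: [CLS] + patches.
tokens_plain = np.concatenate([cls_token, patch_tokens])

# Register ViT input: [CLS] + registers + patches. The registers carry no
# image content; they exist only to absorb global information during attention.
tokens_with_registers = np.concatenate([cls_token, register_tokens, patch_tokens])

print(tokens_plain.shape)            # (197, 64)
print(tokens_with_registers.shape)   # (201, 64)
```

The point of the sketch: registers change *where* global information can be stored, but not *which* patches the model chooses to aggregate from.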

The New Discovery: "Registers" Aren't Enough

The authors of this paper (Cheng Shi, Yizhou Yu, and Sibei Yang) dug deeper. They realized that just adding a sticky note doesn't fix the root problem. The student is still choosing to be lazy. The sticky note just moves the mess from the desk to the sticky note itself.

They found that the student's laziness comes from two things:

  1. Vague Instructions: The teacher only says "This is a cat" (Image-level label) but doesn't point to the cat.
  2. Super-Connectivity: The student can look at every part of the picture at once (Global Attention). This makes it too easy to mix the cat with the background.
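The second cause, "super-connectivity," can be seen in how the CLS token's attention works: its query scores every patch in the image, and nothing in the math distinguishes foreground from background. A minimal single-query attention sketch (toy sizes, numpy instead of a real ViT):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 16, 8
patches = rng.standard_normal((num_patches, dim))
cls_query = rng.standard_normal(dim)

# Global attention: the CLS query scores *every* patch, cat and grass alike.
scores = patches @ cls_query / np.sqrt(dim)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Nothing in this formula prevents background patches from receiving large
# weights, so the "global idea" can be dominated by shortcut patches.
cls_update = weights @ patches

print(weights.shape, cls_update.shape)  # (16,) (8,)
```

Because the softmax mixes all patches into one summary, a model that finds the background predictive has no structural reason to ignore it.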

The Solution: "LazyStrike" (LaSt-ViT)

The authors propose a new method called LazyStrike. Instead of just adding a sticky note, they change how the student studies.

Imagine the student is now forced to take a frequency test.

  • The Background: The grass, sky, and fence are chaotic. They change a lot from patch to patch. They are "noisy."
  • The Cat: The cat's fur, eyes, and shape are consistent. They are "stable."

LazyStrike works like this:

  1. It asks the student to look at the picture and ask: "Which parts of this image are stable and consistent?"
  2. It tells the student: "Ignore the noisy, changing background. Only pay attention to the stable, consistent parts (the foreground)."
  3. It forces the student to build its "Global Idea" (the CLS token) only from those stable parts.
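The three steps above amount to scoring patches for "stability," masking out the unstable ones, and building the CLS summary only from what survives. The sketch below uses agreement with the mean patch feature as a stand-in stability score; this is my own illustration of the selective-aggregation idea, not the paper's actual criterion.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 16, 8
patches = rng.standard_normal((num_patches, dim))

# Step 1 — a stand-in "stability" score: cosine similarity of each patch to
# the mean feature. Stable foreground patches are assumed to agree with each
# other; noisy background patches are not.
mean_feat = patches.mean(axis=0)
norm = np.linalg.norm
stability = patches @ mean_feat / (norm(patches, axis=1) * norm(mean_feat))

# Step 2 — keep only the most stable half of the patches.
keep = stability >= np.median(stability)

# Step 3 — build the global CLS summary from those patches alone,
# instead of lazily averaging everything, background included.
cls_selective = patches[keep].mean(axis=0)
cls_lazy = patches.mean(axis=0)

print(keep.sum(), cls_selective.shape)  # 8 (8,)
```

Swapping the stability criterion is the interesting design choice: the aggregation machinery stays the same, only the patch-selection rule changes.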

The Analogy:
Imagine you are trying to identify a song by listening to a noisy party.

  • Old ViT: Listens to the whole room (music, clinking glasses, shouting, laughter) and guesses the song based on the general vibe. It's often right about the genre, but wrong about the specific lyrics.
  • Register Method: Tries to write down the main melody on a piece of paper while ignoring the noise.
  • LazyStrike: Tells the student: "Stop listening to the clinking glasses and shouting. Focus only on the steady beat of the drums and the singer's voice. That's where the real song is."

The Results

When they applied LazyStrike:

  • The student stopped looking at the background.
  • It started drawing perfect boxes around the cat.
  • It could separate the cat from the grass perfectly.
  • It got better at everything, whether it was learning from labels, text, or just looking at pictures.

The Takeaway

The paper concludes that Vision Transformers don't just need a "Register" (a place to store info); they need to be forced to stop being lazy. By teaching them to filter out the noisy background and focus on the stable, important parts of an image, we can fix their "artifacts" (mistakes) and make them true experts at understanding what they see.

In short: Don't just give the student a better notebook; teach them to ignore the distractions and focus on the real subject.
