Imagine a Multimodal Large Language Model (MLLM) as a super-smart translator who is trying to describe a picture to a friend. The picture is taken by a camera (the Vision Encoder), chopped into tiny square tiles (patches), and then handed to the translator (the LLM) to turn into words.
For a long time, we assumed the translator had to look at every single tile and do a lot of heavy mental gymnastics to figure out what the picture meant.
This paper, "What Do Visual Tokens Really Encode?", peeks behind the curtain and discovers that the translator is doing a lot of unnecessary work. In fact, a large chunk of the picture tiles it receives are junk, and the translator's brain is wired in a way that makes some of its own thinking steps redundant.
Here is the breakdown using simple analogies:
1. The Three Types of Picture Tiles
When the camera sends the picture tiles to the translator, they aren't all equal. The researchers found they fall into three distinct groups:
- The "Dead" Tiles (The Static Noise): Imagine you are looking at a photo of a cat, but 30% of the tiles are just blank gray squares or random static. They don't show the cat, the background, or anything useful. They are just "dead weight."
- The Discovery: The model ignores these. If you throw them away, the translator actually works better because it's not distracted by the noise.
- The "Sink" Tiles (The Attention Anchors): These are like the "Start" button on a remote control. They don't contain any picture information (like "cat" or "tree"), but the translator's brain is trained to look at them to keep its focus stable. They act like a structural glue.
- The Discovery: These are also useless for understanding the image. You can remove them, and the translator just shifts its attention to the "Start" button in the text prompt instead. No harm done.
- The "Alive" Tiles (The Real Info): These are the only tiles that actually matter. They contain the specific details: the cat's ears, the red ball, the text on a sign.
- The Discovery: Surprisingly, only about 60% of the tiles are "Alive." The other 40% are just dead or sink tiles.
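The three-way split above can be pictured with a toy classifier. This is only a sketch: the thresholds and the "attention received vs. feature norm" heuristic are illustrative stand-ins, not the paper's actual method.

```python
import numpy as np

# Hypothetical heuristic: a tile that gets lots of attention but carries
# little image content is a "sink"; one with neither is "dead"; the rest
# are "alive". Threshold values here are made up for illustration.
def classify_tokens(attn_received, feature_norm,
                    sink_thresh=0.05, info_thresh=0.5):
    labels = []
    for a, n in zip(attn_received, feature_norm):
        if a >= sink_thresh and n < info_thresh:
            labels.append("sink")   # heavily attended, but low content
        elif a < sink_thresh and n < info_thresh:
            labels.append("dead")   # neither attended nor informative
        else:
            labels.append("alive")  # carries real image content
    return labels

attn = np.array([0.10, 0.01, 0.02, 0.08])  # toy attention scores
norm = np.array([0.20, 0.10, 0.90, 0.80])  # toy feature norms
result = classify_tokens(attn, norm)
print(result)  # ['sink', 'dead', 'alive', 'alive']
```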
2. The "Pre-Translated" Secret
Here is the most surprising part. We used to think the translator had to take these "Alive" tiles and do a lot of work to turn them into concepts.
- The Old Idea: The tiles arrive as raw, uninterpreted visual signals. The translator's brain (the LLM) has to process them through many layers of thinking to figure out, "Oh, that's a red ball."
- The New Discovery: The "Alive" tiles arrive already translated. They are like a pre-packaged lunch. By the time they reach the translator, they already smell like "red ball" or "text." They are so well-aligned with language that the translator doesn't need to do much heavy lifting to understand them.
3. The "Middle-Seat" Shortcut
Because the "Alive" tiles arrive so well-prepared, the translator doesn't need to use its whole brain to process them.
- The Analogy: Imagine a student taking a test. Usually, they read the question, think about it in their head (shallow layers), and then write the answer.
- The Finding: For these picture tiles, the "thinking" part in the early layers of the brain is actually useless. It's like trying to solve a math problem by staring at the paper for 10 seconds before writing anything down. It just wastes time.
- The Solution: The researchers found that if you skip the first few "thinking layers" and inject the picture tiles directly into the middle layers of the translator's brain, it works just as well (and sometimes better). It's like handing the answer key directly to the student's middle brain, skipping the confusion.
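The mid-layer shortcut can be sketched with a toy model. Everything here is a stand-in (random linear maps instead of real transformer blocks, an arbitrary halfway injection point), meant only to show the wiring: text runs through the early layers alone, and the picture tiles are spliced in partway through.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 16, 8  # hidden size and number of toy "layers"

# Each toy layer is a small random linear map with a residual connection,
# a crude stand-in for a transformer block.
layers = [rng.normal(scale=0.05, size=(D, D)) for _ in range(L)]

def run_layers(h, layer_list):
    for W in layer_list:
        h = h + h @ W  # residual update
    return h

text = rng.normal(size=(4, D))   # 4 text tokens
image = rng.normal(size=(6, D))  # 6 "alive" picture tokens

k = L // 2
text_mid = run_layers(text, layers[:k])        # early layers: text only
out = run_layers(np.vstack([text_mid, image]), # image tokens injected here
                 layers[k:])                   # late layers: everything

print(out.shape)  # (10, 16)
```

The design point is that the image tokens skip `layers[:k]` entirely, yet the late layers still see them alongside the text.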
4. The "Color Confusion" Trap
The paper also found a funny quirk in how the model sees colors.
- The Scenario: If you show the model a black letter "A" on a bright green background, the model often says the letter is green.
- The Reason: The model is lazy. Instead of looking at the specific letter, it looks at the "vibe" of the whole patch. It sees the green background and assumes the whole thing is green. It's like judging a person by the color of their shirt rather than by anything they say or do.
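You can see why the background wins with a toy patch. Averaging the patch's pixels (a rough stand-in for patch-level pooling; the real model's features are more complex) lets the green background drown out the black letter:

```python
import numpy as np

# A 16x16 RGB patch: bright green background with a thin black
# vertical stroke standing in for the letter.
patch = np.zeros((16, 16, 3))
patch[..., 1] = 1.0         # green background
patch[4:12, 7:9] = 0.0      # black "letter" stroke (16 of 256 pixels)

mean_rgb = patch.reshape(-1, 3).mean(axis=0)
dominant = ["red", "green", "blue"][int(mean_rgb.argmax())]
print(dominant)  # green
```

Only 16 of 256 pixels belong to the letter, so the averaged color is almost pure green, and a model leaning on that average will call the letter green.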
Why Does This Matter? (The "So What?")
This research is a game-changer for making AI faster and cheaper:
- Pruning the Junk: Since 40% of the picture tiles are useless (Dead/Sink), we can just delete them before the model even starts thinking. This makes the model run faster and use less memory.
- Skipping the Boring Stuff: Since the early layers of the brain don't help much with pictures, we can tell the model to skip them. This is like telling a worker, "Don't fill out the paperwork; just go straight to the assembly line."
- Better Design: Future AI models can be built to inject pictures directly into the middle of the brain, making them more efficient and less prone to hallucinations (making things up).
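The pruning payoff is simple arithmetic. A minimal sketch, assuming the paper's rough 60/40 split and hypothetical labels (how tokens get labeled is a separate problem, sketched earlier):

```python
# Drop "dead" and "sink" tiles before the language model ever sees them.
# The 60/40 split below mirrors the paper's rough numbers; the labels
# themselves are made up for this demo.
labels = ["alive"] * 60 + ["dead"] * 30 + ["sink"] * 10  # 100 tiles
tokens = list(range(len(labels)))  # stand-ins for token embeddings

kept = [t for t, lab in zip(tokens, labels) if lab == "alive"]
saved = 1 - len(kept) / len(tokens)
print(f"kept {len(kept)} of {len(tokens)} tokens, {saved:.0%} less work")
# kept 60 of 100 tokens, 40% less work
```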
In a nutshell: The paper reveals that current AI models are carrying around a lot of heavy, useless luggage (dead tokens) and walking in circles (redundant processing) when they could just take a shortcut (mid-layer injection) and leave the junk behind.