Imagine you have a brilliant, world-class translator (a Large Language Model, or LLM) who speaks perfect human language but has never seen a picture in their life. Now, you want to show them a photo of a "red brick building."
To do this, you hire a small, simple adapter (a vision encoder) to take the photo and translate the visual pixels into a secret code that the translator can understand. The big mystery has always been: What does that secret code actually look like inside the translator's brain?
For a long time, researchers thought these visual codes were like alien gibberish—completely unintelligible to the language model. They tried to decode them using standard tools, but the results were messy, like trying to read a book by looking at the ink stains on the page rather than the words.
This paper introduces a new tool called LATENTLENS that changes the game. Here is the story of what they found, explained simply.
1. The Old Way: Trying to Match Single Letters
Imagine you have a secret code for a picture of a "clock tower."
- The Old Method (LogitLens/EmbeddingLens): Researchers tried to match this code against a giant dictionary of single words (like "clock," "tower," "time").
- The Problem: A single word is far too coarse for a whole scene. It was like trying to summarize a movie with one letter, like "t." Sometimes that "t" stood for "tower," but just as often it stood for "the" or "top." The guesses were low-resolution and blurry, so researchers concluded that visual tokens were mostly "uninterpretable."
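The spirit of the old approach can be sketched in a few lines. This is a toy illustration, not the paper's code: the function name `logit_lens`, the tiny vocabulary, and the random vectors are all made up for the demo. The real LogitLens projects a hidden state through the model's unembedding matrix to score every vocabulary token; here we fake that with small random matrices.

```python
import numpy as np

def logit_lens(hidden_state, unembedding, vocab):
    """Decode a hidden state into its single best-matching vocabulary token.

    hidden_state: (d,) vector taken from some layer of the model
    unembedding:  (V, d) matrix that turns hidden states into token scores
    vocab:        list of V token strings
    """
    logits = unembedding @ hidden_state          # one score per vocab token
    return vocab[int(np.argmax(logits))]         # keep only the top guess

# Toy demo: fabricate an unembedding matrix and a "visual" hidden state
# that points roughly in the direction of the "tower" token.
rng = np.random.default_rng(0)
d = 64
vocab = ["clock", "tower", "time", "the", "top"]
unembedding = rng.normal(size=(len(vocab), d))
visual_state = unembedding[1] + 0.1 * rng.normal(size=d)  # near "tower"
print(logit_lens(visual_state, unembedding, vocab))       # prints the top token
```

Even when this works, all nuance is lost: the lens is forced to compress an entire scene into one vocabulary entry.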
2. The New Way (LATENTLENS): Matching Whole Sentences
The authors realized the mistake: Visual concepts aren't single words; they are full scenes.
- The Analogy: Instead of matching the secret code against a dictionary of single words, LATENTLENS matches it against a library of full sentences the model has already read.
- How it works: When the model sees a "clock tower," LATENTLENS asks: "Which sentence in our library does this secret code look most like?"
- The Result: Instead of getting a confusing single letter, the model says: *"Ah, this looks exactly like the sentence 'a large stone tower with gold clocks'."*
Suddenly, the "alien gibberish" becomes a clear, descriptive sentence. The visual token is no longer a mystery; it's a vivid description.
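The core move, retrieval against sequences rather than single tokens, can be sketched as nearest-neighbor search over a bank of cached sentence representations. This is a simplified stand-in for the paper's actual pipeline: the helper `nearest_sequence`, the cosine-similarity choice, and the random "representations" are illustrative assumptions, where the real method would use hidden states the model itself produced while reading text.

```python
import numpy as np

def nearest_sequence(hidden_state, bank_vectors, bank_texts):
    """Return the cached text whose representation is closest (by cosine
    similarity) to the given hidden state."""
    h = hidden_state / np.linalg.norm(hidden_state)
    B = bank_vectors / np.linalg.norm(bank_vectors, axis=1, keepdims=True)
    return bank_texts[int(np.argmax(B @ h))]     # best-matching whole sentence

# Toy demo: a tiny "library" of sentences with fabricated vector
# representations, plus a visual token that resembles the first entry.
rng = np.random.default_rng(1)
d = 64
bank_texts = [
    "a large stone tower with gold clocks",
    "a red brick building with many windows",
    "a bowl of fresh fruit on a table",
]
bank_vectors = rng.normal(size=(len(bank_texts), d))
visual_token = bank_vectors[0] + 0.1 * rng.normal(size=d)
print(nearest_sequence(visual_token, bank_vectors, bank_texts))
```

The output is a full descriptive sentence rather than a lone word, which is exactly why the decoded visual tokens stop looking like gibberish.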
3. The Big Surprise: "The Middle-Child Leap"
The researchers discovered something weird and wonderful about when this happens.
- The Expectation: You'd think the visual code enters the model at the "front door" (layer 1) and works its way up gradually, the way a freshly embedded word would.
- The Reality (The Mid-Layer Leap): The visual code enters the model, but it immediately "jumps" to the middle of the model's brain (around layers 8–16).
- The Metaphor: Imagine a tourist (the visual token) entering a city (the LLM). Instead of wandering the streets (the early layers), the tourist is instantly teleported to the city center where the most interesting conversations are happening.
- Why? The visual code is already so "smart" and "contextualized" by the time it enters that it doesn't need to learn the basics (like what a noun is). It skips straight to the part of the brain that understands complex ideas and stories.
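One simple way to see a "leap" like this is to ask, layer by layer, which layer's text activations a visual token most resembles, and find where the match peaks. The sketch below is a hypothetical measurement, not the paper's protocol: the 24-layer setup, the `best_matching_layer` helper, and the fabricated activations are all assumptions for the demo.

```python
import numpy as np

def best_matching_layer(visual_token, text_states_per_layer):
    """For each layer, compute the best cosine similarity between a visual
    token and that layer's text hidden states; return the peak layer index."""
    v = visual_token / np.linalg.norm(visual_token)
    sims = []
    for states in text_states_per_layer:          # states: (n_tokens, d)
        S = states / np.linalg.norm(states, axis=1, keepdims=True)
        sims.append(float(np.max(S @ v)))         # best match at this layer
    return int(np.argmax(sims))

# Toy demo: fabricate 24 layers of text activations and a visual token
# engineered to resemble a mid-layer (layer 12) activation.
rng = np.random.default_rng(2)
d, n_layers = 64, 24
layers = [rng.normal(size=(10, d)) for _ in range(n_layers)]
visual_token = layers[12][0] + 0.1 * rng.normal(size=d)
print(best_matching_layer(visual_token, layers))
```

In this toy setup the peak lands mid-stack, mirroring the finding that visual tokens look like middle-layer text representations from the moment they enter.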
4. Why This Matters
- The "Universal Engine" Theory: This is strong evidence that Large Language Models are incredibly flexible. They aren't just text processors; they are universal understanding machines. They can take a picture, turn it into something sentence-like, and understand it much as they understand a book.
- Better AI: By understanding how these models see, we can fix their mistakes. If a model "hallucinates" (makes things up), we can now look inside and see if the visual code was actually interpreted correctly.
- The "Frozen" Miracle: The most amazing part is that they didn't have to retrain the giant language model. They just added a tiny, simple connector, and the model instantly understood pictures. It's like giving a person who has never seen a photo a pair of glasses, and suddenly they can describe the photo perfectly.
Summary
LATENTLENS is like a high-definition magnifying glass. It showed us that when we show a picture to a language model, the model doesn't see "noise." It sees rich, detailed sentences describing the image. The model is essentially saying, "I see a building with many windows," and we finally have the tool to hear it clearly.