Original authors: Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

Published 2026-06-05

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Problem: The "Translator" Bottleneck

Imagine a Large Multimodal Model (LMM) as a brilliant translator who is an expert in words but has never learned to "think" in pictures.

When you show this translator a complex image (like a jigsaw puzzle or a scene with many objects) and ask a question, the translator is forced to do all the heavy lifting using only words. They have to look at the picture, describe every single detail in text, and then solve the problem based on that description.

The Issue: Forcing a visual problem into words is like trying to convey a 3D shape with only a 2D sketch. Something is always lost in translation.
The Old Fix: Previous researchers tried to help by forcing the translator to write down "intermediate steps" (like drawing a box around an object or writing "I see a dog here"). But someone has to supply those boxes and notes as training examples, which is like hiring a human to draw the boxes for the AI every time. It's expensive and slow, and it forces the AI to look at the picture in a specific, rigid way that might not be the best way to solve the problem.

The Solution: The "Silent Interns" (LIVR)

The authors propose a new method called LIVR (Latent Implicit Visual Reasoning). Instead of forcing the AI to write down its thoughts, they give it a team of silent interns.

Here is how it works:

The Silent Interns (Latent Tokens): The AI is given a set of special, invisible tokens (think of them as empty sticky notes). These aren't words; they are just placeholders for information.
The "No-Phone" Rule (Visual Bottlenecking): This is the key idea. During the first stage of training, the researchers impose a rule: the AI cannot look at the original picture directly when it is writing its final answer.
- Imagine the AI is taking a test. The picture is on the wall, but the AI is wearing blinders.
- The only way the AI can "see" the picture is by asking its Silent Interns for a summary.
- The Interns must look at the picture, figure out what is important, and write that information onto their sticky notes.
- The AI then reads the sticky notes to answer the question.
Learning by Necessity: Because the AI cannot cheat by looking at the picture directly, it is forced to teach the Silent Interns how to be good at summarizing the visual information. The Interns learn to grab the right details (like "count the cows" or "find the matching shape") without anyone telling them exactly what to look for. In a second training stage the blinders come off, and the AI learns to use the picture and the sticky notes together.

Why This is Better

No Expensive Labels: You don't need humans to draw boxes or write step-by-step guides. The AI figures out the best way to "think" visually on its own.
Flexible Thinking: Unlike the old methods that forced the AI to look for specific things (like bounding boxes), these Silent Interns can learn any kind of visual pattern that helps solve the puzzle. If the task is abstract, they learn abstract patterns. If it's about counting, they learn to count.
Better Results: The paper reports that this method beats "Direct Supervised Fine-Tuning" (where the AI just learns from question-and-answer pairs) on nine difficult visual tasks. It also matches or beats more complex methods that rely on expensive intermediate steps.

The Analogy in Action

Imagine you are trying to solve a Jigsaw Puzzle.

Old Way (Text-Centric): You have to describe every piece in words ("This piece has a blue sky and a tree branch") before you can try to fit it. It's slow and you might miss the big picture.
Old "Helper" Way: You hire a human to draw a map of where every piece goes. It works, but it takes forever to draw the map for every new puzzle.
LIVR Way: You give the puzzle to a team of Silent Interns. You tell them, "You can't talk to me until you have figured out the whole picture." The Interns look at the pieces, organize them in their heads, and then hand you a compact summary of what they found. You didn't tell them how to organize the pieces; they figured it out themselves because they had to.

The Bottom Line

The paper claims that by forcing the AI to pass visual information through these "Silent Interns" (latent tokens) via a "No-Phone" rule (bottlenecking), the AI learns to reason about images more accurately than it does with ordinary fine-tuning. It does this without needing expensive human labels or rigid instructions, making it a more flexible way for computers to understand images.

Technical Summary: Latent Implicit Visual Reasoning (LIVR)

1. Problem Statement

Large Multimodal Models (LMMs) have achieved significant progress but remain fundamentally text-centric. Their architecture typically projects visual inputs into a language model that reasons exclusively through text tokens. This creates a language bias, forcing the model to translate complex visual abstractions (e.g., spatial relationships, object counts, or pattern recognition) into linguistic descriptions before reasoning. This translation step often results in a loss of expressivity, making LMMs struggle with vision-centric tasks that require fine-grained visual reasoning.

Existing attempts to improve visual reasoning often rely on explicit intermediate supervision, such as training models to generate bounding boxes, image crops, depth maps, or "helper images" as intermediate steps. The authors identify three critical limitations with these approaches:

Annotation Cost: They require task-specific intermediate labels, which are expensive to collect and do not scale well.
Restrictive Priors: They impose human assumptions about what constitutes "useful" visual reasoning (e.g., assuming the answer lies in a specific bounding box), which may not be optimal for the model.
Generalization Issues: For abstract or complex visual tasks, it is often unclear even to humans what intermediate representation should be supervised, limiting the scalability of these methods across diverse tasks.

2. Methodology: Latent Implicit Visual Reasoning (LIVR)

The paper proposes Latent Implicit Visual Reasoning (LIVR), a task-agnostic mechanism that enables LMMs to discover and utilize latent visual reasoning tokens without explicit intermediate supervision.

Core Components

Latent Tokens: The authors add $K$ new special tokens ($L = \{l_1, l_2, \dots, l_K\}$) to the model's vocabulary. Their embedding rows are randomly initialized and trainable. Unlike standard tokens, the model is not trained to generate these tokens as output; rather, they are appended to the input prompt to serve as a dedicated space for visual abstraction.
Visual Bottlenecking: To force the model to utilize these latent tokens, the authors employ a novel bottleneck attention masking strategy during training:
- Stage 1 (Bottleneck Training): The attention mask is modified so that answer tokens and prompt tokens cannot attend directly to the visual input tokens ($I$). They can only attend to the prompt and the latent tokens ($L$). This forces all visual information relevant to the answer to pass through the latent tokens, effectively making them the sole conduit for visual reasoning.
- Stage 2 (Integration): After the latent tokens have learned to encode useful visual information, the model reverts to a standard attention mask where answer tokens can attend to both the original image tokens and the enriched latent tokens. This stage trains the model to jointly use the raw visual input and the learned latent abstractions.
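
The two masks can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the sequence layout (image, prompt, latents, answer), the causal base mask, and the names `build_segments` and `attention_mask` are assumptions made for the example. The only rule taken from the description above is that, in Stage 1, prompt and answer positions may not attend to image positions.

```python
from typing import List


def build_segments(n_image: int, n_prompt: int, n_latent: int, n_answer: int) -> List[str]:
    """Label each position by segment (assumed order: image, prompt, latents, answer)."""
    return (["image"] * n_image + ["prompt"] * n_prompt
            + ["latent"] * n_latent + ["answer"] * n_answer)


def attention_mask(segments: List[str], bottleneck: bool) -> List[List[bool]]:
    """Return mask[q][k] == True where query position q may attend to key position k."""
    n = len(segments)
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            allowed = k <= q  # assumed causal (decoder-only) base mask
            # Stage 1 rule: prompt and answer queries are blocked from image keys.
            if bottleneck and segments[q] in ("prompt", "answer") and segments[k] == "image":
                allowed = False
            row.append(allowed)
        mask.append(row)
    return mask


# Hypothetical toy sizes, chosen only to keep the printed masks small.
segments = build_segments(n_image=3, n_prompt=2, n_latent=2, n_answer=1)
stage1 = attention_mask(segments, bottleneck=True)   # bottleneck training
stage2 = attention_mask(segments, bottleneck=False)  # standard mask

for name, mask in (("stage 1", stage1), ("stage 2", stage2)):
    print(name)
    for seg, row in zip(segments, mask):
        print(f"  {seg:>6} " + "".join("x" if ok else "." for ok in row))
```

In the Stage 1 mask, the answer row has no access to the image columns, while the latent rows still do. The only path from the image to the answer therefore runs through the latent positions, which is the bottleneck the method relies on.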

Training Objective

The model is trained end-to-end using a standard negative log-likelihood (NLL) objective on the answer tokens only. No additional loss functions are applied to the latent tokens themselves; they are optimized implicitly to minimize the final task error.
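
A minimal sketch of an answer-only loss, assuming the model has already produced a probability for the correct token at each position. The function name `answer_only_nll`, the per-position inputs, and the choice to average rather than sum are assumptions for illustration; the source states only that the NLL is computed on answer tokens.

```python
import math
from typing import Sequence


def answer_only_nll(correct_token_probs: Sequence[float], is_answer: Sequence[bool]) -> float:
    """Mean negative log-likelihood over answer positions only.

    Image, prompt, and latent positions are skipped, so they contribute no
    direct loss term; they are shaped only through their effect on the answer.
    """
    losses = [-math.log(p) for p, keep in zip(correct_token_probs, is_answer) if keep]
    if not losses:
        raise ValueError("at least one answer position is required")
    return sum(losses) / len(losses)


# Hypothetical toy sequence: two non-answer positions followed by two answer positions.
probs = [0.9, 0.1, 0.5, 0.25]
answer_flags = [False, False, True, True]
print(answer_only_nll(probs, answer_flags))
```

Changing the probabilities at the non-answer positions leaves the loss unchanged, which mirrors the point that the latent tokens receive no loss of their own.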

3. Key Contributions

A New Paradigm for Visual Reasoning: LIVR introduces a method where models implicitly learn useful visual representations through latent tokens, eliminating the need for costly intermediate supervision (e.g., bounding boxes, helper images, or Chain-of-Thought annotations).
Superior Performance in Controlled Settings: In data-matched experiments across nine perception-heavy tasks (e.g., Jigsaw, Functional Correspondence, Visual Similarity) and three different LMM backbones (Qwen2.5-VL, Qwen3-VL, LLaVA-OneVision), LIVR consistently outperforms Direct Supervised Fine-Tuning (SFT).
Competitiveness with Explicit Methods: On broader reasoning benchmarks, LIVR matches or exceeds the performance of prior methods that rely on text-based reasoning (e.g., CoT) or explicit visual intermediates, despite requiring no such intermediate annotations.
Task-Agnostic Scalability: The method is shown to generalize effectively to multi-task training scenarios without requiring task-specific adjustments to the supervision signal.

4. Experimental Results

The paper evaluates LIVR on the BLINK benchmark and other visual reasoning datasets.

Single-Task Fine-Tuning:
- On Qwen2.5-VL-3B, LIVR improved the mean accuracy across nine tasks by 6.24% over Direct SFT. Notable gains were observed on Jigsaw (+12.00%) and Functional Correspondence (+13.02%).
- On Qwen3-VL-4B, LIVR improved the mean accuracy by 3.43%.
- On LLaVA-OneVision-1.5-4B, LIVR improved the mean accuracy by 5.60%.
Multi-Task Fine-Tuning:
- When trained on a combined dataset of six tasks using Qwen3-VL-4B, LIVR outperformed Direct SFT on all individual tasks, achieving a 2.77% improvement in mean accuracy.
Comparison with Prior Work:
- Visual Spatial Planning: LIVR achieved 66.00% accuracy on the VSP task (Qwen2.5-VL-3B), significantly outperforming Mirage (46.00%), which relies on explicit helper images.
- Spatial Reasoning Generalization: In the comparison covering SAT Val, BLINK-3, and RoboSpatial, LIVR-3B reached 85.6% on SAT Val, outperforming ViGoRL (62.9%) and other baselines that use text-CoT or RL.
- Broader Benchmarks: On MMVP, V*, and BLINK-5, LIVR-7B was competitive with or superior to methods like LVR, PixelReasoner, and Vision-R1, despite lacking their explicit intermediate supervision.

Ablation Studies

Necessity of Bottlenecking: A variant with latent tokens but without the Stage 1 bottleneck ("Latents only") failed to improve performance over Direct SFT, and the model learned to ignore the tokens.
Necessity of Latent Tokens: A variant with the bottleneck mask but no new tokens ("Mask only") also underperformed, suggesting that existing text tokens cannot easily be repurposed to form abstract visual representations.
Design Choices: The best configuration placed the latent tokens after the prompt, used unshared embeddings, set $K=16$ tokens, and trained for 4 epochs in Stage 1 followed by 6 epochs in Stage 2.
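
The reported schedule can be written down as a small configuration object. `LivrSchedule` and its field names are hypothetical; only the values (16 latent tokens, 4 Stage 1 epochs, 6 Stage 2 epochs) come from the ablation above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LivrSchedule:
    """Hypothetical container for the two-stage schedule described in the ablation."""
    num_latents: int = 16    # K latent tokens
    stage1_epochs: int = 4   # bottleneck mask active
    stage2_epochs: int = 6   # standard mask

    @property
    def total_epochs(self) -> int:
        return self.stage1_epochs + self.stage2_epochs

    def use_bottleneck(self, epoch: int) -> bool:
        """Return True if the zero-based epoch falls in Stage 1."""
        if not 0 <= epoch < self.total_epochs:
            raise ValueError(f"epoch must be in [0, {self.total_epochs})")
        return epoch < self.stage1_epochs


schedule = LivrSchedule()
for epoch in range(schedule.total_epochs):
    stage = "stage 1 (bottleneck)" if schedule.use_bottleneck(epoch) else "stage 2 (standard)"
    print(epoch, stage)
```

A training loop would consult `use_bottleneck` once per epoch to decide which attention mask to build; everything else about the loop stays the same across the two stages.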

5. Significance and Claims

The authors claim that LIVR offers a simple, effective, and task-agnostic approach to enhancing visual reasoning in LMMs. By decoupling internal computation from external tokens, LIVR allows the model to refine its internal state solely to optimize task performance, rather than being constrained by what can be explicitly verbalized or hand-designed.

The paper emphasizes that LIVR achieves these gains without the need for:

Helper images
Bounding boxes
Image crops
Depth maps
Chain-of-Thought annotations

The authors conclude that while latent tokens are less directly interpretable than text-based reasoning, they provide a flexible internal space that allows models to discover visual abstractions that are difficult for humans to define or supervise explicitly. The work suggests that latent-space reasoning is a viable and powerful alternative to explicit intermediate supervision for vision-centric tasks.

Latent Implicit Visual Reasoning