LanteRn: Latent Visual Structured Reasoning

LanteRn is a novel framework that enhances large multimodal models' visual reasoning capabilities by enabling them to interleave language with compact, continuous latent visual representations, thereby avoiding the limitations of text-only descriptions and the inefficiencies of direct pixel-space reasoning.

André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

Published 2026-03-27

🏮 The Big Idea: "Thinking in Pictures" vs. "Talking About Pictures"

Imagine you are looking at a complex map and trying to explain a route to a friend over the phone.

  • Current AI (The "Talker"): Most large AI models today are like people who have to describe the map entirely in words. They look at the picture, translate it into a long, detailed speech ("There is a red house on the left, then a blue car, then a tree..."), and then try to solve the problem using only that speech. This is slow, clunky, and often loses important details (like exactly where the car is relative to the tree).
  • LanteRn (The "Thinker"): The LanteRn framework teaches the AI to do something different. Instead of forcing every thought into words, it allows the AI to have "silent picture thoughts." It can pause, hold a mental image of the map in its "mind's eye," reason about the spatial relationships in that image, and then speak the answer.

The paper calls these silent thoughts "Latent Visual Representations." Think of them as compressed, high-definition mental snapshots that the AI keeps to itself during the thinking process.
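The article doesn't spell out the exact shape of these latent representations, but a common way to build such compact "snapshots" is to pool a vision encoder's patch features into a handful of continuous vectors. A minimal numpy sketch (all shapes, names, and the pooling scheme are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def compress_to_latents(patch_features: np.ndarray, num_latents: int = 4) -> np.ndarray:
    """Pool a grid of vision-encoder patch features of shape (N, D) into a
    few compact latent vectors of shape (num_latents, D) by averaging
    contiguous chunks. These stand in for the model's 'mental snapshots'."""
    chunks = np.array_split(patch_features, num_latents, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])

# A fake 16-patch image embedding with feature dimension 8.
patches = np.random.default_rng(0).normal(size=(16, 8))
latents = compress_to_latents(patches)
print(latents.shape)  # (4, 8): 16 patch vectors compressed to 4 latent vectors
```

The point of the compression is that four vectors are far cheaper to carry through a reasoning chain than hundreds of patch features, while still preserving region-level visual information.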


🛠 How It Works: The Two-Stage Training

The researchers didn't just turn this feature on; they had to teach the AI how to use it. They did this in two steps, like training an athlete.

Stage 1: The "Copycat" Phase (Supervised Fine-Tuning)

  • The Goal: Teach the AI how to create these mental snapshots.
  • The Analogy: Imagine a student learning to draw. The teacher shows them a photo of a cat and says, "Look at the cat's ear. Now, close your eyes and hold a perfect mental image of that ear in your mind."
  • How they did it: The researchers used a "teacher" AI (a vision encoder) to show the student AI exactly what the mental image should look like for specific parts of a picture. The student AI learned to generate these internal "thought vectors" that match the teacher's visual data.
  • The Result: The AI learned to "see" and "hold" visual information internally, but it was just copying what it was told. It was good at remembering the picture, but maybe not great at using that memory to solve hard puzzles yet.
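The Stage-1 "copycat" objective can be sketched as a simple distance between the student's latent thought vectors and the frozen teacher encoder's features for the same region. Plain MSE is used here for illustration; the paper may use a different distance or a cosine objective:

```python
import numpy as np

def latent_alignment_loss(student_latents: np.ndarray,
                          teacher_features: np.ndarray) -> float:
    """Stage 1 ('copycat') loss: mean squared error between the student's
    latent visual tokens and the teacher vision encoder's features for the
    same image region. Both arrays have shape (num_latents, dim)."""
    return float(np.mean((student_latents - teacher_features) ** 2))

teacher = np.ones((4, 8))           # what the 'mental image' should look like
perfect_student = np.ones((4, 8))   # an exact copy of the teacher's features
sloppy_student = np.zeros((4, 8))   # a student who 'remembers' nothing

print(latent_alignment_loss(perfect_student, teacher))  # 0.0
print(latent_alignment_loss(sloppy_student, teacher))   # 1.0
```

Minimizing this loss is what teaches the model to "hold" visual information internally, before any reward for solving problems enters the picture.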

Stage 2: The "Coach" Phase (Reinforcement Learning)

  • The Goal: Teach the AI to use those mental snapshots to actually win the game.
  • The Analogy: Now, the student is in a competition. The coach (the reward system) doesn't care if the mental image is a perfect copy of the photo. The coach only cares: "Did you get the right answer?"
  • How they did it: The AI is given a problem. If it uses its "silent picture thoughts" to figure out the answer correctly, it gets a "gold star" (reward). If it just talks in circles and gets it wrong, it gets no star.
  • The Result: The AI learns that these silent thoughts are powerful tools. It starts to use them strategically. It might decide, "I don't need to describe the whole sky in words; I'll just keep a mental note of the cloud shape to help me find the bird." This makes the reasoning much more efficient and accurate.
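The "coach" phase can be sketched as a correctness-only reward fed into a REINFORCE-style update: rollouts that end in the right answer get pushed up, the rest get pushed down, and the latent thoughts are never graded directly. The exact RL algorithm and reward shaping in the paper may differ; this is a minimal illustration:

```python
import numpy as np

def stage2_reward(predicted_answer: str, gold_answer: str) -> float:
    """Stage 2 ('coach') reward: credit only for a correct final answer,
    regardless of what the latent 'picture thoughts' look like."""
    return 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else 0.0

def policy_gradient_weights(rewards: np.ndarray) -> np.ndarray:
    """REINFORCE-style advantage with a batch-mean baseline: correct
    rollouts get positive weight, incorrect ones get negative weight."""
    return rewards - rewards.mean()

rollouts = ["a bike", "a red car", "a bike"]  # three sampled reasoning traces
gold = "a bike"
rewards = np.array([stage2_reward(r, gold) for r in rollouts])
print(rewards)                           # [1. 0. 1.]
print(policy_gradient_weights(rewards))  # ≈ [0.33, -0.67, 0.33]
```

Because the reward never inspects the latents themselves, the model is free to reshape its silent visual thoughts into whatever form best helps it reach correct answers.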

🧪 The Results: Why It Matters

The researchers tested LanteRn on three challenging visual reasoning benchmarks (VisCoT, V*, and Blink). These tests require the AI to understand fine-grained details, like "Which object is behind the other?" or "Where exactly is the bike parked?"

  • The Old Way: The AI struggled because translating spatial relationships into a linear stream of text is like trying to describe a 3D sculpture using only a flat sketch. It often got lost.
  • The LanteRn Way: By keeping the "3D sculpture" in its mind (latent space) while thinking, the AI solved these puzzles much better.
    • It got better at Object Localization (finding exactly where things are).
    • It got better at Relative Position (understanding what is in front of or behind what).

💡 The Takeaway

LanteRn is a breakthrough because it stops forcing AI to "translate" its vision into words before it thinks. Instead, it lets the AI think in images and only speak when it has the final answer.

  • Before: Look at image → Translate to text → Think in text → Answer.
  • LanteRn: Look at image → Think in images → Translate to text → Answer.
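The difference between the two pipelines can be sketched as a toy reasoning trace: in the LanteRn-style trace, some steps are continuous latent vectors rather than words. Everything below is illustrative, not the paper's actual decoding loop:

```python
import numpy as np

def interleaved_trace() -> list:
    """Toy LanteRn-style reasoning trace: language steps interleaved with a
    silent latent 'picture thought' (a continuous vector, never verbalized)."""
    trace = []
    trace.append("Where is the bike?")                              # text step
    trace.append(np.random.default_rng(0).normal(size=(8,)))        # latent snapshot
    trace.append("The bike is left of the tree.")                   # text answer
    return trace

trace = interleaved_trace()
kinds = ["latent" if isinstance(step, np.ndarray) else "text" for step in trace]
print(kinds)  # ['text', 'latent', 'text']
```

In the "Before" pipeline, the middle step would instead be a long textual description of the scene; replacing it with a compact vector is what makes the latent approach both faster and less lossy for spatial detail.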

This makes the AI smarter, faster, and much better at tasks that require understanding the physical world, all without requiring a larger or more expensive model. It's like giving the AI a pair of "mental glasses" that let it see the solution clearly before it even opens its mouth to speak.
