From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

This paper identifies "Lazy Attention Localization" as a key bottleneck in multimodal cold-start training, where models fail to increase visual attention, and proposes the Attention-Guided Visual Anchoring and Reflection (AVAR) framework to effectively reshape attention distributions, achieving a 7.0% performance gain on multimodal reasoning benchmarks.

Ruilin Luo, Chufan Shi, Yizhen Zhang, Cheng Yang, Songtao Jiang, Tongkun Guan, Ruizhe Chen, Ruihang Chu, Peng Wang, Mingkun Yang, Yujiu Yang, Junyang Lin, Zhibo Yang

Published 2026-03-05

Imagine you are teaching a brilliant student (a Large Language Model) how to solve complex puzzles that involve both reading text and looking at pictures. This student is already very good at reading, but when you show them a picture and ask a math question about it, they often ignore the picture and just guess based on the words. They are "lazy" when it comes to looking.

This paper, titled "From Narrow to Panoramic Vision," investigates why this happens and how to fix it. Here is the story in simple terms:

1. The Problem: The "Lazy" Student

The researchers discovered a strange phenomenon they call "Lazy Attention Localization."

  • The Expectation: You would think that if you train a student with lots of "Picture + Question" examples (Multimodal Cold-Start), they would get better at looking at pictures.
  • The Reality: They don't. Even after training, the student still mostly ignores the picture and relies on the text. Their "eyes" (attention) stay fixed on the text, and they barely glance at the image.
  • The Surprise: When the researchers trained the student only with text-based logic puzzles (no pictures), the student actually got better at looking at pictures later! Why? Because learning how to think deeply with text taught them a "habit" of paying attention to details, which they then accidentally applied to pictures.

The Analogy: Imagine a detective who is terrible at looking at crime scene photos. You show them 1,000 crime scene photos, but they keep staring at the police report and ignoring the photos. However, if you train them to write detailed stories about text-only mysteries, they learn the habit of being thorough. When you finally give them a photo, they suddenly start looking at it closely because they've learned the "habit of looking."

2. The Metric: The "Eye-Contact Score"

To prove this, the authors invented a score called Visual Attention Score (VAS).

  • Think of this as a "Gaze Tracker." It measures how much the model's "eyes" are actually looking at the pixels in the image versus the words in the prompt.
  • The Finding: There is a strong match (96% correlation) between this score and how well the model solves problems.
    • Low Score (Narrow Vision): The model ignores the image. It fails.
    • High Score (Panoramic Vision): The model looks at the image intently. It succeeds.
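The paper's exact VAS formula isn't reproduced here, but the core idea — what fraction of the model's attention mass lands on image tokens while it generates its answer — can be sketched as follows. The function and variable names are illustrative, not from the paper:

```python
import numpy as np

def visual_attention_score(attn_weights: np.ndarray,
                           image_token_mask: np.ndarray) -> float:
    """Fraction of attention mass that generated tokens place on image tokens.

    attn_weights: (num_generated_tokens, num_context_tokens) attention
        probabilities, each row summing to 1 (e.g. averaged over heads/layers).
    image_token_mask: boolean array of length num_context_tokens marking
        which context positions belong to the image.
    """
    # Attention mass each generated token spends on image positions
    mass_on_image = attn_weights[:, image_token_mask].sum(axis=1)
    # Average over the whole generated sequence
    return float(mass_on_image.mean())

# Toy example: 2 generated tokens, 4 context tokens (last 2 are image tokens)
attn = np.array([[0.4, 0.3, 0.2, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
mask = np.array([False, False, True, True])
print(visual_attention_score(attn, mask))  # 0.55
```

A "narrow-vision" model would score near 0 here; a "panoramic" one keeps this number high throughout its reasoning trace.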

3. The Quick Fix: Training-Free "Glasses"

Before building a new training system, the researchers tried a quick trick. They didn't retrain the model at all. Instead, they just tweaked the model's "glasses" while it was answering questions.

  • They told the model: "Stop staring so hard at the system instructions (the boring setup text) and look at the picture!"
  • Result: Just by forcing the model to shift its gaze, performance jumped by 1–2%. This proved that the problem wasn't the model's intelligence; it was just where it was looking.
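A minimal sketch of this training-free idea: at inference time, scale down the attention a token places on system-prompt positions and renormalize, so the freed-up mass flows to the image (and other) tokens. The damping factor and token grouping below are assumptions for illustration, not the paper's exact intervention:

```python
import numpy as np

def reweight_attention(attn_row: np.ndarray,
                       system_mask: np.ndarray,
                       damp: float = 0.5) -> np.ndarray:
    """Scale down attention on system-prompt tokens, then renormalize.

    attn_row: 1-D attention distribution over context tokens (sums to 1).
    system_mask: boolean mask marking system-prompt positions.
    damp: multiplicative factor applied to system-prompt attention (assumed).
    """
    out = attn_row.copy()
    out[system_mask] *= damp          # "stop staring at the setup text"
    return out / out.sum()            # renormalize to a valid distribution

# 5 context tokens: first 2 are system prompt, last 3 are image tokens
row = np.array([0.3, 0.3, 0.2, 0.1, 0.1])
sys_mask = np.array([True, True, False, False, False])
new_row = reweight_attention(row, sys_mask)
# System-prompt mass drops from 0.60 to ~0.43; image tokens absorb the rest.
```

In a real model this reweighting would be applied inside the attention layers before the softmax output is consumed; the sketch only shows the redistribution arithmetic.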

4. The Big Solution: AVAR (The "Visual Anchor" System)

Since the quick fix worked, they built a full training framework called AVAR (Attention-Guided Visual Anchoring and Reflection) to teach the model to look at pictures naturally.

They did this in three creative steps:

  • Step 1: The "Photo Description" (Data Synthesis)
    Instead of just showing a picture and a question, they used a super-smart AI to write a very detailed, high-definition description of the picture first. Then, they made the model solve the problem while constantly referencing that description.

    • Analogy: Instead of just handing a student a map, you first have them write a detailed tour guide of the map, then solve the puzzle using that guide.
  • Step 2: The "Look Back" Habit (Training Objectives)
    They added a special rule during training: "If you stop looking at the picture, you get a penalty." They forced the model to constantly check the image, inserting phrases like "Let me look at the triangle again" or "Checking the image..." into its thought process.

    • Analogy: It's like a teacher tapping the student on the shoulder every time they start daydreaming, saying, "Hey, look at the diagram!"
  • Step 3: The "Good Job" Reward (Reward Shaping)
    In the final reinforcement learning stage, they didn't just reward the model for getting the right answer. They also rewarded it for keeping its eyes on the picture throughout the whole process.

    • Analogy: You don't just pay the student for the correct answer; you pay them extra if they can point to exactly where in the picture they found the clue.
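The shaped reward of Step 3 can be sketched as a weighted sum of answer correctness and the visual-attention score. The weight below is an illustrative placeholder, not a value from the paper:

```python
def shaped_reward(is_correct: bool, vas: float, vas_weight: float = 0.2) -> float:
    """Reward = correctness + a bonus for keeping attention on the image.

    is_correct: whether the final answer matched the ground truth.
    vas: visual attention score in [0, 1] for this rollout.
    vas_weight: how much the attention bonus counts (assumed value).
    """
    correctness = 1.0 if is_correct else 0.0
    return correctness + vas_weight * vas

# A correct answer grounded in the picture earns more than a correct
# answer that mostly ignored it:
print(shaped_reward(True, vas=0.1))  # 1.02
print(shaped_reward(True, vas=0.8))  # 1.16
```

The design choice this illustrates: correctness still dominates the signal, but among equally correct rollouts, the ones that kept "looking" at the image are preferred, which is what pushes the policy toward panoramic attention.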

5. The Result: From Narrow to Panoramic

By using AVAR, the model transformed from a "Narrow-View" student (who ignores pictures) to a "Panoramic-View" student (who sees the whole picture).

  • Performance: The new model (AVAR-Thinker) improved its reasoning skills by 7% on average across many difficult benchmarks.
  • Specific Wins: It got much better at geometry (MathVision) and stopped making up fake details about images (HallusionBench).

Summary

The paper teaches us that how a model "looks" is just as important as what it "knows."
Current methods often fail because they don't teach models to look at images; they just feed them data. By explicitly teaching models to keep their "eyes" on the visual details—using a system that forces them to anchor their thoughts to the image—we can turn a confused, text-dependent robot into a sharp, visual reasoning expert.