VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models

The Problem: The "Forgetful Detective"

Imagine you are a detective trying to solve a complex mystery based on a crime scene photo and a few witness statements.

In the old days, AI models were like detectives who glanced at the photo once and then started writing their report. They would talk to themselves for a long time ("Wait, let me think... maybe the shooter was left-handed...").

But here's the catch: The longer they talked to themselves, the more they forgot the photo.

As the AI generated longer and longer chains of thought (text), its attention drifted away from the visual clues. It started relying entirely on its memory and general knowledge (textual priors) rather than what was actually in the picture. It was like a detective closing their eyes, spinning in a circle, and guessing the answer based on a hunch, forgetting to look at the evidence on the table. This is called "Visual Dilution."

The Old Solutions: Expensive Training

Scientists tried to fix this in two ways:

Re-training the Detective: They used Reinforcement Learning (RL) to teach the AI, "Hey, every time you write a sentence, look back at the photo!" This worked, but it was like hiring a team of 50 tutors to re-teach the detective from scratch. It was incredibly expensive and slow.
Just Thinking Longer: They tried making the AI think even longer (Textual Self-Reflection). But this just made the problem worse; the AI got more lost in its own thoughts and forgot the photo even faster.

The New Solution: VisRef (The "Smart Glance")

The authors of this paper proposed VisRef. They asked a simple question: "Can we make the detective look back at the photo without re-training them at all?"

The answer is yes. VisRef is a "training-free" framework. It doesn't change the AI's brain; it just changes how it looks at the photo while it thinks.

How VisRef Works (The Analogy)

Imagine the AI is solving a math problem on a whiteboard with a diagram next to it.

The "Core" Group: The diagram is broken into hundreds of tiny pieces (visual tokens, roughly small patches of the image). If the AI tries to look at every single piece every time it writes a sentence, it gets overwhelmed and slow.
The Smart Selection (DPP): VisRef acts like a smart spotlight. Instead of shining the light on the whole room, it uses a mathematical trick (called a Determinantal Point Process) to pick the most important 30% of the details that are relevant right now.
- Relevance: It picks the parts of the image that match what the AI is currently thinking about (e.g., if the AI is talking about a "red car," the spotlight zooms in on the red car).
- Diversity: It makes sure the spotlight doesn't just stare at the car's tire five times. It spreads out to take in the doors, the driver, and the background, so the view stays broad and varied.
The "Re-Inject": Every time the AI takes a step in its reasoning, VisRef re-injects these selected visual clues back into the AI's mind. It's like the detective pausing their monologue, opening their eyes, looking at the specific clues on the table, and then continuing their thought process with fresh eyes.
Knowing When to Stop: VisRef also has a "confidence meter." If the AI is 99% sure of the answer (low "entropy"), it stops thinking and gives the answer. If it's confused, it keeps looking at the photo and thinking more.

Why It's a Big Deal

No Training Needed: You can take any existing smart AI model and plug VisRef in like a USB drive. No expensive retraining required.
Better Accuracy: In tests (like MathVista and MM-Star), models using VisRef scored noticeably higher than models that just "thought longer" (for example, a 6.4% gain on MM-Star with InternVL-3.5), and they were competitive with a model that was re-trained to look back.
Efficient: It doesn't waste time looking at irrelevant parts of the image. It only looks at what matters.

The Bottom Line

VisRef is like giving a forgetful genius a sticky note system.

Instead of letting the genius ramble on and forget the picture, VisRef forces them to pause, stick a few relevant "sticky notes" (visual clues) back onto their desk, and remind them of the evidence before they continue their reasoning. It keeps the AI grounded in what the image actually shows, so that thinking longer makes it more accurate rather than more prone to hallucination.

1. Problem Statement

The Issue of Visual Dilution:
Multi-Modal Large Reasoning Models (MLRMs) have shown success by extending Chain-of-Thought (CoT) reasoning to vision-language tasks. However, a critical limitation exists: as these models generate longer reasoning traces (test-time scaling), their attention to visual tokens progressively diminishes.

Mechanism: As the context window expands with textual reasoning steps, visual tokens become diluted. The model increasingly relies on textual priors rather than grounding its reasoning in the actual image content.
Consequence: This leads to visual hallucinations and degraded performance on vision-critical tasks (e.g., geometry, chart interpretation), despite the model "thinking" longer.
Limitations of Existing Solutions:
- Reinforcement Learning (RL) Fine-tuning: Methods like "Look-Back" train models to explicitly revisit images. While effective, they are computationally expensive, require large-scale annotated datasets, and are not easily scalable.
- Text-Centric Test-Time Scaling: Methods that simply extend reasoning via self-reflection (e.g., "Wait, think more") fail to address the visual dilution problem, as they only extend the text chain without re-grounding the visual input.

Core Question: Can we restore visual grounding entirely at test time without any model retraining or RL fine-tuning?

2. Methodology: VisRef

The authors propose VisRef, a training-free framework that adaptively reinjects a carefully selected subset of visual tokens during the reasoning process.

A. Core Mechanism: Adaptive Visual Token Re-injection

Instead of processing the image once at the beginning, VisRef dynamically selects and re-injects visual tokens at each reasoning step $k$.

Input: Image $I$ and Text Prompt $T$.
Visual Tokens: A set $V = \{v_1, ..., v_N\}$ extracted from the image.
Process: At step $k$, the model generates a reasoning trace $z_k$. VisRef then selects a subset $V_k \subset V$ to re-inject, updating the context to $\tau_{1:k} = \{(z_1, V_1), ..., (z_k, V_k)\}$.
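The step-by-step process above can be sketched as a simple loop. This is only an illustrative outline, not the authors' implementation: `reason_with_reinjection`, `generate_step`, and `select_tokens` are hypothetical names, and the toy stand-ins below replace a real model and a real selection rule.

```python
from typing import Callable, List, Sequence, Tuple

Token = Sequence[float]          # one visual token embedding (toy stand-in)
Step = Tuple[str, List[Token]]   # (z_k, V_k): reasoning text plus re-injected tokens


def reason_with_reinjection(
    prompt: str,
    visual_tokens: List[Token],
    generate_step: Callable[[str, List[Token], List[Step]], str],
    select_tokens: Callable[[str, List[Token]], List[Token]],
    max_steps: int,
) -> List[Step]:
    """Build the context tau_{1:k} = [(z_1, V_1), ..., (z_k, V_k)]."""
    context: List[Step] = []
    for _ in range(max_steps):
        # Hypothetical model call: produce the next reasoning trace z_k.
        z_k = generate_step(prompt, visual_tokens, context)
        # Hypothetical selector: choose the subset V_k to re-inject.
        v_k = select_tokens(z_k, visual_tokens)
        context.append((z_k, v_k))
    return context


# Toy stand-ins so the sketch runs without a model (assumptions, not VisRef).
def toy_generate(prompt, visual_tokens, context):
    return f"step {len(context) + 1} about: {prompt}"


def toy_select(z_k, visual_tokens):
    return list(visual_tokens[:2])


trace = reason_with_reinjection(
    "find x",
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    toy_generate,
    toy_select,
    max_steps=3,
)
```

In a real system the selector would depend on $z_k$, and the loop would end early once a stopping rule fires rather than always running `max_steps` iterations.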

B. Visual Token Selection via Determinantal Point Processes (DPP)

Selecting which tokens to re-inject is the central challenge. Naively re-injecting all tokens is computationally prohibitive. VisRef formulates this as an optimization problem to find a coreset that is both relevant to the current reasoning state and diverse in visual coverage.

Relevance: Tokens must align with the current textual reasoning state $z_k$.
Diversity: Tokens must cover different parts of the image to avoid redundancy.

The Mathematical Formulation:
The authors use Determinantal Point Processes (DPP) to balance these objectives.

They project visual tokens into the subspace defined by the current text tokens $z_k$ using a kernel matrix $L^k$.
The selection objective is to maximize the determinant of the kernel matrix restricted to the selected subset $V_k$ :
$\hat{V}_k = \arg \max_{V_k \subseteq V} \det(L_{V_k}^k)$
Decomposition: Maximizing $\log \det(L_{V_k}^k)$ naturally decomposes into:
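$\log \det(L_{V_k}^k) = \underbrace{\sum_{i \in V_k} \log(r_i^2)}_{\text{Relevance}} + \underbrace{\log \det(\bar{L}_{V_k}^k)}_{\text{Diversity}}$
Where $r_i$ is the relevance score of token $i$, and $\bar{L}^k$ is the normalized diversity kernel. The identity holds when each kernel entry factorizes as $L_{ij}^k = r_i \bar{L}_{ij}^k r_j$, since the determinant of the restricted kernel then splits into $\prod_{i \in V_k} r_i^2$ times $\det(\bar{L}_{V_k}^k)$.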
Efficiency: Since exact optimization is NP-hard, they use a greedy selection algorithm with a fixed token budget $m$ (e.g., 30% of total visual tokens) to approximate the solution efficiently.
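A greedy selection of this kind can be sketched in plain Python. The sketch assumes a generic relevance-times-similarity kernel, $L_{ij} = r_i \, \cos(v_i, v_j) \, r_j$, with $r_i$ taken as a rescaled cosine similarity between token $i$ and a single query vector that stands in for $z_k$. The paper's actual kernel (built from a projection onto the text-token subspace) may differ, and the names `build_kernel` and `greedy_dpp` are hypothetical. The greedy step uses the standard incremental Cholesky update, which picks the token with the largest marginal gain in $\log \det$ at each iteration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def build_kernel(tokens, query):
    """Assumed kernel: L_ij = r_i * cos(v_i, v_j) * r_j."""
    # Relevance r_i in (0, 1]: cosine to the query, rescaled from [-1, 1].
    r = [max(1e-6, 0.5 * (1.0 + cosine(t, query))) for t in tokens]
    n = len(tokens)
    return [[r[i] * cosine(tokens[i], tokens[j]) * r[j] for j in range(n)]
            for i in range(n)]


def greedy_dpp(L, m, eps=1e-10):
    """Greedily pick up to m indices that approximately maximize det(L_S)."""
    n = len(L)
    d2 = [L[i][i] for i in range(n)]   # current marginal gain of each candidate
    c = [[] for _ in range(n)]         # incremental Cholesky rows
    remaining = list(range(n))
    selected = []
    while len(selected) < m and remaining:
        j = max(remaining, key=lambda i: d2[i])
        if d2[j] < eps:                # nothing left adds volume
            break
        selected.append(j)
        remaining.remove(j)
        dj = math.sqrt(d2[j])
        for i in remaining:
            e = (L[j][i] - sum(a * b for a, b in zip(c[j], c[i]))) / dj
            c[i].append(e)
            d2[i] -= e * e             # gain shrinks for tokens similar to j
    return selected


# Toy example: tokens 0 and 1 are near-duplicates, both close to the query.
tokens = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.6, 0.8]]
query = [1.0, 0.1]
chosen = greedy_dpp(build_kernel(tokens, query), m=2)
```

With a budget of two, the greedy rule takes one of the two near-duplicate tokens and then moves on to a different region instead of taking the other duplicate, which is the relevance-versus-diversity trade-off in miniature.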

C. Adaptive Stopping Criterion

To prevent "overthinking" (which can degrade performance) and bound computational cost, VisRef employs an entropy-based stopping criterion.

At each step, the model calculates the entropy $H_k$ of its answer distribution.
If $H_k < \delta_{\text{entropy}}$ (a confidence threshold), the model stops reasoning and outputs the final answer.
This allows the model to reason longer for complex problems and stop early for simple ones.
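The stopping rule can be illustrated with a few lines of Python. The sketch assumes the answer distribution is available as a list of probabilities over candidate answers; how that distribution is obtained from the model is not specified here, and `answer_entropy` and `should_stop` are hypothetical names.

```python
import math


def answer_entropy(probs):
    """Shannon entropy H = -sum p * ln(p), in nats; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_stop(probs, delta_entropy):
    """Stop reasoning when the entropy falls below the threshold."""
    return answer_entropy(probs) < delta_entropy


# A peaked distribution (confident) versus a uniform one (uncertain).
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
stop_confident = should_stop(confident, delta_entropy=0.5)
stop_uncertain = should_stop(uncertain, delta_entropy=0.5)
```

The threshold value 0.5 is arbitrary and chosen only for the example. A lower threshold makes the model reason longer before committing to an answer.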

3. Key Contributions

Training-Free Framework: VisRef achieves adaptive visual refocusing purely at inference time without modifying model parameters or requiring RL fine-tuning.
DPP-Based Selection: Introduces a principled method using Determinantal Point Processes to select a visual token coreset that optimally balances relevance (alignment with current thought) and diversity (global image coverage).
Adaptive Stopping: Implements an entropy-based criterion to dynamically terminate reasoning, balancing accuracy and efficiency.
Plug-and-Play: The method is compatible with any pre-trained Multi-Modal Large Reasoning Model (MLRM).

4. Experimental Results

The authors evaluated VisRef on three challenging benchmarks (MathVista, MM-Star, MathVision) using three state-of-the-art models (InternVL-3.5, Qwen-3-VL, SAIL-VL2).

Key Findings:

Superior Performance: Under fixed test-time compute budgets, VisRef consistently outperforms both Standard Thinking (ST) and Textual Self-Reflection (TSR).
- Example: On MathVision with SAIL-VL2, VisRef improved accuracy by 7.5% over Standard Thinking and 5.4% over Textual Self-Reflection.
- Example: On MM-Star with InternVL-3.5, VisRef achieved a 6.4% gain over the baseline.
Test-Time Scaling: When generating multiple parallel reasoning chains (scaling compute), VisRef consistently achieves higher accuracy than parallel text-only reasoning for any given token budget.
Comparison with RL Methods: VisRef performs competitively with Look-Back (an RL-based method requiring 60 GPU hours of fine-tuning) but does so without any training. Furthermore, combining Look-Back with VisRef yields the best results, indicating the methods are orthogonal.
Ablation Studies:
- Both Relevance and Diversity terms in the DPP objective are crucial; using only one leads to significant performance drops.
- The method is robust across different token budgets ($m$) and entropy thresholds ($\delta_{\text{entropy}}$).
- VisRef scales effectively from small (1B) to large (8B) models.

5. Significance and Impact

Solving the "Visual Dilution" Problem: VisRef provides a practical solution to the known issue where MLRMs lose visual grounding during long reasoning chains, a critical bottleneck for real-world multi-modal applications.
Efficiency vs. Performance: It demonstrates that high-performance visual reasoning does not necessarily require expensive retraining or RL. Simple, principled inference-time interventions can yield substantial gains.
Generalizability: As a training-free approach, VisRef can be immediately deployed on any existing pre-trained MLRM, making it a highly scalable solution for the community.
Human-Like Reasoning: The method mimics human problem-solving strategies (alternating between examining the image and abstract reasoning) purely through algorithmic token selection, bridging the gap between human cognition and AI reasoning.

In conclusion, VisRef establishes that visual refocusing is a critical component of test-time scaling for multi-modal models, offering a computationally efficient, training-free pathway to significantly improve reasoning accuracy on complex visual tasks.