ConFoThinking: Consolidated Focused Attention Driven Thinking for Visual Question Answering

ConFoThinking is a novel framework for Visual Question Answering that enhances fine-grained perception by consolidating fragmented attention signals into a designated intermediate layer and utilizing concise semantic cues to accurately localize and zoom in on salient regions, thereby overcoming the limitations of existing tool-augmented and attention-driven methods.

Zhaodong Wu, Haochen Xue, Qi Cao, Wenqi Mo, Yu Pei, Wenqi Xu, Jionglong Su, Yang Liu

Published 2026-03-03

The Big Problem: The "Clumsy Detective"

Imagine you have a super-smart detective (an AI) who is really good at solving mysteries by looking at photos. But there's a catch: this detective is terrible at pointing.

When you ask, "What color is the apple logo?", the detective might know exactly which part of the photo to look at in their "mind's eye." However, when they try to tell you where to zoom in, they fumble. They might say, "Zoom in on coordinates [0.7, 0.33, 0.92, 0.5]," but those numbers are slightly off. Because of this tiny math error, the camera zooms into the wrong spot (maybe the background instead of the logo), and the detective gives the wrong answer.

This is the problem the paper calls "Grounding-Perception Mismatch." The AI attends to the right region internally, but it reports the wrong coordinates.

The Old Solutions (And Why They Failed)

Researchers tried two main ways to fix this:

  1. The "Coordinate Generator" Approach: They forced the AI to just spit out numbers for a box.
    • The Flaw: It's like asking a human to describe a location using only a precise string of numbers. They often get the numbers slightly wrong, even when they know exactly where the object is.
  2. The "Attention Map" Approach: They tried to use the AI's internal "gaze" (where it was looking) to find the spot.
    • The Flaw: The AI's gaze is scattered. Sometimes it looks at the right spot in "Layer 5" of its brain, and sometimes in "Layer 22." It's like trying to find a specific person in a crowd by asking 30 different people who are all looking in slightly different directions. If you pick the wrong person to ask, you get lost. Also, if you ask the AI a long, confusing question, its gaze gets blurry and unfocused.

The New Solution: ConFoThinking

The authors created a new method called ConFoThinking. Think of it as training the detective to be a Specialized Scout who uses a Flashlight instead of a map.

Here is how it works in three simple steps:

1. The "What to Look For" Cue (The Flashlight)

Instead of asking the AI a long, messy question like "What is the color of the apple logo on the shirt in the top left corner?", the system first asks the AI to generate a short, sharp clue.

  • The Analogy: Imagine the AI whispers to itself: "Focus on the big letters near the top."
  • Why it helps: This short phrase acts like a flashlight. It cuts out all the noise and tells the AI exactly what to look for, rather than getting distracted by the whole question.
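The cue step can be sketched as a two-stage prompt: first ask the model for a terse phrase naming the region, then use that phrase (not the full question) to drive attention. This is a minimal illustration, not the paper's actual code; `ask_model` is a hypothetical stand-in for a real vision-language model call.

```python
# Sketch of the "what to look for" cue step. `ask_model` is a stub
# standing in for a real VLM; in a real system its reply would depend
# on the image and the prompt.

def build_cue_prompt(question: str) -> str:
    """Prompt that asks for a terse visual cue instead of an answer."""
    return (
        "Do not answer the question yet. In at most 8 words, name the "
        f"visual region you would need to inspect.\nQuestion: {question}"
    )

def ask_model(prompt: str) -> str:
    # Stub: a real system would call the VLM here with the image.
    return "the logo near the top of the shirt"

question = "What is the color of the apple logo on the shirt in the top left?"
cue = ask_model(build_cue_prompt(question))
print(cue)  # a short, focused phrase used for grounding instead of the question
```

The point of the sketch is the interface: the long question goes in, a short cue comes out, and only the cue is used to light up the attention map.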

2. The "Consolidated Gaze" (The Fixed Layer)

The researchers realized the AI's "gaze" was jumping around between different layers of its brain. So, they forced the AI to condense all that scattered attention into one specific layer (like Layer 22).

  • The Analogy: Imagine a chaotic room where 30 people are pointing at different spots. The researchers told everyone, "Okay, stop moving. Everyone, point at the object right now using your index finger, and freeze."
  • The Result: Now, instead of a scattered mess of pointing fingers, you have one clear, unified spotlight on the exact object. This makes the location signal super stable.
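In code terms, "consolidating the gaze" means reading attention from one fixed intermediate layer instead of hunting across all of them. The toy sketch below (NumPy, with made-up shapes; the layer index 22 is just the article's example) extracts that layer, averages over heads, and reshapes the attention from the last token onto the image patches into a 2-D heatmap.

```python
import numpy as np

# Toy sketch of the "consolidated gaze": read attention from ONE fixed
# intermediate layer and average over heads to get a single stable
# heatmap. Shapes and the layer index are illustrative only.

def consolidated_attention(attn, layer=22, num_patches=576, grid=24):
    """attn: [num_layers, num_heads, seq_len, seq_len] attention weights.
    Assumes the first `num_patches` positions are image patches and the
    last position is the token whose gaze we want."""
    layer_attn = attn[layer]              # [heads, seq, seq]
    head_avg = layer_attn.mean(axis=0)    # [seq, seq], one map per layer
    gaze = head_avg[-1, :num_patches]     # last token -> image patches
    return gaze.reshape(grid, grid)       # 2-D heatmap over the image

# Fake attention tensor for demonstration (30 layers, 16 heads, 600 tokens).
rng = np.random.default_rng(0)
attn = rng.random((30, 16, 600, 600))
heatmap = consolidated_attention(attn)
print(heatmap.shape)  # (24, 24)
```

Fixing the layer is the whole trick: every query reads its "spotlight" from the same place, so the downstream box detector sees a consistent signal.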

3. The "Box Translator" (The AttnDetector)

Once the AI has this clear, stable spotlight (a heatmap showing exactly where the object is), a small, specialized tool called AttnDetector steps in.

  • The Analogy: The spotlight is like a glowing cloud on a map. The AttnDetector is a translator that looks at that glowing cloud and says, "Ah, that cloud covers the area from here to here." It converts the glowing cloud into a perfect, clean box.
  • Why it helps: The AI doesn't have to guess the numbers anymore. It just draws the box based on the glowing cloud it already sees.
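The "glowing cloud to clean box" idea can be approximated with a simple threshold-and-crop pass. The real AttnDetector is a learned module; the sketch below is only a hand-rolled stand-in that keeps the cells above a fraction of the heatmap's peak and returns the normalized box around them.

```python
import numpy as np

# Hand-rolled stand-in for the paper's AttnDetector: turn a fuzzy
# attention heatmap (the "glowing cloud") into a clean normalized
# bounding box [x1, y1, x2, y2] in 0..1. The real module is learned;
# this threshold-and-crop version only sketches the idea.

def heatmap_to_box(heatmap, thresh=0.5):
    """Box covering all grid cells >= `thresh` of the heatmap's max."""
    h, w = heatmap.shape
    mask = heatmap >= thresh * heatmap.max()
    ys, xs = np.nonzero(mask)
    # Convert grid-cell indices to normalized image coordinates.
    return [xs.min() / w, ys.min() / h, (xs.max() + 1) / w, (ys.max() + 1) / h]

# Synthetic heatmap with a hot 4x4 region near the top-left.
hm = np.zeros((24, 24))
hm[2:6, 3:7] = 1.0
print(heatmap_to_box(hm))  # [0.125, 0.0833..., 0.2916..., 0.25]
```

Because the box is read off the heatmap the model already produced, there is no step where the model has to "say" coordinates out loud, which is exactly where the old approach went wrong.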

The Result: A Super Detective

By using this method, the AI stops trying to be a mathematician (guessing coordinates) and starts being a visual thinker (focusing on the image).

  • Before: The AI sees the apple, gets confused about the numbers, zooms in on the wrong spot, and says, "I can't see it."
  • After (ConFoThinking): The AI says, "I'm focusing on the big letters," the system locks onto that exact spot, zooms in perfectly, and says, "It's red!"

Why This Matters

This paper shows that we don't need to make AI smarter at math to make it better at seeing. We just need to teach it how to focus and simplify its thinking. By breaking the problem down into "What to look for" and "Where it is," and then stabilizing the "Where," they made the AI significantly better at answering questions about complex, high-resolution images.

In short: They stopped asking the AI to draw a map with numbers and started teaching it to shine a flashlight on the answer.