Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

This paper proposes Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables multimodal large language models to perform precise region-grounded reasoning by generating continuous numerical coordinates as actions, thereby overcoming the limitations of discrete text-based or fixed-patch approaches while improving localization accuracy and training efficiency.

Kesen Zhao, Beier Zhu, Junbao Zhou, Xingyu Zhu, Zhongqi Yue, Hanwang Zhang

Published 2026-03-02

Imagine you are trying to teach a very smart robot how to look at a picture and answer a question about it, like "What is the man on the left holding?"

To do this, the robot needs to zoom in on the specific part of the picture (the man's hands) to get a better look before answering. This process is called "Visual Chain-of-Thought."

For a long time, robots had to do this in a very clumsy way. Here is the story of how this paper fixes that problem using a new method called NV-CoT.

The Old Way: The "Pixelated" Problem

Imagine you are trying to tell a friend exactly where a specific tree is in a giant park.

  1. The "Text" Method (The Old Way):
    The robot tries to describe the tree's location using words. It says, "The tree is at 3... 1... 5... 9."

    • The Problem: To the robot, "3" and "4" are just two completely different words, like "Apple" and "Banana." It doesn't understand that they are right next to each other. If the tree is actually at 3.1 and the robot guesses 3.2, it gets told it was completely wrong — exactly as wrong as if it had guessed 9.9 — because "1" and "2" are just different words. It's like trying to measure a room with a ruler that only has whole inches, no fractions. It's imprecise and confusing.
  2. The "Patch" Method (Another Old Way):
    The robot cuts the picture into a giant grid of Lego blocks (patches). It can only point to "Block #42."

    • The Problem: What if the tree is right on the line between Block 42 and Block 43? The robot can't be precise. It's stuck with the size of the Lego blocks, no matter how small they are.
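The patch method's ceiling is easy to see with a little arithmetic. Here is a minimal sketch (the image width, patch count, and coordinate below are made-up numbers for illustration, not from the paper): once a location is snapped to a fixed grid, the best you can ever recover is the center of a block, so you are stuck with an error of up to half a patch.

```python
# Sketch: quantization error from a fixed patch grid (hypothetical sizes).
image_width = 1000
num_patches = 14
patch_size = image_width / num_patches  # ~71.4 pixels per patch

true_x = 357.0  # where the object actually is
patch_index = int(true_x // patch_size)          # all a patch model can say: "Block #4"
patch_center = (patch_index + 0.5) * patch_size  # best reconstruction from that index

error = abs(true_x - patch_center)  # irreducible error, up to half a patch
print(f"patch #{patch_index}, reconstructed x = {patch_center:.1f}, error = {error:.1f}px")
```

No matter how fine the grid, the error only shrinks with the block size; it never reaches zero.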

The New Way: NV-CoT (The "Smooth Slider")

The authors of this paper, Kesen Zhao and his team, came up with a brilliant solution called Numerical Visual Chain-of-Thought (NV-CoT).

Instead of using words or Lego blocks, they gave the robot a smooth slider (like the volume knob on a stereo or a dimmer switch for a light).

  • How it works: When the robot needs to zoom in, it doesn't say "3, 1, 5, 9." Instead, it generates a smooth, continuous number like 3.142.
  • The Magic: Because it's a smooth number, the robot understands that 3.14 is very close to 3.15. If it misses the target by a tiny bit, it knows it was "almost right," not "completely wrong." This allows the robot to point to the exact spot with laser precision.
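The difference between the two kinds of feedback can be sketched in a few lines (hypothetical numbers, not the paper's implementation): a token-style check only says match or no match, while a continuous loss like absolute error tells the robot how close it was.

```python
target = 3.14  # the "true" coordinate (made-up for illustration)

def token_loss(pred: float) -> int:
    # Discrete token view: the string either matches or it doesn't.
    # A near miss and a wild miss both score 1 ("completely wrong").
    return 0 if f"{pred:.2f}" == f"{target:.2f}" else 1

def l1_loss(pred: float) -> float:
    # Continuous view: small miss -> small penalty ("almost right").
    return abs(pred - target)

for pred in (3.15, 9.99):
    print(f"pred={pred}: token_loss={token_loss(pred)}, l1_loss={l1_loss(pred):.2f}")
```

Under the token view, 3.15 and 9.99 are equally wrong; under the continuous view, 3.15 is nearly perfect, which is exactly the graded signal that lets the model point with "laser precision."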

The "Gaussian" and "Laplace" Analogy

The paper mentions some fancy math terms like "Gaussian" and "Laplace" policies. Here is what they mean in plain English:

  • The Gaussian Policy (The "Bell Curve" Guess):
    Imagine the robot is throwing darts at a target. It doesn't just throw one dart; it throws a whole bunch of darts that cluster around the center. It knows, "I'm pretty sure the target is here, but I might be off by a little bit." This helps the robot learn by trying slightly different spots and seeing which one works best.
  • The Laplace Policy (The "Sharper" Guess):
    Sometimes, the robot's aim needs to be sharply concentrated on the center, while a few wild misses shouldn't derail its training. The Laplace method is like a dart thrower whose throws cluster tightly around the bullseye, and whose rare big misses count against them less harshly. It's great for situations where you need to be very precise and not let "outliers" (wild guesses) throw off the learning.

Why is this a Big Deal?

The researchers tested this new method on three different "video game levels" (benchmarks) where robots have to find things in pictures.

  1. It's Faster: The robot learns how to zoom in much quicker than before.
  2. It's More Accurate: The robot finds the exact object (like a pepper mill or a handbag) without accidentally zooming in on the background.
  3. It's Flexible: It works whether the robot is being taught by a teacher (Supervised Fine-Tuning) or learning by trial and error (Reinforcement Learning).

The Bottom Line

Think of the old robots as people trying to navigate a city using a map with only major highways. They could get close, but they often got lost in the details.

NV-CoT gives the robot a GPS with turn-by-turn navigation. It allows the robot to say, "Turn left at 42.5 meters," instead of "Turn left at the next big intersection." This small change makes the robot much smarter, faster, and better at understanding the visual world.

In short: They taught the robot to stop guessing with words and start measuring with smooth, precise numbers.
