Thinking with Images as Continuous Actions: Numerical Visual Chain-of-Thought

This paper proposes Numerical Visual Chain-of-Thought (NV-CoT), a framework that enables multimodal large language models to perform precise region-grounded reasoning by generating continuous numerical coordinates as actions, thereby overcoming the limitations of discrete text-based or fixed-patch approaches while improving localization accuracy and training efficiency.

Kesen Zhao, Beier Zhu, Junbao Zhou, Xingyu Zhu, Zhongqi Yue, Hanwang Zhang

Published 2026-03-02

Imagine you are trying to teach a very smart robot how to look at a picture and answer a question about it, like "What is the man on the left holding?"

To do this, the robot needs to zoom in on the specific part of the picture (the man's hands) to get a better look before answering. This process is called "Visual Chain-of-Thought."

For a long time, robots had to do this in a very clumsy way. Here is the story of how this paper fixes that problem using a new method called NV-CoT.

The Old Way: The "Pixelated" Problem

Imagine you are trying to tell a friend exactly where a specific tree is in a giant park.

  1. The "Text" Method (The Old Way):
    The robot tries to describe the tree's location using words. It says, "The tree is at 3... 1... 5... 9."

    • The Problem: To the robot, "3" and "4" are just two completely different words, like "Apple" and "Banana." It doesn't understand that they are right next to each other. If the tree is actually at 3.1 and the robot guesses 3.2, it gets told it was completely wrong — exactly as wrong as if it had guessed 9.9 — because "1" and "2" are just different words. It's like trying to measure a room with a ruler that only has whole inches, no fractions. It's imprecise and confusing.
  2. The "Patch" Method (Another Old Way):
    The robot cuts the picture into a giant grid of Lego blocks (patches). It can only point to "Block #42."

    • The Problem: What if the tree is right on the line between Block 42 and Block 43? The robot can't be precise. It's stuck with the size of the Lego blocks, no matter how small they are.
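The patch method's ceiling is easy to see with a little arithmetic. Here is a minimal sketch (the image width, patch count, and coordinate below are made-up numbers for illustration, not from the paper): once a location is snapped to a fixed grid, the best you can ever recover is the center of a block, so you are stuck with an error of up to half a patch.

```python
# Sketch: quantization error from a fixed patch grid (hypothetical sizes).
image_width = 1000
num_patches = 14
patch_size = image_width / num_patches  # ~71.4 pixels per patch

true_x = 357.0  # where the object actually is
patch_index = int(true_x // patch_size)          # all a patch model can say: "Block #4"
patch_center = (patch_index + 0.5) * patch_size  # best reconstruction from that index

error = abs(true_x - patch_center)  # irreducible error, up to half a patch
print(f"patch #{patch_index}, reconstructed x = {patch_center:.1f}, error = {error:.1f}px")
```

No matter how fine the grid, the error only shrinks with the block size; it never reaches zero.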

The New Way: NV-CoT (The "Smooth Slider")

The authors of this paper, Kesen Zhao and his team, came up with a brilliant solution called Numerical Visual Chain-of-Thought (NV-CoT).

Instead of using words or Lego blocks, they gave the robot a smooth slider (like the volume knob on a stereo or a dimmer switch for a light).

  • How it works: When the robot needs to zoom in, it doesn't say "3, 1, 5, 9." Instead, it generates a smooth, continuous number like 3.142.
  • The Magic: Because it's a smooth number, the robot understands that 3.14 is very close to 3.15. If it misses the target by a tiny bit, it knows it was "almost right," not "completely wrong." This allows the robot to point to the exact spot with laser precision.
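The difference between the two kinds of feedback can be sketched in a few lines (hypothetical numbers, not the paper's implementation): a token-style check only says match or no match, while a continuous loss like absolute error tells the robot how close it was.

```python
target = 3.14  # the "true" coordinate (made-up for illustration)

def token_loss(pred: float) -> int:
    # Discrete token view: the string either matches or it doesn't.
    # A near miss and a wild miss both score 1 ("completely wrong").
    return 0 if f"{pred:.2f}" == f"{target:.2f}" else 1

def l1_loss(pred: float) -> float:
    # Continuous view: small miss -> small penalty ("almost right").
    return abs(pred - target)

for pred in (3.15, 9.99):
    print(f"pred={pred}: token_loss={token_loss(pred)}, l1_loss={l1_loss(pred):.2f}")
```

Under the token view, 3.15 and 9.99 are equally wrong; under the continuous view, 3.15 is nearly perfect, which is exactly the graded signal that lets the model point with "laser precision."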

The "Gaussian" and "Laplace" Analogy

The paper mentions some fancy math terms like "Gaussian" and "Laplace" policies. Here is what they mean in plain English:

  • The Gaussian Policy (The "Bell Curve" Guess):
    Imagine the robot is throwing darts at a target. It doesn't just throw one dart; it throws a whole bunch of darts that cluster around the center. It knows, "I'm pretty sure the target is here, but I might be off by a little bit." This helps the robot learn by trying slightly different spots and seeing which one works best.
  • The Laplace Policy (The "Sharper" Guess):
    Sometimes, the robot's aim needs to be sharply concentrated on the center, while a few wild misses shouldn't derail its training. The Laplace method is like a dart thrower whose throws cluster tightly around the bullseye, and whose rare big misses count against them less harshly. It's great for situations where you need to be very precise and not let "outliers" (wild guesses) throw off the learning.

Why is this a Big Deal?

The researchers tested this new method on three different "video game levels" (benchmarks) where robots have to find things in pictures.

  1. It's Faster: The robot learns how to zoom in much quicker than before.
  2. It's More Accurate: The robot finds the exact object (like a pepper mill or a handbag) without accidentally zooming in on the background.
  3. It's Flexible: It works whether the robot is being taught by a teacher (Supervised Fine-Tuning) or learning by trial and error (Reinforcement Learning).

The Bottom Line

Think of the old robots as people trying to navigate a city using a map with only major highways. They could get close, but they often got lost in the details.

NV-CoT gives the robot a GPS with turn-by-turn navigation. It allows the robot to say, "Turn left at 42.5 meters," instead of "Turn left at the next big intersection." This small change makes the robot much smarter, faster, and better at understanding the visual world.

In short: They taught the robot to stop guessing with words and start measuring with smooth, precise numbers.
