VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

Imagine you are teaching a robot to make a cup of coffee.

The Old Way (Traditional VLA Models):
Think of a traditional robot as a student who is given a photo of a kitchen and a written instruction: "Turn on the stove and put the pot on it."
The student looks at the photo once, memorizes everything they see, and then immediately starts moving.

The Problem: What if the photo is blurry? What if the robot can't tell if the stove knob is actually turned? Because the student only looked once, they might guess wrong, turn the wrong knob, or miss the pot entirely. They are "blind" to new details once they start moving. They are like someone trying to solve a maze while wearing a blindfold after the first glance.

The New Way (VLA-Thinker):
Now, imagine a smarter student. This student also gets the photo and the instruction, but they have a special superpower: They can "think" by zooming in.

Instead of just looking once, this student follows a process:

Look: "Okay, I see a stove and a pot."
Think: "Wait, is that knob actually accessible? It looks a bit far away in the picture."
Action (The "Zoom"): "I need a better look." Click! The student uses a tool to zoom in on the knob.
Re-evaluate: "Ah, now I see! The knob is right there. I can reach it."
Act: "Time to turn the knob."
Repeat: "Now, is the pot actually on the burner? Let me zoom in on the pot to be sure."

This is VLA-Thinker. It treats "looking closer" not just as a passive input, but as an active thinking step. It realizes that sometimes, to make a good decision, you have to go back and check the details.

The "Two-Step" Training Recipe

How do you teach a robot to do this? You can't just tell it to "think harder." The paper uses a clever two-step training method:

Step 1: The "Cram Session" (SFT Cold Start)
First, the researchers feed the robot thousands of examples of "good thinking." They show it scenarios where a robot should have zoomed in and what the robot should have said before acting.

Analogy: It's like a teacher giving a student a cheat sheet of "How to solve math problems step-by-step." The student learns the format: "Look -> Think -> Zoom -> Act." This teaches the robot the habit of thinking.

Step 2: The "Game of Life" (Reinforcement Learning)
Once the robot knows how to think, it needs to learn when to think.

The Problem: If the robot zooms in every single time, it wastes time. If it never zooms in, it makes mistakes.
The Solution: The researchers let the robot play the task over and over.
- If it zooms in at the right time and succeeds? Good job! (Reward)
- If it zooms in when it didn't need to? Wasted time. (No reward)
- If it misses the knob because it didn't zoom in? Fail. (No reward)
Analogy: This is like playing a video game where you only get points for winning the level, not for every move you make. The robot learns to balance "thinking hard" with "acting fast" to win the game.

Why Does This Matter?

In the real world, robots often face long, complicated tasks (like "clean the whole kitchen").

Old Robots get confused halfway through because they forgot what the sink looked like or didn't notice a spill.
VLA-Thinker is like a detective. It can pause, look at the evidence again, zoom in on a clue, and then continue its mission.

The Results:
The paper tested this on two major robot challenges:

LIBERO: A test of general robot smarts. VLA-Thinker got 97.5% of the tasks right (a huge jump from the previous best).
RoboTwin: A test of two-armed robots doing complex, long tasks. VLA-Thinker crushed the competition, especially in the hardest, longest tasks.

The Bottom Line

VLA-Thinker changes the rulebook. It stops treating the robot's eyes as a camera that takes one snapshot and starts treating them as a flashlight that the robot can shine wherever it needs to solve a problem. By letting the robot "think with its eyes," it becomes much smarter, more careful, and much better at handling tricky, real-world jobs.

1. Problem Statement

Current Vision-Language-Action (VLA) models typically rely on a text-based Chain-of-Thought (CoT) paradigm. In these systems:

Static Perception: Visual inputs are encoded once as static context embeddings at the beginning of the process.
Passive Reasoning: The reasoning process unfolds primarily in the language space, treating visual observation as a "one-shot" event.
Limitations: This design decouples perception from reasoning, preventing the model from actively revisiting the environment to resolve ambiguities, track subgoals, or recover from intermediate errors. This is particularly detrimental in long-horizon manipulation tasks where visual details may be occluded or ambiguous, and static context is insufficient for dynamic decision-making.

2. Methodology: VLA-Thinker

The authors propose VLA-Thinker, a framework that shifts from "passive observation" to "active perception-reasoning." It treats visual perception as an explicit, dynamically invocable reasoning action.

Core Architecture

Thinking-with-Image Reasoning: Instead of a linear Observation → Reasoning → Action pipeline, VLA-Thinker enables an interleaved process: Reasoning → (Optional) Tool Invocation → New Visual Evidence → Reasoning → Action.
Tool Invocation: The model can call visual tools (specifically a ZOOM-IN mechanism in this work) during intermediate reasoning steps to acquire task-relevant visual details (e.g., inspecting a specific object or knob) before committing to an action.
Trajectory: The output is a multimodal trajectory $\tau = \{T_1, C_1, V_1, \dots, T_k, A_k\}$ , where $T$ is text reasoning, $C$ is a tool call, $V$ is the returned visual evidence, and $A$ is the final action.

Two-Stage Training Strategy

To train a model capable of learning what to reason, when to query, and how to act, the authors introduce a two-stage pipeline:

Stage 1: SFT Cold-Start (Reasoning Activation)
- Challenge: Existing embodied datasets lack explicit CoT annotations.
- Solution: The authors use a large VLM (Qwen3-VL) to synthesize high-quality Embodied CoT data. This data includes structured reasoning steps and explicit tool invocation patterns (identifying keyframes where state changes occur).
- Goal: To distill foundational reasoning capabilities and establish consistent operation formats for tool use.
Stage 2: Group Relative Policy Optimization (GRPO)
- Challenge: SFT alone does not optimize for final task success, especially under sparse rewards.
- Solution: Apply GRPO (a reinforcement learning algorithm) to align complete reasoning-action trajectories with task outcomes.
- Reward Function: Sparse reward based on task success ( $I_{success}$ ) plus a format regularization reward ( $I_{format}$ ) to ensure valid CoT structure.
- Goal: To causally align the model's behavior, teaching it to balance reasoning costs against task success (i.e., learning when not to think to avoid redundancy).

3. Key Contributions

First "Thinking-with-Image" VLA: Introduces the first VLA model that models visual perception as a dynamically invocable reasoning action, enabling Multimodal Embodied Chain-of-Thought.
Novel Training Framework: Proposes a two-stage pipeline combining SFT cold-start (for structured reasoning priors) and GRPO (for trajectory-level alignment under sparse rewards), stabilizing multimodal reasoning behaviors.
State-of-the-Art Performance: Demonstrates that explicitly coupling perception with reasoning significantly improves robustness in long-horizon tasks, outperforming existing baselines on major benchmarks.

4. Experimental Results

The model was evaluated on two benchmarks: LIBERO (language-guided manipulation) and RoboTwin 2.0 (bimanual manipulation).

LIBERO Benchmark:
- VLA-Thinker achieved a 97.5% average success rate.
- This represents a +6.5% improvement over the strong baseline OpenVLA-OFT (91.0%).
- Significant gains were observed in the Spatial (+7.1%) and Long (+10.4%) suites, highlighting improved spatial grounding and long-horizon stability.
RoboTwin 2.0 Benchmark:
- Short Horizon: 62.3% success rate (vs. 21.3% for OpenVLA-OFT).
- Medium Horizon: 70.7% success rate.
- Long/Extra-Long Horizon: 64.6% success rate.
- The performance gap widens as task complexity and horizon length increase, proving the method's effectiveness in reducing error accumulation in long chains.
Ablation Studies:
- Removing the "Thinking-with-Image" mechanism drops performance significantly.
- Using only SFT (95.0%) or only GRPO (88.2%) is inferior to the combined approach (97.5%), confirming that SFT provides necessary priors while GRPO optimizes for task success.
- Training Curves: Show that while the model initially generates redundant reasoning (due to SFT imitation), the RL stage successfully teaches the model to selectively invoke tools, reducing response length while increasing success rates.

5. Significance

VLA-Thinker represents a paradigm shift in embodied AI. By moving away from the assumption that visual input is a static, one-time context, it enables agents to actively interrogate their environment during the decision-making process. This "active perception" capability is crucial for solving complex, long-horizon robotic tasks where static observations are insufficient. The framework is extensible, suggesting that future work can integrate more diverse visual tools (e.g., depth estimation, object segmentation) to further enhance robotic reasoning and robustness.

VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

The "Two-Step" Training Recipe

Why Does This Matter?

The Bottom Line

1. Problem Statement

2. Methodology: VLA-Thinker

Core Architecture

Two-Stage Training Strategy

3. Key Contributions

4. Experimental Results

5. Significance

More like this

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

Linear Programming for Multi-Criteria Assessment with Cardinal and Ordinal Data: A Pessimistic Virtual Gap Analysis

Seven simple steps for log analysis in AI systems

Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

AHC: Meta-Learned Adaptive Compression for Continual Object Detection on Memory-Constrained Microcontrollers