Learning to Think Fast and Slow for Visual Language Models

Imagine you have a brilliant assistant who is great at solving problems, but they have a strange habit: they never stop talking.

Whether you ask them, "What color is this apple?" or "Solve this complex physics equation," they give you a 10-page essay. For the apple, they write a history of fruit cultivation. For the physics problem, they write a novel.

This is the problem with current "Reasoning" AI models. They are trained to think hard (slowly) about everything, which wastes a lot of computing power and time, especially on simple questions.

The paper "Learning to Think Fast and Slow for Visual Language Models" introduces a new AI called DualMindVLM. It teaches the AI to be more like a human: using "fast thinking" for easy tasks and "slow thinking" for hard ones.

Here is how it works, broken down with simple analogies:

1. The Human Brain Analogy: System 1 vs. System 2

Psychologists say humans have two ways of thinking:

System 1 (Fast): Automatic and intuitive. You use this when you see a red light and stop, or when you recognize a friend's face. It's quick and uses little energy.
System 2 (Slow): Deliberate and logical. You use this when you do long division, plan a trip, or solve a tricky riddle. It takes time and effort.

Current AI models are stuck in System 2 mode all the time. They try to "solve" a picture of a cat with the same intensity as a complex geometry proof.

2. The Discovery: The AI Already Knows the Difference

The researchers noticed something interesting. Even without being taught, pre-trained AI models naturally give short answers for easy questions (like "What is this emoji?") and long answers for hard ones (like "Calculate the angle of this triangle").

Think of it like muscle memory. The AI already "feels" when a problem is easy or hard. The problem is that new training methods force the AI to ignore this feeling and just "think harder" for everything, wasting energy.

3. The Solution: The "DualMind" Training

The team created a two-step training process to teach the AI to switch between these modes automatically.

Step A: The "Labeling" Phase (Anchoring)

Imagine you have a pile of homework problems.

The researchers look at how the AI naturally answers them.
If the AI naturally gives a short answer, they tag it "Fast Thinking."
If it naturally gives a long answer, they tag it "Slow Thinking."
They then attach a specific "trigger phrase" to each tag.
- Fast Trigger: "Short Thinking:"
- Slow Trigger: "Long Thinking:"

This is like giving the AI a menu. They aren't forcing the AI to think; they are just showing it that "Short Thinking" is the right tool for the "Short Answer" job, and "Long Thinking" is for the "Long Answer" job.

Step B: The "Practice" Phase (Reinforcement Learning)

Now, they let the AI practice. They give it a question and say:

"Try to answer this using the 'Short Thinking' trigger."
"Try to answer this using the 'Long Thinking' trigger."
"Also, try to answer it however you want (Free-form)."

The AI gets a "reward" (like a gold star) if:

It gets the answer right.
It uses the correct trigger for the difficulty of the question.

If the AI tries to write a novel for a simple emoji question, it gets a lower score. If it writes a short, punchy answer, it gets a high score. Over time, the AI learns to automatically pick the right tool without being told.

4. The Result: A Smarter, Faster Assistant

The results are impressive. The new model, DualMindVLM, is:

More Accurate: It solves hard math and science problems better than previous models because it actually takes the time to think when needed.
Much Faster/Cheaper: For simple questions, it stops talking after a few sentences. This saves a massive amount of "tokens" (the currency of AI computing).
Less Hallucination: Because it doesn't ramble unnecessarily on simple tasks, it makes fewer made-up mistakes (hallucinations).

The Big Picture

Think of the old AI models as a heavy-duty truck used to deliver a single letter. It gets the job done, but it burns a lot of gas and takes up the whole road.

DualMindVLM is like a smart delivery fleet.

For a single letter? It sends a bicycle (Fast Thinking). Quick, efficient, cheap.
For a massive shipment? It sends a truck (Slow Thinking). It takes longer and uses more fuel, but it's necessary to get the job done right.

By teaching the AI to choose the right vehicle for the job, the researchers have created a system that is both smarter and more efficient, mimicking the way our own brains work.

1. Problem Statement

Current Visual Language Models (VLMs) designed for reasoning often suffer from inefficient token usage.

Uniform Long Reasoning: Existing reasoning-oriented VLMs (trained via Supervised Fine-Tuning or Reinforcement Learning like GRPO) are typically trained to generate uniformly long, step-by-step reasoning chains for all queries.
Redundancy: While detailed reasoning is necessary for complex tasks (e.g., math, geometry), it is redundant for simple perception tasks (e.g., identifying an emotion or counting objects). This leads to substantial computational waste and increased latency.
Lack of Adaptivity: Unlike humans, who dynamically switch between "System 1" (fast, intuitive) and "System 2" (slow, deliberate) thinking based on task complexity, current models lack a mechanism to adaptively select the appropriate reasoning depth.

2. Core Observation

The authors observed that pre-trained, general-purpose VLMs already possess an implicit prior regarding response length:

They naturally generate short responses for simple perception/OCR tasks.
They naturally generate longer responses for complex tasks like math or chart analysis.
Existing reasoning methods often override this natural prior by forcing long reasoning on all inputs, thereby destroying the model's inherent efficiency.

3. Methodology: DualMindVLM

The proposed framework, DualMindVLM, leverages this implicit prior to create an explicit, controllable dual-mode thinking mechanism. The training pipeline consists of two stages:

Stage 1: Dual-Mode Anchoring

The goal is to map the model's inherent length tendencies to distinct thinking modes and bind them to explicit control prefixes.

Length Profiling: For each training sample, the base model generates multiple rollouts. The average response length is calculated to determine the sample's "natural" tendency.
Anchoring: Samples are partitioned into two subsets based on length thresholds ($\tau_{fast}$ and $\tau_{slow}$):
- Fast Thinking: Samples with short average lengths are anchored to a "Short Thinking" prefix.
- Slow Thinking: Samples with long average lengths are anchored to a "Long Thinking" prefix.
Prefix Binding: Specific system prompts are used to enforce these modes:
- Short Thinking: triggers concise processing.
- Long Thinking: triggers structured, multi-step reasoning.
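
The length profiling and anchoring steps above can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function name `anchor_mode`, the threshold values, the inclusive comparisons, and the choice to leave samples that fall between the two thresholds unanchored are all assumptions made for the example.

```python
from statistics import mean

# Prefix strings as named in the summary above.
SHORT_PREFIX = "Short Thinking:"
LONG_PREFIX = "Long Thinking:"


def anchor_mode(rollout_lengths, tau_fast, tau_slow):
    """Map a sample's average rollout length to a thinking-mode prefix.

    Returns None for samples between the two thresholds
    (assumption: such ambiguous samples are left unanchored).
    """
    avg_len = mean(rollout_lengths)
    if avg_len <= tau_fast:
        return SHORT_PREFIX
    if avg_len >= tau_slow:
        return LONG_PREFIX
    return None


# Hypothetical token counts from several base-model rollouts per sample.
samples = {
    "emoji_question": [12, 18, 15],
    "geometry_question": [310, 280, 350],
    "borderline_question": [120, 140, 130],
}

# Hypothetical thresholds; the summary does not state the actual values.
anchored = {
    name: anchor_mode(lengths, tau_fast=100, tau_slow=200)
    for name, lengths in samples.items()
}

for name, prefix in anchored.items():
    print(name, "->", prefix)
```

In this toy run the emoji question is anchored to the short prefix, the geometry question to the long prefix, and the borderline question is left without a prefix.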

Stage 2: Dual-Mode Learning (RL with Hybrid Rollouts)

The model is fine-tuned using Group Relative Policy Optimization (GRPO) to internalize these modes and enable autonomous selection.

Hybrid Group Sampling: For each input, a group of $n$ candidate responses is generated.
- Half (Prefix-Conditioned): Generated with the assigned prefix (e.g., "Short Thinking:") to enforce the correct format and behavior.
- Half (Free-Form): Generated without a prefix to allow the model to learn when to select a mode autonomously.
Reward Design: A joint reward function is used:
- Accuracy Reward ( $r_a$ ): 1 if the answer is correct, 0 otherwise.
- Format Reward ( $r_f$ ): Encourages consistency with the anchored mode. It gives a higher score if the generated prefix matches the assigned mode, and a partial score if a valid (but mismatched) prefix is used.
Objective: The GRPO objective optimizes the policy to maximize accuracy while maintaining consistency with the selected thinking mode, effectively teaching the model to switch between System 1 and System 2 based on the input.
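
A minimal sketch of the reward design and the group-relative scoring used by GRPO might look like the following. Several details are assumptions for illustration only: the format reward values (1.0 for a match, 0.5 for a valid but mismatched prefix, 0.0 otherwise), the exact-match accuracy check, the simple sum used to combine $r_a$ and $r_f$, and all function names. The advantage computation follows the standard GRPO recipe of normalizing each reward by the mean and standard deviation of its group.

```python
from statistics import mean, pstdev

VALID_PREFIXES = ("Short Thinking:", "Long Thinking:")


def accuracy_reward(predicted, gold):
    """r_a: 1 if the answer is correct, 0 otherwise (exact match assumed)."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0


def format_reward(response, assigned_prefix, match=1.0, partial=0.5):
    """r_f: full score for the assigned prefix, partial for another valid one.

    The numeric values are assumptions; the summary only says
    "higher" and "partial".
    """
    text = response.lstrip()
    if text.startswith(assigned_prefix):
        return match
    if any(text.startswith(p) for p in VALID_PREFIXES):
        return partial
    return 0.0


def joint_reward(response, predicted, gold, assigned_prefix):
    """Combine r_a and r_f (assumption: an unweighted sum)."""
    return accuracy_reward(predicted, gold) + format_reward(response, assigned_prefix)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group, as in GRPO."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four rollouts for a sample anchored to "Short Thinking:".
# In hybrid sampling, half would be prefix-conditioned and half free-form.
assigned = "Short Thinking:"
rollouts = [
    ("Short Thinking: The apple is red.", "red"),       # prefix-conditioned
    ("Short Thinking: It looks green.", "green"),       # prefix-conditioned
    ("Long Thinking: Step 1 ... so it is red.", "red"),  # free-form
    ("The apple is blue.", "blue"),                      # free-form
]
rewards = [joint_reward(resp, pred, "red", assigned) for resp, pred in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Rollouts that are both correct and use the assigned prefix end up with the largest advantage in the group, which is the signal that nudges free-form generations toward picking the appropriate mode on their own.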

4. Key Contributions

Discovery of Implicit Prior: Identified that pre-trained VLMs naturally exhibit task-dependent response length variations, which can be leveraged rather than overridden.
Dual-Mode Framework: Proposed a two-stage training framework (Anchoring + RL) that creates controllable fast and slow thinking modes without requiring external supervision or manual mode selection.
Efficiency-Performance Balance: Demonstrated that explicit dual-mode training allows models to achieve state-of-the-art (SOTA) reasoning performance while significantly reducing token consumption compared to uniform long-reasoning baselines.

5. Experimental Results

The model was evaluated on six multimodal benchmarks (MathVista, MathVision, MMStar, MMBench, ScienceQA, AI2D) using the Qwen2.5-VL-7B base model.

Performance: DualMindVLM achieved SOTA or near-SOTA accuracy across all benchmarks, outperforming the base model by significant margins (e.g., +7.4% on MathVista, +5.3% on MMBench).
Token Efficiency:
- DualMindVLM reduced average token usage by ~40% compared to other SOTA reasoning models (e.g., OpenVLThinker, VL-Rethinker).
- It maintained high accuracy even under strict token budgets (e.g., 100 tokens), where other reasoning models failed.
Ablation Studies:
- Removing the Anchoring stage caused "mode collapse," where the model defaulted to fast thinking only, degrading performance on complex tasks.
- Removing Dual-Mode RL (using only SFT) resulted in longer responses and lower efficiency.
Hallucination Reduction: DualMindVLM showed superior performance on the HumbleBench hallucination benchmark, suggesting that appropriate reasoning depth (not just length) helps mitigate hallucinations.
Generalization: The approach successfully transferred to different architectures (InternVL3-8B) and scales (Qwen2.5-VL-3B).

6. Significance

Cognitive Alignment: The work bridges the gap between human cognitive science (System 1 vs. System 2) and VLM architecture, creating models that "think" more like humans by adapting effort to problem difficulty.
Cost-Effective Reasoning: By eliminating unnecessary token generation for simple tasks, DualMindVLM offers a practical solution for deploying reasoning-capable VLMs in resource-constrained environments.
Paradigm Shift: It challenges the prevailing "longer is better" reasoning paradigm, providing evidence that adaptive reasoning can be more effective than uniform reasoning.

In conclusion, DualMindVLM demonstrates that by respecting and formalizing a model's intrinsic response-length priors, it is possible to build VLMs that are both highly accurate and computationally efficient.