Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale

This paper introduces a scalable framework for synthesizing over 1 million high-quality, vision-centric reasoning problems. VLMs fine-tuned on this data not only surpass models trained on existing open data, and compete with those trained on closed data, on visual benchmarks, but also improve significantly on text-only, audio, and embodied reasoning tasks through cross-modality transfer.

David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi

Published 2026-02-18

Imagine you are trying to teach a robot how to "think" like a human detective. You want it to look at a picture, not just say what it sees, but to figure out why things are there, spot hidden details, and correct its own mistakes if it gets confused.

For a long time, AI researchers could easily teach robots to think about text (like solving math word problems), but teaching them to think deeply about images was like trying to teach a fish to ride a bicycle. It was hard, slow, and the datasets (the "textbooks" for the AI) were too small and simple.

This paper introduces a new method called Long Grounded Thoughts (LGT). Think of it as a massive, automated factory that builds a super-challenging "training gym" for AI vision models. Here is how it works, broken down into simple concepts:

1. The Problem: The "Caption Trap"

Previous methods tried to teach AI by giving it a picture and a long description (a caption) of that picture, then asking the AI to make up questions.

  • The Analogy: Imagine you are studying for a test by reading a summary of a movie plot. You might remember the main characters, but you'll miss the tiny details in the background.
  • The Result: The AI got bored. It started asking the same easy questions over and over again (like "What color is the car?"). It hit a wall where adding more data didn't make it smarter.

2. The Solution: The Two-Stage Factory

The authors built a two-step assembly line to fix this.

Stage 1: The "Object Detective" (Scale & Diversity)

Instead of just reading a summary, the AI is given a map of the picture with specific "pins" dropped on every single object (a dog, a window, a shoe) along with their exact coordinates.

  • The Analogy: Instead of just saying "There is a messy room," the AI is told: "Look at the red sock at [x,y], the blue cup at [x,y], and the cat at [x,y]."
  • The Magic: This forces the AI to ask questions about specific things. It stops asking generic questions and starts asking, "Is the red sock to the left of the blue cup?" This creates 1 million+ unique, high-quality questions, preventing the AI from getting bored or stuck in a loop.
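The grounding idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's pipeline (which prompts a VLM with the full object annotations): the `spatial_question` helper and the pin values are invented here, but they show how coordinate "pins" turn into specific questions with checkable answers.

```python
# Toy sketch of Stage 1: turning object "pins" (label + coordinates)
# into specific, grounded questions. Even a deterministic rule shows
# why grounding prevents generic "what color is the car?" questions.

def spatial_question(obj_a, obj_b):
    """Build a left/right question and its ground-truth answer
    from two (label, x, y) pins. Smaller x means further left."""
    label_a, xa, _ = obj_a
    label_b, xb, _ = obj_b
    question = f"Is the {label_a} to the left of the {label_b}?"
    answer = "yes" if xa < xb else "no"
    return question, answer

# Invented pins for the "messy room" analogy above.
pins = [("red sock", 120, 340), ("blue cup", 410, 300), ("cat", 250, 150)]
q, a = spatial_question(pins[0], pins[1])
# q == "Is the red sock to the left of the blue cup?", a == "yes"
```

Because the answer comes from the coordinates themselves, every generated question arrives with a verifiable answer key for free.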

Stage 2: The "Puzzle Master" (Complexity)

The first stage creates good questions, but they are still a bit easy. The second stage takes several of these simple questions and mashes them together into one giant, tricky puzzle.

  • The Analogy:
    • Simple Question: "Is the dog wearing a collar?"
    • Simple Question: "Is the dog to the left of the tree?"
    • Stage 2 "Hard" Question: "If the dog is wearing a collar and is to the left of the tree, but the tree is behind the fence, is the collar visible through the fence slats?"
  • The Result: The AI can't just guess; it has to break the problem down into steps, check its work, and backtrack if it makes a mistake. This teaches it complex reasoning.
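The composition step above can be sketched as string assembly. In the actual framework a model merges the questions; the `compose` helper here is a hypothetical stand-in that just shows the shape of the transformation, using the dog/tree/fence example from the bullets.

```python
# Toy sketch of Stage 2: folding several simple grounded facts
# into one multi-step puzzle that can't be answered in a single glance.

def compose(premises, final_query):
    """Join simple premises into one compound question."""
    joined = ", ".join(premises[:-1]) + f", and {premises[-1]}"
    return f"If {joined}, {final_query}"

hard_q = compose(
    ["the dog is wearing a collar",
     "the dog is to the left of the tree",
     "the tree is behind the fence"],
    "is the collar visible through the fence slats?",
)
```

Each premise is individually easy, but the compound question forces the model to hold all of them in mind at once.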

3. The "Thinking Aloud" (Reasoning Traces)

The most important part isn't just the questions; it's the answer key. The AI doesn't just learn the right answer; it learns the thought process to get there.

  • The Analogy: Imagine a student taking a test.
    • Old Way: The teacher just says "The answer is B."
    • New Way (LGT): The teacher writes out the whole thought process: "I thought it was A, but wait, let me check the picture again... Oh, I see the dog is actually on the right. So, my first guess was wrong. Let me backtrack. The answer is B."
  • The Magic: The AI learns to say "Wait, let me double-check" and "I made a mistake." This is called self-correction.
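One way to picture a training example is as a record holding the question, the full trace (including the "wait, let me check" backtrack), and the answer. The record below and the `to_sft_text` serializer are illustrative assumptions, not the paper's exact format.

```python
# Toy sketch of a single training example: the model is trained to
# imitate the whole trace, self-correction included, not just the answer.

example = {
    "question": "Is the dog to the left of the tree?",
    "trace": (
        "The dog appears near the left edge... wait, let me check the "
        "pins again. The dog is at x=410 and the tree is at x=250, so "
        "my first impression was wrong. Backtracking: the dog is "
        "actually to the right of the tree."
    ),
    "answer": "no",
}

def to_sft_text(ex):
    """Serialize a (question, trace, answer) record into one
    supervised fine-tuning target string."""
    return f"Q: {ex['question']}\n<think>{ex['trace']}</think>\nA: {ex['answer']}"
```

Training on strings like this is what teaches the model the habit of pausing, re-checking, and backtracking on its own.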

4. The Results: Superpowers Everywhere

They trained a 7-billion-parameter AI model (a medium-sized brain) on this massive dataset. The results were shocking:

  • Vision: The AI became a master at looking at images, beating many models that were trained on private, expensive data.
  • Text & Audio: Even though they only trained the AI on pictures, the AI got better at solving text puzzles and understanding audio (like music or speech).
  • The Analogy: It's like teaching a pianist to play jazz by only having them practice scales on a piano. Surprisingly, they get better at playing the violin and singing too! The "thinking skills" they learned on images transferred to other subjects.

5. The Big Lesson: "Teach Before You Test"

The paper also discovered a crucial rule for training AI:

  • The Old Way: Throw the AI into a reinforcement learning gym (where it learns by trial and error) immediately.
  • The New Way: First, give it a "classroom" lesson (Supervised Fine-Tuning) where it learns how to think and correct itself. Then put it in the gym.
  • The Result: If you skip the classroom lesson, the AI gets confused and learns nothing. If you teach it the thinking habits first, it learns incredibly fast.
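The recipe reduces to an ordering constraint, sketched below. `train_sft` and `train_rl` are hypothetical stand-ins (each just records that its stage ran), not real training code; the point is only that the classroom stage must come first.

```python
# Toy sketch of "teach before you test": SFT on reasoning traces first,
# then RL. The stand-in trainers record which stages ran, in what order.

def train_sft(model, traces):
    """Classroom: supervised fine-tuning on full reasoning traces."""
    return model + ["sft"]

def train_rl(model, env):
    """Gym: trial-and-error refinement of the habits SFT instilled."""
    return model + ["rl"]

def train(model, traces, env, skip_sft=False):
    """Skipping SFT sends the model to RL without the thinking
    habits it needs, and it learns little."""
    if not skip_sft:
        model = train_sft(model, traces)
    return train_rl(model, env)

stages = train([], traces=None, env=None)
# stages == ["sft", "rl"]
```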

Summary

Long Grounded Thoughts is a recipe for creating a massive, high-quality "thinking gym" for AI. By forcing the AI to look at specific objects, combine simple questions into hard puzzles, and practice "thinking aloud" with self-correction, they created a dataset that makes AI smarter, more reliable, and surprisingly good at tasks it was never explicitly trained on. It's like upgrading the AI's brain from a simple camera to a detective with a magnifying glass and a notebook.
