VisDoT: Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought

Imagine you are trying to explain a complex chart to a friend who has never seen one before. If you just say, "Look at the red bar and tell me how much money it made," your friend might get confused. They might look at the wrong bar, mix up the colors, or forget what the numbers mean.

This is exactly the problem that VisDoT solves for Artificial Intelligence.

Here is the story of the paper, told simply:

The Problem: AI is "Blind" to Details

Current AI models (called Large Vision-Language Models) are like brilliant students who have read every book in the library but have never actually looked at a graph. When you show them a chart and ask a question, they often guess the answer based on the text they see, rather than actually seeing the data.

They might say, "The red bar is the highest," when in reality, the blue bar is taller. They struggle to connect the visual part (the shape, color, and position) with the meaning (the numbers and logic). This is called a lack of "grounding."

The Solution: VisDoT (Visual Decomposition of Thought)

The researchers created a new framework called VisDoT. Think of VisDoT as a training manual that teaches AI to "think like a human" when looking at data.

Instead of letting the AI guess the answer in one giant leap, VisDoT forces the AI to slow down and follow a specific, human-like process. It does this in two main ways:

1. The "Perception" Training (Learning to See)

The researchers realized that humans don't just "look" at a chart; we decode it using specific senses. They taught the AI four specific "superpowers" based on how our brains naturally process images:

Position: "Is this dot higher or lower than that one?" (Like checking who is taller in a lineup).
Length: "Is this bar longer or shorter?" (Like comparing two sticks).
Pattern: "Which color represents 'Sales' and which represents 'Profit'?" (Like matching keys to locks).
Extract: "What is the exact number written here?" (Like reading a price tag).

By training the AI to master these four skills first, it stops guessing and starts actually seeing the data.

2. The "Decomposition" Strategy (Breaking it Down)

This is the secret sauce. The researchers introduced a method called Decomposition-of-Thought (DoT).

Imagine you are asked to solve a math problem: "If the red bar is 50, the blue bar is 30, and the green bar is 20, what is the average?"

Old AI: Tries to guess the answer immediately. It might get confused and say "40" or "100."
VisDoT AI: Breaks the problem into tiny, manageable steps, just like a human would:
1. Step 1 (Perception): "Okay, let me find the red bar. It says 50."
2. Step 2 (Perception): "Now let me find the blue bar. It says 30."
3. Step 3 (Perception): "And the green bar is 20."
4. Step 4 (Logic): "Now I add them up: 50 + 30 + 20 = 100."
5. Step 5 (Logic): "Now I divide by 3, the number of bars. The answer is about 33.3."

By forcing the AI to look first, then think, it avoids the common mistakes of hallucinating numbers or mixing up colors.
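The five steps above can be written out as a small Python sketch. It is only an illustration of the look-first, think-second order: the `chart` dictionary stands in for what a model would read off the image, and the names `perceive` and `average_bars` are made up for this example, not taken from the paper.

```python
# Toy stand-in for a chart: in a real system these values would be
# read from the image by the model's perception steps.
chart = {"red": 50, "blue": 30, "green": 20}


def perceive(chart, color):
    """Perception step: read the value of one bar."""
    return chart[color]


def average_bars(chart, colors):
    """Run all perception steps first, then the logic steps, keeping a trace."""
    trace = []
    values = []
    for color in colors:
        value = perceive(chart, color)
        values.append(value)
        trace.append(f"Perception: the {color} bar is {value}")
    total = sum(values)
    trace.append(f"Logic: the sum is {total}")
    mean = total / len(values)
    trace.append(f"Logic: the average is {mean:.1f}")
    return mean, trace


mean, trace = average_bars(chart, ["red", "blue", "green"])
for line in trace:
    print(line)
```

The trace has three perception lines followed by two logic lines, mirroring Steps 1 to 5.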

The Results: A Smarter AI

When they tested this new method:

It got much better at charts: The AI improved its score on chart questions by over 11% compared to previous models.
It beat the giants: On difficult tests, this smaller, open-source AI performed as well as (or even better than) the massive, expensive models like GPT-4o.
It works beyond charts: Even when they tested the step-by-step prompting on pictures of real-world objects (not just charts), it still performed somewhat better than standard step-by-step reasoning. This suggests that teaching an AI to "look before it leaps" is a broadly useful habit.

The Big Picture

Think of VisDoT as teaching an AI to put on its glasses and use a ruler before trying to solve a puzzle. Instead of rushing to give an answer, it learns to:

Identify the pieces (Perception).
Break the question into small steps (Decomposition).
Solve the puzzle step-by-step.

This makes the AI not only smarter but also more trustworthy, because you can see exactly how it arrived at the answer, just like watching a human solve a problem on a whiteboard.

1. Problem Statement

Large Vision-Language Models (LVLMs) struggle with visual reasoning on data visualizations (charts, graphs, dashboards). The core bottleneck is a lack of perceptual grounding:

Failure to Detect Primitives: LVLMs often fail to reliably detect basic visual elements (e.g., bar heights, axis positions, colors) and align them with semantic representations (e.g., specific data values or categories).
Ineffective Chain-of-Thought (CoT): While CoT improves text-only reasoning, it often fails in visual contexts because models attempt to reason logically before accurately perceiving the visual data. This leads to hallucinations and errors in tasks requiring high-level perceptual alignment (e.g., legend identification, multi-object comparison).
Limitations of Current Approaches: Existing methods rely heavily on instruction tuning or synthetic data generation but lack a mechanism to explicitly separate visual perception from logical reasoning, resulting in poor performance on complex, underspecified queries.

2. Methodology: The VisDoT Framework

The authors propose VisDoT, a framework that enhances LVLMs by emulating human visual decoding processes. It consists of two main components:

A. Perception-Following Task Formalization

Based on Cleveland and McGill's (1984) theory of graphical perception, the authors formalize four core perceptual tasks to guide model attention:

Position: Comparing object positions along a common scale (e.g., x/y-axis) to determine relative order. This is the most accurate channel for quantitative information.
Length: Using distortion-free visual attributes (e.g., bar length) as a secondary cue to position.
Pattern: Linking visual patterns (colors, textures, markers) to legends and data to distinguish categories.
Extract: Reading explicitly shown numerical values (OCR-style tasks).
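To make the four tasks concrete, here is a minimal Python sketch that applies each one to a toy bar chart. The `Bar` class, the `LEGEND` mapping, and the function names are assumptions made for illustration; the paper formalizes these tasks as question types for an LVLM, not as functions over structured data.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    category: str
    color: str
    baseline: float  # where the bar starts on the y-axis
    top: float       # where the bar ends on the y-axis
    label: str       # number printed on the bar


# Hypothetical legend and data for a two-series chart.
LEGEND = {"red": "Sales", "blue": "Profit"}
bars = [
    Bar("Q1", "red", 0.0, 50.0, "50"),
    Bar("Q1", "blue", 0.0, 30.0, "30"),
]


def position(a, b):
    """Position: which bar reaches higher on the common scale?"""
    return a if a.top >= b.top else b


def length(bar):
    """Length: extent of the bar, independent of where it starts."""
    return bar.top - bar.baseline


def pattern(bar, legend):
    """Pattern: map a visual encoding (color) to its legend entry."""
    return legend[bar.color]


def extract(bar):
    """Extract: read the number printed on the bar."""
    return float(bar.label)
```

For example, `pattern(bars[1], LEGEND)` links the blue bar to the "Profit" series, and `position(bars[0], bars[1])` returns the red bar because it reaches higher on the shared axis.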

B. Decomposition-of-Thought (DoT) Prompting

VisDoT introduces a novel prompting strategy that redefines Visual Question Answering (VQA) as a compositional task. Unlike standard CoT, DoT enforces a strict perception-first order:

Question Decomposition: The model is guided to split a complex query into an ordered sequence of sub-questions. Crucially, perception-oriented sub-questions ( $Q^p$ ) must be generated before logic-oriented sub-questions ( $Q^l$ ).
- Example: Instead of asking "Which country had the highest GDP?", the model first asks "What is the GDP of Country A?" and "What is the GDP of Country B?" (Perception), followed by "Compare A and B" (Logic).
Sequential Problem Solving: The model generates intermediate answers for each sub-question, accumulating context to derive the final answer. This ensures that logical reasoning is grounded in verified visual evidence.
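The perception-first ordering and the context accumulation described above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: `SubQuestion`, `order_perception_first`, and `solve_sequentially` are hypothetical names, the GDP figures are made up, and each sub-question is answered by a plain function rather than by a model call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class SubQuestion:
    key: str                                # where the answer is stored in the context
    text: str
    kind: str                               # "perception" (Q^p) or "logic" (Q^l)
    solve: Callable[[Dict[str, Any]], Any]  # answers using the accumulated context


def order_perception_first(subs: List[SubQuestion]) -> List[SubQuestion]:
    """Stable sort: all Q^p before all Q^l, original order kept within each group."""
    return sorted(subs, key=lambda s: s.kind != "perception")


def solve_sequentially(subs: List[SubQuestion]) -> Dict[str, Any]:
    """Answer sub-questions in order, feeding earlier answers to later ones."""
    context: Dict[str, Any] = {}
    for sub in order_perception_first(subs):
        context[sub.key] = sub.solve(context)
    return context


# Made-up values standing in for what would be read off the chart.
GDP = {"A": 2.1, "B": 3.4}

subs = [
    SubQuestion("highest", "Compare A and B", "logic",
                lambda ctx: max(("A", "B"), key=lambda c: ctx[c])),
    SubQuestion("A", "What is the GDP of Country A?", "perception",
                lambda ctx: GDP["A"]),
    SubQuestion("B", "What is the GDP of Country B?", "perception",
                lambda ctx: GDP["B"]),
]

context = solve_sequentially(subs)
print(context["highest"])
```

Even though the logic sub-question is listed first, it runs last, so the comparison only ever sees values that the perception steps have already placed in the context.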

C. Dataset Construction (VisDoTQA)

The authors constructed VisDoTQA, a dataset containing ~332k QA pairs derived from 16,167 chart images (sourced from Pew Research, Statista, OWID, OECD).

Generation: Questions are generated using GPT-4o based on the four perceptual tasks.
Annotation: Answers are generated using LLaMA-3.2-90B with DoT prompting, verified by humans if models disagree.
Structure: The dataset explicitly pairs visual inputs with reasoning trajectories that follow the DoT structure.
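As a rough picture of what one entry pairing a chart with a DoT trajectory might look like, consider the Python sketch below. The field names, the file path, and the example values are all assumptions for illustration; the summary does not specify the actual VisDoTQA schema. The final lines only work out the average number of QA pairs per chart from the two totals quoted above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DoTRecord:
    image_path: str                     # hypothetical field names throughout
    question: str
    task: str                           # Position, Length, Pattern, or Extract
    steps: List[Tuple[str, str, str]]   # (kind, sub-question, intermediate answer)
    answer: str


example = DoTRecord(
    image_path="charts/example.png",    # placeholder path, not a real file
    question="Which country had the highest GDP?",
    task="Position",
    steps=[
        ("perception", "What is the GDP of Country A?", "2.1"),
        ("perception", "What is the GDP of Country B?", "3.4"),
        ("logic", "Compare A and B", "B"),
    ],
    answer="B",
)

# Averages implied by the reported totals (~332k QA pairs, 16,167 images).
qa_pairs = 332_000
images = 16_167
pairs_per_image = qa_pairs / images
print(f"{pairs_per_image:.1f} QA pairs per chart on average")
```

With the reported totals, this works out to roughly 20 QA pairs per chart image.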

3. Key Contributions

Formalization of Perceptual Tasks: Established a theoretical foundation for LVLMs by mapping human graphical perception principles (Position, Length, Pattern, Extract) to machine learning tasks.
DoT Strategy: Introduced a decomposition strategy that forces LVLMs to perform visual grounding before logical reasoning, significantly reducing hallucinations.
Perception-Following Dataset: Created VisDoTQA, a large-scale dataset designed to train models on structured, interpretable visual reasoning.
State-of-the-Art Performance: Demonstrated that a mid-sized open-source model (InternVL) fine-tuned with VisDoT can outperform or match massive closed-source models (GPT-4o, Gemini-Flash) on chart reasoning benchmarks.

4. Experimental Results

The framework was evaluated on ChartQA, ChartQAPro, and the new VisDoTQA benchmark.

Performance Gains:
- ChartQA: VisDoT-finetuned InternVL-2B/4B achieved +11.2% improvement over baselines, reaching performance comparable to GPT-4o.
- ChartQAPro: On this more challenging benchmark, VisDoT-4B surpassed GPT-4o, achieving 34.54% accuracy (vs. GPT-4o's ~32% in specific sub-tasks).
- VisDoTQA: The model improved by +33.2% over the baseline, achieving 76.52% accuracy (vs. GPT-4o's 57.14%).
- Task-Specific Gains: Significant improvements were seen in Position (+29.1%), Pattern (+42.0%), and Extract (+49.6%) tasks.
Efficiency: VisDoT achieved superior results using only 7.4K QA pairs for fine-tuning, compared to 163K/320K pairs used by other distillation pipelines (ChartGemma, ECD), which suggests that perception-aligned supervision is markedly more data-efficient.
Generalization (Zero-Shot):
- Applied to open-domain VQA benchmarks (POPE and MMMU), DoT prompting improved performance by +1.43% and +2.2% respectively over standard CoT, indicating that the perception-logic separation strategy can generalize beyond charts.
Ablation Studies:
- Removing the "Position" task caused the steepest performance drop, confirming its critical role in spatial alignment.
- Combining Perceptual training with DoT yielded a 3.2x larger boost on VisDoTQA compared to CoT alone.
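To put the efficiency figures reported above in proportion, the short Python calculation below compares the fine-tuning set sizes. It uses only the counts quoted in this summary (7.4K versus 163K and 320K) and labels the larger sets by size rather than by pipeline name, since the summary does not state which count belongs to which pipeline.

```python
# Fine-tuning set sizes as quoted in the summary.
visdot_pairs = 7_400
other_pipelines = {"163K set": 163_000, "320K set": 320_000}

# How many times larger each comparison set is than the VisDoT set.
ratios = {name: size / visdot_pairs for name, size in other_pipelines.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x the VisDoT fine-tuning data")
```

In other words, the comparison pipelines used roughly 22 and 43 times as many QA pairs.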

5. Significance

Addressing the Grounding Bottleneck: VisDoT targets a fundamental limitation of LVLMs in visual reasoning by decoupling perception from logic, so that reasoning steps build on visual evidence that has been read explicitly first.
Interpretability: The DoT framework produces explicit, step-by-step reasoning traces that reveal how a model arrived at an answer, making failure cases easier to diagnose (e.g., distinguishing between a perception error and a logic error).
Scalability: The approach demonstrates that smaller, open-source models can rival or exceed proprietary giants in specialized domains (chart reasoning) when trained with the correct cognitive priors and decomposition strategies.
Future Direction: The paper suggests that "perception-first" decomposition is a generalizable strategy for all visual reasoning tasks, not just charts, potentially reducing hallucinations in broader VQA applications.
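The interpretability point above, telling a perception error apart from a logic error, can be illustrated with a small Python sketch. It assumes a trace is a list of (kind, sub-question, answer) tuples and that reference answers are available for every step; both the format and the `first_error` helper are hypothetical, not something the paper defines.

```python
def first_error(trace, reference):
    """Return (kind, sub-question) of the first step that disagrees
    with the reference answers, or None if every step matches."""
    for (kind, question, answer), expected in zip(trace, reference):
        if answer != expected:
            return kind, question
    return None


reference = [50, 30, 80]

# The blue bar is misread, so the later sum is wrong as a consequence.
misread_trace = [
    ("perception", "What is the red bar?", 50),
    ("perception", "What is the blue bar?", 35),
    ("logic", "What is the sum?", 85),
]

# Both bars are read correctly, but the arithmetic is wrong.
miscalc_trace = [
    ("perception", "What is the red bar?", 50),
    ("perception", "What is the blue bar?", 30),
    ("logic", "What is the sum?", 90),
]

print(first_error(misread_trace, reference))
print(first_error(miscalc_trace, reference))
```

Because perception steps come first in a DoT trace, the first mismatch points to the root cause: a wrong reading in the first case, a wrong calculation in the second.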