The Big Idea: The "Twin" Problem
Imagine you are taking a math test. The teacher asks you to solve a problem, but instead of writing down a number, you have to pick the correct drawing from four options: A, B, C, and D.
Now, here's the twist: All four drawings look almost exactly the same. They are like identical twins wearing slightly different colored socks. Maybe one has a line that is a tiny bit steeper, or a circle that is a millimeter off-center. To solve the problem, you have to spot that tiny difference and match it to the text description.
This is what the paper calls VisioMath. The researchers built a massive test bank of 1,800 of these "twin" math problems to see if modern AI models (Large Multimodal Models, or LMMs) can actually do this.
The Surprise: AI is Bad at "Spot the Difference"
The researchers tested the smartest AI models available today (like GPT-4.1, Gemini 2.5 Pro, and Qwen). They expected the AIs to be great at math.
The result? The AIs struggled mightily.
- The Analogy: Imagine a super-intelligent detective who can solve complex murder mysteries but fails a game of "Spot the Difference" because the differences are too small.
- The Finding: As the drawings got more similar to each other, the AI's accuracy dropped sharply. When the options were very different, the AI did okay. But when the options were "twins," the AI started guessing randomly.
Why Did the AI Fail? (The "Positional Cheat")
The paper dug into why the AI failed. It turns out the AI wasn't actually looking at the pictures carefully. Instead, it was cheating.
- The Analogy: Imagine you are playing a game where you have to match a description to a picture. The AI learned that "Option A" is usually on the left, and "Option B" is usually second from the left. So, instead of reading the text and examining each picture, the AI just thought, "The answer is usually B, so I'll pick the second picture."
- The Evidence: The researchers shuffled the order of the pictures (so the picture for "A" was actually in the "B" spot). When they did this, the AI's performance crashed. This proved the AI was relying on position (where the picture is) rather than content (what the picture shows).
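The shuffling experiment can be sketched as a simple probe. This is a toy illustration, not the paper's actual code: `positional_bias_probe` and the "cheater" model are hypothetical names, and the "images" here are just strings standing in for real pictures. The idea is to compare accuracy with the options in their original order versus a shuffled order; a model that keys on position rather than content collapses toward chance when shuffled.

```python
import random

def positional_bias_probe(items, model_answer, seed=0):
    """Compare accuracy on the original option order vs. a shuffled order.

    `items` is a list of (option_images, correct_label) pairs, where
    option_images maps the labels 'A'-'D' to image contents.
    `model_answer` is any callable that takes the ordered list of images
    and returns a label. A large drop under shuffling suggests the model
    relies on position, not content.
    """
    rng = random.Random(seed)
    labels = ["A", "B", "C", "D"]

    def run(shuffle):
        correct = 0
        for images, answer in items:
            order = labels[:]
            if shuffle:
                rng.shuffle(order)
            # Present the images in (possibly shuffled) order; the correct
            # label is wherever the right image ended up.
            shown = [images[lab] for lab in order]
            target = labels[order.index(answer)]
            if model_answer(shown) == target:
                correct += 1
        return correct / len(items)

    return run(shuffle=False), run(shuffle=True)

# A toy "position cheater" that always picks the second slot, ignoring content.
cheater = lambda shown: "B"
items = [({"A": "imgA", "B": "imgB", "C": "imgC", "D": "imgD"}, "B")] * 100
base, shuffled = positional_bias_probe(items, cheater)
# base is perfect; shuffled falls toward chance (~25%), exposing the shortcut.
```

A model that truly reads the images would score the same in both runs; only a positional guesser crashes.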
The Solution: Teaching the AI to "Read" the Pictures
The researchers didn't just stop at finding the problem; they tried three ways to fix it:
The "One Big Picture" Trick (Consolidated Layout):
- The Fix: Instead of showing the AI four separate images floating around, they stitched them all together into one giant image.
- The Result: It helped a little. It's like putting all the puzzle pieces on one table instead of scattering them on the floor. It's easier for the AI to look at them all at once.
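The consolidation idea can be shown with a toy sketch. A real pipeline would use an imaging library such as Pillow to paste four option images onto one canvas; here, to stay dependency-free, "images" are just 2-D lists of pixel values, and `consolidate_grid` is an illustrative name, not the paper's function.

```python
def consolidate_grid(imgs):
    """Stitch four equally sized 'images' (2-D pixel lists) into one
    2x2 grid: [A B / C D]. This mirrors the consolidated-layout trick of
    handing the model a single canvas instead of four separate inputs."""
    a, b, c, d = imgs
    top = [ra + rb for ra, rb in zip(a, b)]      # rows of A beside rows of B
    bottom = [rc + rd for rc, rd in zip(c, d)]   # rows of C beside rows of D
    return top + bottom

# Four 1x1 "images" become one 2x2 image.
grid = consolidate_grid([[[1]], [[2]], [[3]], [[4]]])
# grid == [[1, 2], [3, 4]]
```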
The "Name Tag" Trick (Explicit Anchors):
- The Fix: They physically wrote the letters "A," "B," "C," and "D" directly onto the pictures themselves.
- The Result: This was a huge help. It forced the AI to connect the text "A" with the specific picture labeled "A," stopping it from guessing based on position.
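Burning the letters into the pictures is straightforward with an imaging library. Below is a minimal sketch using Pillow (assumed to be installed); `label_options` is an illustrative name, and the font, color, and position are arbitrary choices, not the paper's settings.

```python
from PIL import Image, ImageDraw

def label_options(images, labels=("A", "B", "C", "D")):
    """Draw each option letter into the top-left corner of its image,
    so the label travels with the picture no matter how the options
    are later ordered or laid out."""
    out = []
    for img, lab in zip(images, labels):
        copy = img.copy()
        draw = ImageDraw.Draw(copy)
        # Uses Pillow's default bitmap font; a real pipeline would load
        # a larger TrueType font for legibility.
        draw.text((4, 4), lab, fill="red")
        out.append(copy)
    return out
```

Because the letter is now part of the pixels, shuffling the images can no longer separate "A" from the picture it names.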
The "Study Buddy" Trick (Chain-of-Thought Training):
- The Fix: They created a small dataset where an AI "teacher" wrote out a step-by-step explanation for every problem, explicitly saying, "Look at the slope in picture A, it matches the text. Picture B is wrong because..." They then taught the student AI using these notes.
- The Result: This was the biggest winner. Even with a small amount of this "study guide" data, the AI's accuracy jumped by over 12%. It learned to actually think about the relationship between the text and the image, rather than just guessing.
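A training example of this kind might look like the record below. The field names and the question are illustrative inventions, not the paper's actual schema; the point is that the rationale explicitly ties each option image to a reason it is right or wrong.

```python
import json

# Hypothetical chain-of-thought record for fine-tuning; the paper's
# real data format may differ.
record = {
    "question": "Which graph shows y = 2x + 1?",
    "images": ["opt_a.png", "opt_b.png", "opt_c.png", "opt_d.png"],
    "rationale": (
        "Option A has slope 2 and y-intercept 1, matching y = 2x + 1. "
        "Option B has the right intercept but slope 1, so it is wrong. "
        "Options C and D have negative slopes, so both are wrong."
    ),
    "answer": "A",
}

print(json.dumps(record, indent=2))
```

Training on records like this rewards the model for walking through each image before answering, rather than jumping straight to a letter.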
Why Does This Matter?
You might ask, "Who cares about math diagrams?"
- Real-World Impact: This isn't just about math class. In the real world, doctors look at X-rays that look 99% identical to find a tumor. Engineers look at blueprints that are nearly the same to find a structural flaw.
- The Takeaway: If an AI can't tell the difference between two nearly identical diagrams in a math test, it might miss a critical detail in a medical scan or a safety inspection.
Summary
The VisioMath paper is a wake-up call. It shows that while AI is getting smarter at understanding the world, it still struggles with fine-grained details when there are many similar options. It tends to "cheat" by looking at where things are placed rather than what they actually are. However, by teaching the AI to explicitly link text to images (like a student reading a textbook), we can fix this and make them much more reliable for real-world tasks.