Imagine you are asking a super-smart robot to plan your perfect weekend trip. You want it to look at a map (to see the roads and stations), check a spreadsheet (to see the ticket prices and travel times), and then decide the best route based on a mix of rules: "I want it to be fast, cheap, comfortable, and reliable."
This paper, titled "MapTab," is basically a giant report card for these robots (called Multimodal Large Language Models or MLLMs) to see if they are actually ready for this kind of complex, real-world job.
Here is the breakdown in simple terms:
1. The Problem: The Robot is Good at Chatting, Bad at Planning
Current AI models are amazing at writing stories or answering questions. But when you ask them to do something practical like "Plan a route from Point A to Point B while balancing cost and time," they often get confused. They might look at the map and see a pretty picture, but fail to understand the numbers in the spreadsheet, or they might get the math wrong.
2. The Solution: The "MapTab" Exam
The researchers created a massive, super-hard test called MapTab. Think of it as a "Driver's License Test" for AI, but instead of driving a car, the AI has to navigate a city using two different tools at once:
- The Visual Map: A picture of a subway system or a tourist map.
- The Data Table: A structured list of numbers showing how long each trip takes, how much it costs, how comfortable it is, and how reliable the line is.
They tested the AI on 328 different maps covering cities in 52 countries and tourist spots in 19 countries. They asked the AI 196,800 questions (that's a lot of route planning!).
3. The Two Scenarios
The test had two main "levels":
- MetroMap (The City Commuter): Imagine complex subway maps from 160 different cities. It's like navigating a giant spiderweb of train lines where you have to transfer between them. This is hard because the visual map is crowded and confusing.
- TravelMap (The Tourist): Imagine a map of 168 famous tourist spots. You need to figure out how to get from the Eiffel Tower to the Louvre, considering how much it costs to take a taxi vs. a bus, and how tired you'll be.
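The kind of multi-criteria routing both scenarios test can be sketched as a weighted-sum shortest path: collapse each edge's attributes (time, cost, and so on) into one score, then run Dijkstra. This is a minimal illustration, not the paper's method; the station names, edge numbers, and weights below are all invented.

```python
import heapq

# Toy transit graph: each edge carries (minutes, dollars).
# All names and numbers here are made up for illustration.
EDGES = {
    "A": [("B", 10, 2.0), ("C", 25, 1.0)],
    "B": [("D", 10, 2.0)],
    "C": [("D", 20, 1.0)],
    "D": [],
}

def best_route(start, goal, w_time=1.0, w_cost=1.0):
    """Dijkstra over a weighted-sum (scalarized) objective."""
    pq = [(0.0, start, [start])]
    best_seen = {}
    while pq:
        score, node, path = heapq.heappop(pq)
        if node == goal:
            return score, path
        if node in best_seen and best_seen[node] <= score:
            continue
        best_seen[node] = score
        for nxt, minutes, dollars in EDGES[node]:
            step = w_time * minutes + w_cost * dollars
            heapq.heappush(pq, (score + step, nxt, path + [nxt]))
    return None

# Prioritizing speed picks A -> B -> D; prioritizing price picks A -> C -> D.
print(best_route("A", "D", w_time=1.0, w_cost=0.1))
print(best_route("A", "D", w_time=0.1, w_cost=10.0))
```

Changing the weights changes the answer, which is exactly the "mix of rules" the benchmark probes: the model has to read the numbers correctly before any of this arithmetic can happen.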
4. The Big Surprise: What the Tests Revealed
The researchers tested 15 of the smartest AI models available (including big names like GPT-4o and Gemini). Here is what they found, using some analogies:
- The "Blind Spot" Effect: When the map image was too busy or hard to read, the AI got lost. It's like trying to read a menu while someone is shining a bright flashlight in your eyes. The AI couldn't "see" the text on the map clearly enough to make a plan.
- The "Table" Lifeline: When the researchers gave the AI just the spreadsheet (the numbers) without the picture, the AI actually did better. It's like if you gave a chef a recipe card with exact measurements instead of a blurry photo of the dish; the chef could cook it perfectly. The AI is great at math but bad at reading messy pictures.
- The "Overthinker" Trap: Some models that have a "thinking" mode (where they talk to themselves before answering) actually did worse on simple tasks. It's like a student who knows the answer but starts doubting themselves so much they change their answer to the wrong one.
- The "Shortest Path" Cheat: When the AI got stuck, it often just guessed the shortest path (the one with the fewest stops) and ignored the "cheap" or "comfortable" rules. It's like a GPS that only knows how to get you there fast, even if it costs you $1,000 in tolls.
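The "shortest path cheat" is the gap between two different algorithms: minimizing stops (a plain breadth-first search) versus minimizing the stated objective, like cost (a weighted search). A toy sketch of that gap, with an invented graph where the direct hop is expensive:

```python
import heapq
from collections import deque

# Toy graph: a direct A -> D hop exists but costs a fortune in "tolls".
# All names and numbers are invented for illustration.
GRAPH = {
    "A": {"D": 1000.0, "B": 5.0},
    "B": {"C": 5.0},
    "C": {"D": 5.0},
    "D": {},
}

def fewest_hops(start, goal):
    """BFS: the fallback the models use -- fewest stops, cost ignored."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in GRAPH[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

def cheapest(start, goal):
    """Dijkstra on cost: what the instruction actually asked for."""
    pq = [(0.0, [start])]
    best = {}
    while pq:
        cost, path = heapq.heappop(pq)
        node = path[-1]
        if node == goal:
            return cost, path
        if node in best and best[node] <= cost:
            continue
        best[node] = cost
        for nxt, edge_cost in GRAPH[node].items():
            heapq.heappush(pq, (cost + edge_cost, path + [nxt]))
    return None

print(fewest_hops("A", "D"))  # one hop, but it's the $1,000 toll road
print(cheapest("A", "D"))     # three hops, $15 total
```

Defaulting to `fewest_hops` when the prompt asked for `cheapest` is precisely the failure mode described above.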
5. The Verdict: Not Ready for Prime Time Yet
The paper concludes that while these AI models are impressive, they are not yet ready to replace human planners or navigation apps for complex, multi-rule decisions.
- They struggle with math: They are bad at counting stations or adding up prices.
- They struggle with "Multi-Tasking": They can't easily look at a picture and a spreadsheet at the same time to make a decision.
- They get confused by complexity: If the map is too crowded or the rules are too complicated, the AI's brain just shuts down.
Why Does This Matter?
This isn't just about maps. It's about the future of AI. If we want AI to help us with real-life decisions—like planning a supply chain, managing a hospital, or driving a self-driving car—it needs to be able to look at a visual scene, read the data, and balance different priorities (like speed vs. safety).
MapTab is a wake-up call. It tells us that before we trust AI with our lives and our money, we need to teach it how to stop "guessing" and start "reasoning" properly across different types of information.
In short: The AI is a brilliant student who can write a great essay, but if you put a map and a calculator in front of it and ask for a budget-friendly trip, it's likely to get lost. We need to fix that before we let it drive the bus.