Imagine you are teaching a very smart, but slightly rigid, robot how to navigate a maze. The maze has a starting point, a treasure, and dangerous holes (lakes) that the robot must avoid.
This paper is like a report card for that robot, specifically testing how well it can learn the rules of navigation versus just memorizing the specific mazes it practiced on.
Here is the breakdown of their experiment in simple terms:
1. The Problem: The Robot is a "Cheat Sheet" User
The researchers found that when they taught these AI models (called Multimodal LLMs) to solve mazes using "Chain-of-Thought" (which is like asking the robot to "think out loud" before answering), the robot got really good at the practice mazes.
But, as soon as they gave the robot a bigger maze or a maze where the treasure was farther away than in the practice sessions, the robot failed miserably.
The Analogy: Imagine a student who memorizes the answers to a specific math worksheet. If you give them the exact same worksheet on a test, they get an A. But if you change the numbers slightly, they fail because they didn't learn the math; they just memorized the pattern. The AI was doing the same thing: it was pattern-matching, not actually "thinking."
2. The Experiment: Changing the "Language"
The researchers wanted to see if changing how they showed the maze and the robot's "thought process" would help the robot learn the actual rules.
They tried four different ways to show the maze:
- Images: A picture of the maze.
- Descriptions: A paragraph of text describing the maze.
- Tables: A structured text table, with one cell per maze square.
- ASCII Grids: A simple, compact text drawing of the maze (like a retro video game).
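To make the contrast concrete, here is a small illustrative sketch (not the paper's code) rendering the same toy maze in two of these text formats. The legend (`S` = start, `T` = treasure, `#` = hole, `.` = open cell) is my own assumption for the example.

```python
# Illustrative sketch: one toy maze, two of the text formats compared above.
# Legend (an assumption, not from the paper): S=start, T=treasure, #=hole, .=open.

maze = [
    ["S", ".", "#"],
    [".", "#", "."],
    [".", ".", "T"],
]

# "ASCII grid" format: a compact text drawing, one row per line.
ascii_grid = "\n".join("".join(row) for row in maze)
print(ascii_grid)

# "Description" format: a paragraph of prose listing each cell.
names = {"S": "the start", "T": "the treasure", "#": "a hole", ".": "an open cell"}
description = " ".join(
    f"Cell ({r}, {c}) is {names[cell]}."
    for r, row in enumerate(maze)
    for c, cell in enumerate(row)
)
print(description)
```

The ASCII grid packs the whole layout into nine characters plus newlines, while the description spells out every cell in words; both carry identical information, which is what lets the researchers isolate the effect of format alone.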
They also tried different ways for the robot to "think":
- Just guessing the answer.
- Writing a story about its next move.
- Drawing the map after every move.
- The Winner: A combination of writing a story about the next move and redrawing the map in a simple text grid.
3. The Big Discovery: "Hybrid Thinking" Wins
The most surprising result was that the robot performed best when it used a hybrid approach for its thinking process.
- The Recipe: The robot would first write a sentence in plain English explaining why it should move (e.g., "The treasure is to the right, so I'll go right"). Then, it would immediately update a simple text grid to show what the map looks like after that move.
- Why it worked: It's like a human solving a puzzle. We don't just stare at the picture; we talk to ourselves ("Okay, if I go here, I hit a wall") and we mentally update the board. By forcing the AI to do both (explain in words + update the visual grid), the AI actually started to understand the logic of the maze, not just the pattern.
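The two-part recipe above can be sketched as a single "thought step": emit one English sentence of justification, then redraw the text grid with the new position. Everything here (the function name, the legend, the canned justification) is my own illustration of the idea, not the paper's implementation; a real system would have the model generate the sentence itself.

```python
# Hedged sketch of one "hybrid" thought step: a verbal justification
# paired with a redrawn text grid. Legend: S=current position, T=treasure,
# #=hole, .=open cell. All names here are illustrative assumptions.

MOVES = {"right": (0, 1), "left": (0, -1), "up": (-1, 0), "down": (1, 0)}

def hybrid_step(grid, pos, move):
    """Return (reasoning_sentence, new_grid, new_pos) for one move."""
    dr, dc = MOVES[move]
    r, c = pos[0] + dr, pos[1] + dc
    # In the real setup the model writes this sentence itself; here it is canned.
    sentence = f"The treasure is to the {move}, so I'll go {move} to ({r}, {c})."
    new_grid = [row[:] for row in grid]
    new_grid[pos[0]][pos[1]] = "."   # vacate the old cell
    new_grid[r][c] = "S"             # mark the new position
    return sentence, new_grid, (r, c)

grid = [["S", ".", "."],
        [".", "#", "."],
        [".", ".", "T"]]
sentence, grid, pos = hybrid_step(grid, (0, 0), "right")
print(sentence)                                # the "talk to yourself" half
print("\n".join("".join(r) for r in grid))     # the "redraw the map" half
```

Chaining calls like this, one per move, reproduces the interleaved words-plus-grid trace that the paper found generalizes best.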
4. The "Text vs. Image" Surprise
You might think that since this is a visual task (a maze), showing the AI a picture would be best. It wasn't.
- The Result: The models that used text descriptions and text grids vastly outperformed the models that looked at actual images.
- The Analogy: It's like trying to teach someone to drive by showing them a video of a car (image) versus giving them a manual with clear, step-by-step instructions and a diagram (text). The AI, surprisingly, understood the "manual" much better than the "video." Even a fancy new method that tried to "dream" in a continuous space (like a human imagining a path) failed to beat the simple text-based approach.
5. The Conclusion: We Need Better "Thinking" Formats
The paper concludes that while AI is getting smarter, it still struggles to generalize (apply what it learned to new, bigger situations) unless we give it the right tools to think.
- The Takeaway: If you want an AI to truly learn a skill and not just memorize examples, you shouldn't just throw data at it. You need to structure its "thought process" in a way that combines clear reasoning (words) with structured updates (grids).
In a nutshell: The AI isn't a genius yet; it's a student who needs the right study guide. When the study guide combined "explaining the logic" with "drawing the map," the student finally learned how to solve the puzzle, even when the puzzle got much harder.