Imagine you are trying to solve a very tricky puzzle, like figuring out how to navigate a maze or piece together a jigsaw puzzle.
For a long time, AI models tried to solve these problems using only words. They would look at a picture and write a long essay about it, hoping their words were smart enough to figure out the answer. But sometimes, words just aren't enough. It's like trying to explain how to tie your shoes using only a dictionary; you need to see the knots and move the laces.
Other models tried to use tools, like a digital pair of scissors to crop an image or a pen to draw a line. But this was clunky. It was like asking a friend to hand you a tool every time you needed to think, rather than having the tool built right into your hand.
Enter ThinkMorph.
ThinkMorph is a new kind of AI that learns to "think" in a way that feels very human: by mixing words and pictures together in a single, flowing conversation.
Here is how it works, broken down into simple concepts:
1. The "Sketch-and-Talk" Strategy
Think of a human solving a complex problem. You might say, "Okay, the red piece goes here," and then you draw a line on the paper to show where it fits. Then you say, "Wait, that doesn't look right," and you erase the line and try a different spot.
ThinkMorph does this digitally. Instead of just writing text, it can:
- Write a thought: "I need to find the duck's beak."
- Draw a thought: It generates a new image with a red box highlighting the duck's beak.
- Write again: "Ah, I see! The beak is pointing right, so the answer is 'Right'."
It treats text and images as partners, not copies of each other. The text explains why it's looking, and the image shows what it's seeing.
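That back-and-forth can be pictured as a simple loop of alternating steps. This is a toy sketch, not ThinkMorph's actual interface; the `Step` type and the hard-coded trace below are invented purely for illustration.

```python
# Toy sketch of an interleaved text-image reasoning trace.
# The Step type and the example trace are hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Step:
    kind: str     # "text" or "image"
    content: str  # a written thought, or a description of a generated image

def run_trace():
    """Alternate text and image steps until an answer is reached."""
    return [
        Step("text",  "I need to find the duck's beak."),
        Step("image", "new image with a red box highlighting the beak"),
        Step("text",  "The beak is pointing right, so the answer is 'Right'."),
    ]

# Text steps explain *why* the model is looking;
# image steps show *what* it is seeing.
final_thought = run_trace()[-1].content
```

The point of the sketch is the shape of the trace: text and image steps interleave in one sequence, rather than the image being a one-time input.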
2. The "Magic Paintbrush" (Emergent Skills)
The most surprising thing about ThinkMorph is that it learned skills nobody explicitly taught it. This is called an "Emergent Property."
Imagine you teach a child to paint by giving them a brush and a canvas. You never tell them, "If you zoom in, you can see details better." But after a while, they figure it out on their own and start zooming in to paint tiny details.
ThinkMorph did the same thing. During its training, it learned to:
- Zoom in on blurry parts of an image to read a sign.
- Draw arrows to trace a path through a maze.
- Highlight specific objects to check their color.
It didn't just learn to answer questions; it learned to manipulate the visual world to help itself think.
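One of those emergent skills, zooming in, comes down to cropping a region and enlarging it. Here is a minimal sketch on a plain grid of numbers standing in for pixels; the image data is made up, and real models operate on far richer representations.

```python
def crop(image, top, left, height, width):
    """Cut a rectangular region out of a 2D grid of pixels."""
    return [row[left:left + width] for row in image[top:top + height]]

def zoom(image, factor):
    """Enlarge by repeating each pixel `factor` times in both directions."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]

# A tiny 4x4 "image"; the blurry detail we care about sits in the top-left 2x2.
img = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
detail = zoom(crop(img, 0, 0, 2, 2), 2)  # → a 4x4 close-up of the 2x2 region
```

Nothing in `crop` or `zoom` is specific to mazes or signs; the interesting part is that the model learned to reach for operations like these on its own.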
3. Knowing When to Switch Gears
ThinkMorph is also smart about efficiency. Sometimes, a problem is so simple that drawing a picture is a waste of time.
If you ask, "What color is the sky?", a human doesn't need to draw a blue circle to know the answer. They just think, "Blue."
ThinkMorph learned to do this too. Even though it was trained to mix pictures and words, it discovered that for easy questions it could switch to text-only mode and save time and computation. It knows when to "draw" and when to just "talk."
4. The "Group Brainstorm" Effect
Finally, ThinkMorph gets better the more it tries. If you ask it a hard question, it can generate several different "paths" to the answer (some with drawings, some with just words).
Think of it like a group of friends brainstorming. One friend suggests a path, another suggests a different angle. By comparing all these attempts and going with the answer most of them agree on, the group is much more likely to be right than one person working alone. ThinkMorph uses this "group brainstorming" method to solve problems it has never seen before.
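The brainstorm amounts to sampling several reasoning paths and taking a majority vote over their final answers. A minimal sketch, with made-up answers standing in for the outputs of independent attempts:

```python
from collections import Counter

def majority_vote(answers):
    """Return the answer that the most reasoning paths agreed on."""
    return Counter(answers).most_common(1)[0][0]

# Final answers from five independent attempts
# (some paths drew pictures along the way, some only used text).
paths = ["Right", "Left", "Right", "Right", "Left"]
majority_vote(paths)  # → "Right"
```

A single path can go wrong in its own idiosyncratic way, but independent paths rarely all make the same mistake, which is why the vote tends to beat any one attempt.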
The Bottom Line
Before ThinkMorph, AI was like a person trying to solve a puzzle while blindfolded, relying only on a description of the pieces.
ThinkMorph is like a person who can see the pieces, draw on them, talk about them, and even change them to see what happens. It's a giant leap forward, showing that when AI learns to truly "think" with both words and images, it becomes much smarter, more flexible, and surprisingly human-like.