ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

Imagine you are trying to solve a mystery, but instead of having one photo of the crime scene, you have two photos taken from completely different angles.

The Problem: The "Single-View" Shortcut
Current AI models (the "detectives" of the digital world) are great at looking at one photo and describing what they see. But when you give them two photos of the same room from different sides, they often get confused.

Instead of putting the two photos together in their mind to build a 3D map, they tend to take a lazy shortcut. They might just look at the first photo, guess the answer, and ignore the second one. Or, they might describe both photos separately but fail to realize that the "red chair" in photo A is the same object as the "red chair" in photo B, just seen from a different side. It's like trying to solve a jigsaw puzzle by looking at only one piece at a time and guessing where it goes, rather than seeing how the pieces fit together.

The Solution: ViewFusion (The "Think Twice" Detective)
The authors of this paper created a new system called ViewFusion. They realized that to solve these puzzles, the AI needs to stop and "think twice" before it answers. They designed a two-step process, like a detective working in two distinct phases:

Phase 1: The "Mental Map" Builder (Spatial Pre-Thinking)

Before the AI tries to answer the question, it is forced to stop and build a mental map.

The Analogy: Imagine you are a tour guide. Before you tell a tourist where the bathroom is, you first have to figure out: "Okay, in this first photo, the door is on the left. In this second photo, the door is on the right. That means I must have turned around 180 degrees between these two shots."
What ViewFusion does: It explicitly writes down these observations. It says, "I see the window here, and the same window there. The camera moved forward and turned left." It creates a shared "workspace" where the two images are fused into a single, consistent 3D understanding.

Phase 2: The "Answer" Phase

Once the mental map is built, the AI uses that map to answer the actual question.

The Analogy: Now that the tour guide has the map in their head, they can confidently say, "Since the camera turned left, the bathroom is actually behind you."
What ViewFusion does: It takes the question (e.g., "Where is the picture frame relative to the piano?") and looks at the mental map it just built to find the answer. Because it already figured out the spatial relationship, the answer is much more accurate.

How They Taught the AI (The Training)
You can't just tell an AI to "think harder." You have to teach it how to think. The authors used a clever training method:

Supervised Learning (The Teacher): They showed the AI thousands of examples where a "smart teacher" wrote out the mental map first, then the answer. The AI learned to copy this two-step pattern.
Reinforcement Learning (The Coach): They then let the AI practice on its own. If the AI tried to skip the "mental map" step and just guessed, it got a "penalty" (no points). If it built a good map and got the right answer, it got a "reward." Over time, the AI learned that taking the time to build the map was the only way to win.

The Result
When they tested ViewFusion on difficult puzzles (called benchmarks), it was significantly better than other top AI models.

Old AI: Looked at one photo, guessed, and was often wrong.
ViewFusion: Looked at both photos, figured out how the camera moved, built a mental 3D model, and then gave the correct answer.

In a Nutshell
ViewFusion is like teaching a student to draw a diagram before solving a math word problem. Instead of rushing to the answer, they first visualize the problem, connect the dots, and then solve it. This simple change—forcing the AI to "think twice" by building a spatial map first—makes it much smarter at understanding the 3D world.

Here is a detailed technical summary of the paper "ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning".

1. Problem Statement

Current Vision-Language Models (VLMs) struggle with multi-view spatial reasoning, specifically the ability to align spatial information across different viewpoints to infer 3D scene structures.

Core Failure Mode: Even when provided with multiple images of the same scene, models often rely on "single-image shortcuts." They describe individual views independently or use superficial correlations rather than establishing cross-view spatial correspondences (e.g., how the camera moved, which objects correspond across views, and how occlusions evolve).
Limitations of Existing RL: While Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) can improve general reasoning, they often fail to induce correct cross-view alignment. Models trained with standard RL frequently exhibit shortcut behaviors, such as answering based on a single salient view or treating other images as incidental noise, leading to brittle performance on viewpoint transformation and occlusion tasks.

2. Methodology: ViewFusion

The authors propose ViewFusion, a two-stage "think twice" framework that explicitly separates cross-view spatial pre-alignment from question answering.

A. The Two-Stage Architecture

Instead of a standard "describe-then-answer" or direct "think-and-answer" approach, ViewFusion enforces a structured generation protocol:

Stage 1: Spatial Pre-Thinking (<spatial_thinking>): The model performs deliberate reasoning to infer viewpoint relations and spatial transformations before addressing the specific question. It establishes an intermediate workspace by:
- Identifying shared visual cues across views.
- Inferring camera movement (rotation/translation).
- Resolving occlusions and object re-identification.
Stage 2: Question-Driven Reasoning (<thinking> & <answer>): The model conditions its final reasoning on the intermediate workspace generated in Stage 1 to produce the answer.

B. Training Strategy

ViewFusion employs a two-stage training pipeline combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL):

Data Preparation:
- SFT Data (18k instances): Original rationales are rewritten into structured traces containing <spatial_thinking>, <thinking>, and <answer> tags using a stronger teacher model (Qwen3-32B).
- RL Data (16k instances): Contains only inputs and ground-truth answers (no rewritten traces) to prevent overfitting to specific rationales.
Optimization Pipeline:
- Stage 1 (SFT): Initializes the model to learn the structured two-stage protocol.
- Stage 2 (RL with GRPO): Further optimizes the policy using Group Relative Policy Optimization (GRPO).
  - Reward Design: A composite reward function is crucial to prevent degenerate behaviors:
    - Answer Correctness ( $r_{ans}$ ): Binary reward for the correct option.
    - Format Validity ( $r_{fmt}$ ): Binary reward ensuring the strict order of <spatial_thinking>, <thinking>, and <answer> tags.
    - Length Regularization ( $r_{len}$ ): Encourages sufficient reasoning length (320–512 tokens) to avoid premature truncation or verbosity.

3. Key Contributions

Diagnosis of Failure Modes: The paper identifies that current MLLMs (even RL-trained ones) fail to align cross-view spatial information, relying instead on view-local shortcuts.
ViewFusion Framework: Introduces a novel "think twice" paradigm that explicitly decouples spatial pre-alignment from question solving, forcing the model to build a consistent spatial model before answering.
Training Recipe: Proposes a specific training strategy combining synthesized structured supervision (SFT) with GRPO-based RL, utilizing a composite reward system to stabilize the two-stage generation behavior.
Empirical Validation: Demonstrates that explicit cross-view pre-thinking yields significant gains over simply encouraging longer deliberation (e.g., outperforming "Thinking" variants of base models).

4. Experimental Results

The method was evaluated on three multi-view benchmarks: MMSI-Bench, MindCube, and ViewSpatial-Bench.

Performance Gains:
- On MMSI-Bench, ViewFusion (SFT+RL) achieved 35.4% accuracy, a 5.3% absolute improvement (17.6% relative) over the strong baseline Qwen3-VL-4B-Instruct (30.1%).
- On MindCube, it achieved 77.0% accuracy compared to the baseline's 37.0%, indicating a massive leap in constructing coherent mental models from limited views.
- It consistently outperformed Qwen3-VL-4B-Thinking (a model trained with high-quality chain-of-thought data), proving that structured cross-view alignment is more effective than just longer deliberation.
Ablation Studies:
- Removing the structured two-stage format ("Free format Reasoning") dropped accuracy to 33.4%.
- Removing GRPO ("w/o GRPO") dropped accuracy to 32.4%, highlighting the necessity of RL for refining correctness.
- Removing the format reward led to unstable generation distributions, confirming its role in enforcing the protocol.
Qualitative Analysis: Visual examples show that ViewFusion correctly identifies viewpoint shifts (e.g., "rotated to the left") and object correspondences, whereas baselines often hallucinate relationships or rely on a single image.

5. Significance

Paradigm Shift: ViewFusion challenges the notion that "more thinking" automatically leads to better reasoning. It demonstrates that for spatial tasks, the structure of the reasoning (explicit pre-alignment) is more critical than the length of the chain.
Robustness: By making cross-view alignment a deliberate first step, the model becomes robust against occlusions and viewpoint changes, which are common failure points in current VLMs.
Practicality: The framework is lightweight and can be applied to existing base models (like Qwen3-VL-4B) without requiring massive architectural changes, offering a practical path toward more reliable multi-view spatial intelligence.

ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

Phase 1: The "Mental Map" Builder (Spatial Pre-Thinking)

Phase 2: The "Answer" Phase

1. Problem Statement

2. Methodology: ViewFusion

A. The Two-Stage Architecture

B. Training Strategy

3. Key Contributions

4. Experimental Results

5. Significance

More like this

Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple

Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation

DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries

Markovian Generation Chains in Large Language Models