ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning

ViewFusion is a two-stage framework that enhances multi-view spatial reasoning in vision-language models by explicitly separating cross-view spatial pre-alignment from question-driven reasoning, achieving significant accuracy improvements on benchmarks like MMSI-Bench through synthetic supervision and reinforcement learning.

Xingjian Tao, Yiwei Wang, Yujun Cai, Yifan Song, Jing Tang

Published 2026-03-09

Imagine you are trying to solve a mystery, but instead of having one photo of the crime scene, you have two photos taken from completely different angles.

The Problem: The "Single-View" Shortcut
Current AI models (the "detectives" of the digital world) are great at looking at one photo and describing what they see. But when you give them two photos of the same room from different sides, they often get confused.

Instead of putting the two photos together in their mind to build a 3D map, they tend to take a lazy shortcut. They might just look at the first photo, guess the answer, and ignore the second one. Or, they might describe both photos separately but fail to realize that the "red chair" in photo A is the same object as the "red chair" in photo B, just seen from a different side. It's like trying to solve a jigsaw puzzle by looking at only one piece at a time and guessing where it goes, rather than seeing how the pieces fit together.

The Solution: ViewFusion (The "Think Twice" Detective)
The authors of this paper created a new system called ViewFusion. They realized that to solve these puzzles, the AI needs to stop and "think twice" before it answers. They designed a two-step process, like a detective working in two distinct phases:

Phase 1: The "Mental Map" Builder (Spatial Pre-Thinking)

Before the AI tries to answer the question, it is forced to stop and build a mental map.

  • The Analogy: Imagine you are a tour guide. Before you tell a tourist where the bathroom is, you first have to figure out: "Okay, in this first photo, the door is on the left. In this second photo, the door is on the right. That means I must have turned around 180 degrees between these two shots."
  • What ViewFusion does: It explicitly writes down these observations. It says, "I see the window here, and the same window there. The camera moved forward and turned left." It creates a shared "workspace" where the two images are fused into a single, consistent 3D understanding.
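The tour guide's reasoning can be checked with a little geometry. The sketch below is only an illustration, not the paper's method: it assumes a flat 2D floor plan, a camera frame where x points right and y points forward, and a second photo taken from the opposite side of the room while facing back toward the first. The helper names (`world_to_camera`, `side`) and all coordinates are made up for this example.

```python
import math

def world_to_camera(point, cam_position, cam_yaw_degrees):
    """Express a floor-plan point in a camera's own frame.

    Yaw is measured counterclockwise; yaw 0 means the camera faces +y.
    Returns (x, y), where x > 0 is to the camera's right and y > 0 is ahead.
    """
    dx = point[0] - cam_position[0]
    dy = point[1] - cam_position[1]
    t = math.radians(cam_yaw_degrees)
    # Project the offset onto the camera's "right" and "forward" axes.
    x_cam = dx * math.cos(t) + dy * math.sin(t)
    y_cam = -dx * math.sin(t) + dy * math.cos(t)
    # Round away tiny floating-point noise from sin(pi).
    return (round(x_cam, 6), round(y_cam, 6))

def side(cam_point):
    """Describe which side of the image centre a point falls on."""
    x = cam_point[0]
    if x < 0:
        return "left"
    if x > 0:
        return "right"
    return "center"

door = (-2.0, 3.0)  # hypothetical door position on the floor plan

# Photo A: camera at the origin, facing +y.
door_in_a = world_to_camera(door, (0.0, 0.0), 0)
# Photo B: camera on the far side of the room, turned around 180 degrees.
door_in_b = world_to_camera(door, (0.0, 6.0), 180)

print(side(door_in_a), "->", side(door_in_b))
```

Under these assumptions the same door lands on opposite sides of the two photos, which is exactly the clue the tour guide uses to infer the turn.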

Phase 2: The "Answer" Phase

Once the mental map is built, the AI uses that map to answer the actual question.

  • The Analogy: Now that the tour guide has the map in their head, they can confidently say, "Since the camera turned left, the bathroom is actually behind you."
  • What ViewFusion does: It takes the question (e.g., "Where is the picture frame relative to the piano?") and looks at the mental map it just built to find the answer. Because it already figured out the spatial relationship, the answer is much more accurate.
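One way to picture the two phases is as a response with two clearly separated parts: the mental map first, then the answer. The sketch below assumes hypothetical `<spatial_thinking>` and `<answer>` tags; this summary does not say what markers the paper actually uses, and `TwoStageResponse` and `parse_response` are names invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class TwoStageResponse:
    spatial_notes: str  # Phase 1: the cross-view "mental map"
    answer: str         # Phase 2: the answer to the question

# Hypothetical tag names, assumed for this sketch only.
RESPONSE_RE = re.compile(
    r"<spatial_thinking>(.*?)</spatial_thinking>\s*<answer>(.*?)</answer>",
    re.DOTALL,
)

def parse_response(text: str) -> Optional[TwoStageResponse]:
    """Return both phases, or None if either one is missing or empty."""
    match = RESPONSE_RE.search(text)
    if match is None:
        return None
    notes = match.group(1).strip()
    answer = match.group(2).strip()
    if not notes or not answer:
        return None
    return TwoStageResponse(spatial_notes=notes, answer=answer)

example = (
    "<spatial_thinking>The same window appears in both photos. "
    "The camera moved forward and turned left.</spatial_thinking>\n"
    "<answer>The picture frame is to the left of the piano.</answer>"
)
parsed = parse_response(example)
print(parsed.answer)
```

Keeping the two parts in separate fields makes it easy to check that the map was actually written before the answer was given.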

How They Taught the AI (The Training)
You can't just tell an AI to "think harder." You have to teach it how to think. The authors used a clever training method:

  1. Supervised Learning (The Teacher): They showed the AI thousands of examples where a "smart teacher" wrote out the mental map first, then the answer. The AI learned to copy this two-step pattern.
  2. Reinforcement Learning (The Coach): They then let the AI practice on its own. If the AI tried to skip the "mental map" step and just guessed, it got a "penalty" (no points). If it built a good map and got the right answer, it got a "reward." Over time, the AI learned that taking the time to build the map was the only way to win.
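The coach's scoring rule can be sketched as a toy reward function. This is a simplified illustration, not the paper's actual reward: the `<spatial_thinking>` and `<answer>` tags, the function name `toy_reward`, and the exact-match comparison are all assumptions, and the sketch only checks that a map is present, not whether the map is any good.

```python
import re

# Hypothetical tag names, assumed for this sketch only.
MAP_RE = re.compile(r"<spatial_thinking>(.+?)</spatial_thinking>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)

def toy_reward(response: str, correct_answer: str) -> float:
    """Give 1.0 only when a mental map is present AND the answer is right."""
    map_match = MAP_RE.search(response)
    answer_match = ANSWER_RE.search(response)
    # Skipping the mental-map step earns nothing, even for a lucky guess.
    if map_match is None or not map_match.group(1).strip():
        return 0.0
    if answer_match is None:
        return 0.0
    predicted = answer_match.group(1).strip().lower()
    return 1.0 if predicted == correct_answer.strip().lower() else 0.0

with_map = (
    "<spatial_thinking>The camera turned left between the photos."
    "</spatial_thinking><answer>B</answer>"
)
guess_only = "<answer>B</answer>"

print(toy_reward(with_map, "B"), toy_reward(guess_only, "B"))
```

Because a correct guess without a map scores the same as a wrong answer, a model trained against a rule like this has no incentive to skip the first phase.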

The Result
When they tested ViewFusion on difficult multi-view puzzles (benchmarks such as MMSI-Bench), it was significantly more accurate than other top AI models.

  • Old AI: Looked at one photo, guessed, and was often wrong.
  • ViewFusion: Looked at both photos, figured out how the camera moved, built a mental 3D model, and then gave the correct answer.

In a Nutshell
ViewFusion is like teaching a student to draw a diagram before solving a math word problem. Instead of rushing to the answer, the student first visualizes the problem, connects the dots, and then solves it. This simple change, forcing the AI to "think twice" by building a spatial map first, makes it much smarter at understanding the 3D world.