Imagine you are playing a game of "Simon Says" with a very smart robot friend. You are standing back-to-back. You see a red ball on your left and a blue ball on your right. Your robot friend, however, is facing the opposite direction. To them, the red ball is on their right, and the blue ball is on their left.
If you want the robot to grab the red ball, a human instantly makes the adjustment: "From my perspective, the red ball is on my left. But from the robot's perspective, it's on their right. I need to flip my brain to see what they see." So you say, "Pick up the red ball on your right."
This paper asks a simple but profound question: Can modern AI models do this "brain flip"?
The researchers tested the latest, most advanced Multimodal Language Models (MLMs)—the kind of AI that can see pictures and read text—to see if they truly understand Visuospatial Perspective Taking (VPT). This is the ability to imagine the world not just from your own eyes, but from someone else's.
Here is the breakdown of their findings, using some everyday analogies:
1. The Two Levels of "Seeing"
The researchers looked at two types of perspective-taking, which they call Level 1 and Level 2.
Level 1 (The "Can You See It?" Test): This is like asking, "Is the cat behind the sofa?" If you can see the cat, and the other person is looking at the sofa, you know they can see it too. If the cat is hidden behind a wall, you know they can't.
- The Result: The AI was pretty good at this. It could usually tell if an object was visible or hidden. It's like a security guard who can easily spot if a door is open or closed.
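For the geometrically inclined, the "Can You See It?" test boils down to a line-of-sight check. Here is a minimal sketch (a toy 2-D layout assumed purely for illustration, not the paper's actual stimuli): an observer can see an object unless a wall segment crosses the straight line between them.

```python
# Toy Level 1 visibility check: assumed 2-D geometry, not the paper's setup.

def _ccw(a, b, c):
    # Cross product: positive if the turn a -> b -> c is counter-clockwise.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, q1, q2):
    # True if segment p1-p2 properly crosses segment q1-q2.
    return (_ccw(p1, p2, q1) * _ccw(p1, p2, q2) < 0
            and _ccw(q1, q2, p1) * _ccw(q1, q2, p2) < 0)

def can_see(observer, obj, walls):
    # Level 1: visible unless some wall blocks the line of sight.
    return not any(segments_cross(observer, obj, w0, w1) for w0, w1 in walls)

walls = [((0.0, -1.0), (0.0, 1.0))]              # one wall along the y-axis
print(can_see((-1.0, 0.0), (1.0, 0.0), walls))   # wall in the way -> False
print(can_see((-1.0, 2.0), (1.0, 2.0), walls))   # clear path      -> True
```

This is the "security guard" level of reasoning: a single geometric test with a yes/no answer, which is plausibly why the models handle it well.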
Level 2 (The "What Does It Look Like?" Test): This is the hard part. It's not just about seeing the object; it's about knowing how it looks to them. If you see a number "6" on the floor, and your friend is standing opposite you, they see a "9".
- The Result: The AI struggled massively here. It's like a robot that can see the number "6" but cannot mentally rotate its head to realize, "Oh, if I were standing there, that would look like a 9."
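The "brain flip" that Level 2 requires is, mathematically, just a rotation. Here is a toy sketch of that computation (assumed 2-D top-down geometry; `side_for_observer` is a hypothetical helper, not anything from the paper):

```python
import math

# Toy Level 2 computation: which side of ANOTHER observer is an object on?
# Facing is compass-style: 0 degrees = facing the same way as me,
# 180 degrees = facing me head-on. Assumed geometry, for illustration only.

def side_for_observer(observer, facing_deg, obj):
    t = math.radians(facing_deg)
    hx, hy = math.sin(t), math.cos(t)              # observer's heading vector
    dx, dy = obj[0] - observer[0], obj[1] - observer[1]  # offset to object
    # The sign of the 2-D cross product says which side the object is on.
    return "left" if hx * dy - hy * dx > 0 else "right"

# The "6 looks like a 9" situation: an object on MY right (+x direction)...
print(side_for_observer((0, 0), 0, (1, 0)))    # same facing as me -> 'right'
print(side_for_observer((0, 0), 180, (1, 0)))  # facing me         -> 'left'
```

Humans do something like this rotation implicitly; the paper's finding is that the models apparently do not.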
2. The "M-Shaped" Failure
The researchers rotated the figure in the image step by step through half a circle, from 0 degrees (facing the same direction as the AI) to 180 degrees (facing the AI head-on).
- The Human Way: Humans get slightly slower and make a few more mistakes as the angle gets weird, but we generally get it right.
- The AI Way: The AI showed a strange "M-shaped" pattern.
- When the figure faced exactly the same way as the AI (0°), it was perfect.
- When the figure faced exactly the opposite way (180°), it was also surprisingly good.
- But in the middle? When the figure was turned slightly to the side (45°, 90°, 135°), the AI's performance crashed.
The Analogy: Think about giving directions. If the other person faces the same way you do, your "left" is their "left." If they face you head-on, you just swap left and right. But if they are turned at an odd angle, no simple swap works; you have to genuinely imagine standing where they stand. The AI seems to know only the two easy tricks: at 0° it copies its own view, and at 180° it swaps left and right. At the in-between angles, where real mental rotation is required, it gets lost.
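To see why the swap trick only works at the endpoints, here is a toy comparison of a true rotation-based answer against a "mirror only at 180°" shortcut. Both functions are my own sketch of the strategies the text describes, not the paper's code, and the geometry is an assumed 2-D top-down layout.

```python
import math

def true_side(facing_deg, x, y):
    # Proper mental rotation: which side of the turned observer is (x, y)?
    t = math.radians(facing_deg)
    hx, hy = math.sin(t), math.cos(t)          # observer's heading vector
    return "left" if hx * y - hy * x > 0 else "right"

def shortcut_side(facing_deg, x, y):
    # The suspected trick: keep my own answer, mirror it only at 180.
    mine = "right" if x > 0 else "left"
    flip = {"right": "left", "left": "right"}
    return flip[mine] if facing_deg == 180 else mine

objects = [(2, 1), (1, 2), (-1, -2), (-2, -1)]
for angle in (0, 45, 90, 135, 180):
    hits = sum(true_side(angle, x, y) == shortcut_side(angle, x, y)
               for x, y in objects)
    print(f"{angle:3d} deg: {hits}/{len(objects)} shortcut answers correct")
```

At 0° and 180° the shortcut matches the true rotation exactly; at the in-between angles it starts getting objects wrong, which mirrors the mid-angle dip the researchers observed.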
3. The "Director" Game
To test this in a real-world scenario, they used a game called the Director Task.
- The Setup: A human "Director" stands on one side of a shelf holding a grid of items. Some of the shelf's compartments have opaque backs, so the Director cannot see what is inside them. The AI (the "Player") stands on the other side, sees everything, and must pick the item the Director asks for.
- The Trap: The Director says, "Pick the smallest star." There are two candidate stars. The smallest one is visible to the AI but hidden from the Director; a slightly larger one is visible to both, and that larger one is the star the Director actually means.
- The Failure: The AI often picked the star that it could see, forgetting that the Director couldn't see it. It couldn't "inhibit" its own privileged view to adopt the Director's limited view.
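The fix the AI misses can be stated in one line of code: filter down to what the Director can see before interpreting "smallest." A toy sketch (hypothetical item layout and field names, not the paper's setup):

```python
# Toy Director task: items are (name, size, shape, visible_to_director).
items = [
    ("star_tiny",  1, "star", False),   # only the Player sees this one
    ("star_small", 2, "star", True),    # the star the Director means
    ("star_big",   5, "star", True),
]

def egocentric_pick(items, shape):
    # The failure mode: smallest star from the Player's privileged view.
    return min((i for i in items if i[2] == shape), key=lambda i: i[1])[0]

def perspective_pick(items, shape):
    # The fix: inhibit privately visible items, THEN apply "smallest".
    shared = [i for i in items if i[2] == shape and i[3]]
    return min(shared, key=lambda i: i[1])[0]

print(egocentric_pick(items, "star"))    # 'star_tiny'  -- the trap
print(perspective_pick(items, "star"))   # 'star_small' -- the Director's star
```

The hard part for the models is not the filtering itself but knowing that the filter must come first, which is exactly the "inhibit your own view" step the paper highlights.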
4. Why Does This Matter?
You might think, "So the AI is bad at spinning numbers around. Who cares?"
The paper argues that this is a critical flaw for the future. We are starting to use AI as teachers, therapists, and collaborative partners.
- If you ask a robot to "hand me the cup on the left," and it doesn't understand that its left is your right, it will hand you the wrong cup.
- If a virtual therapist doesn't understand what you can see in your room versus what they can see on a screen, they might give bad advice.
The Bottom Line
The paper concludes that while these AI models are incredibly smart at reading and recognizing images, they are not actually "thinking" about other people's viewpoints.
They are like a very talented actor who has memorized a script. If the script says "swap left and right," they do it. But if the situation is complex and requires genuinely imagining being someone else, they stumble. They haven't learned to "walk a mile in someone else's shoes"; they've just learned to guess what the shoes look like from the outside.
In short: Current AI is great at seeing the world, but it's still learning how to see the world through someone else's eyes. Until it masters this, we need to be careful about trusting it in complex social or collaborative situations.