Imagine you are sitting at a table with a friend. You are looking at a piece of paper that says "81". Your friend is sitting directly across from you, looking at the other side of that same piece of paper.
If you asked your friend, "What do you see?" a human would answer instantly: "I see 18." Why? Because from your friend's side the paper is rotated 180 degrees: the 8 still looks like an 8 and the 1 still looks like a 1, but their order is reversed, so the whole thing reads backward.
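To make the flip concrete, here is a minimal Python sketch (my own illustration, not code from the paper) that computes how a string of rotation-friendly digits reads after a 180-degree turn:

```python
# Minimal sketch (not from the paper) of the 180-degree flip described above.
# Digits that survive a 180-degree rotation map to themselves or a partner
# (0->0, 1->1, 8->8, 6->9, 9->6); the digit order also reverses.
ROTATED = {"0": "0", "1": "1", "8": "8", "6": "9", "9": "6"}

def flip_180(text: str) -> str:
    """Return how a string of rotatable digits reads after a 180-degree turn."""
    return "".join(ROTATED[ch] for ch in reversed(text))

print(flip_180("81"))  # -> "18", what the friend across the table reads
```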
This paper asks a very simple question: Can AI do this?
The researchers created a test called FlipSet to see if Vision-Language Models (the smart AI systems that can "see" images and talk about them) can understand what someone else sees when they are looking at the world from a different angle.
Here is the breakdown of what they found, using some everyday analogies:
1. The Big Problem: The "Selfie" Habit
The researchers tested 103 different AI models. The result was shocking: 9 out of 10 models failed.
Instead of imagining what the monkey (the "friend" in the picture) sees, the AI almost always just described what it (the camera) sees.
- The Camera sees: "81"
- The Monkey sees: "18"
- The AI says: "81"
The paper calls this Egocentric Bias. It's like a toddler who thinks that because they see a toy on the left, everyone else must see it on the left too. The AI is stuck in its own "selfie" perspective and cannot mentally step into someone else's shoes.
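One way to picture how such a bias can be measured: compare the model's answer against two ground truths, the camera's view and the other viewer's view. The sketch below is purely illustrative; the function and labels are hypothetical, not the paper's actual evaluation code.

```python
# Illustrative only: a hypothetical scorer for one FlipSet-style question.
# camera_view and other_view are the two ground-truth readings; the function
# labels the model's answer as correct, egocentric, or something else.
def classify_answer(model_answer: str, camera_view: str, other_view: str) -> str:
    answer = model_answer.strip()
    if answer == other_view:
        return "correct"      # took the other agent's perspective
    if answer == camera_view:
        return "egocentric"   # reported its own (camera) view instead
    return "other_error"

print(classify_answer("81", camera_view="81", other_view="18"))  # -> "egocentric"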
2. The "Three-Part Puzzle" Experiment
To figure out why the AI was failing, the researchers didn't just ask the hard question. They broke the task down into three smaller puzzles to see which parts the AI could handle and where it broke down.
Think of the AI's brain as having three different tools:
Tool A: The "Social Awareness" Tool (Theory of Mind)
- The Question: "Does the monkey see something different than the camera?"
- The Result: The AI is a genius here (90% success). It knows that if you sit across from me, you see a different view. It understands the concept of "other people."
Tool B: The "Mental Gymnast" Tool (Mental Rotation)
- The Question: "If I take the text '81' and spin it 180 degrees, all by itself, what does it look like?"
- The Result: The AI is okay, but shaky (26% success). It can sometimes figure out how shapes flip, but it's not great at it.
Tool C: The "Grand Finale" (Putting it together)
- The Question: "What does the monkey see?" (This requires using Tool A and Tool B at the same time).
- The Result: The AI crashes completely (10% success).
3. The "Broken Assembly Line"
This is the most important discovery. The AI has the parts, but it can't assemble them.
Imagine a car factory.
- The factory is great at making wheels (Social Awareness).
- The factory is decent at making engines (Mental Rotation).
- But when they try to put the wheels and engine together to make a car (Perspective Taking), the car falls apart.
The researchers call this a Compositional Deficit. The AI knows the pieces, but it lacks the "glue" to combine them in a real-world situation. It's like having a dictionary and a thesaurus, but not knowing how to write a sentence that makes sense.
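As a rough back-of-the-envelope illustration (my own, not an analysis from the paper): if the two sub-skills were simply chained independently, the success rates quoted above would predict roughly 23% on the full task, yet the observed rate is only 10%.

```python
# Back-of-the-envelope check (not from the paper), using the rates quoted above.
p_awareness = 0.90   # Tool A: knows the other viewer sees something different
p_rotation  = 0.26   # Tool B: can mentally rotate the text 180 degrees
p_observed  = 0.10   # Tool C: actually answering "what does the monkey see?"

p_if_independent = p_awareness * p_rotation  # naive "chain the skills" estimate
print(f"Expected if skills simply chained: {p_if_independent:.0%}")  # ~23%
print(f"Observed on the full task:         {p_observed:.0%}")        # 10%
```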
4. Why "Thinking Harder" Didn't Help
The researchers tried to help the AI by asking it to "think step-by-step" (a technique called Chain-of-Thought). Usually, this helps AI solve math or logic problems.
But here, it made things worse. It was like asking a person who is bad at math to "talk through their steps" while trying to solve a problem they don't understand. The AI would confidently say, "I see the camera, so the answer is 81," and then write a long, convincing paragraph explaining why that is correct, even though it was wrong. It was "hallucinating" logic to support its bad guess.
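For readers unfamiliar with the technique, Chain-of-Thought simply means adding an instruction like "reason step by step before answering" to the prompt. Here is a hypothetical sketch of the two prompt styles (illustrative wording only, not the paper's exact prompts):

```python
# Illustrative prompt construction; the wording is hypothetical,
# not the exact prompts used in the paper.
question = "The monkey sits across the table from the camera. What text does the monkey see?"

direct_prompt = question + "\nAnswer with the text only."

chain_of_thought_prompt = (
    question
    + "\nThink step by step: first decide whether the monkey's view differs from"
      " the camera's, then mentally rotate the text, then give the final answer."
)
```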
The Bottom Line
This paper tells us that while AI is getting very good at recognizing objects and understanding language, it is still terrible at "stepping into someone else's shoes."
Current AI models are like mirrors: they reflect exactly what is in front of them. They haven't yet learned to be windows you can look through to see the world from a different angle. Until this egocentric bias is fixed, AI will struggle to interact with us in complex, real-world social situations where understanding another person's point of view is crucial.