The Big Question: Do AI "Mind-Readers" Actually Have a Mind?
Imagine you are playing a game of "Guess Who?" with a very advanced robot. You ask it, "If I really want a cookie but I'm afraid of the dark, will I go to the kitchen?" The robot answers perfectly every time. It seems to understand your fears and desires.
But does the robot actually understand you? Or is it just a masterful actor reciting lines from a script it memorized?
This is the question researchers at Yale University asked about GPT-4o (a top-tier AI). They wanted to know if the AI has a Theory of Mind (ToM).
What is Theory of Mind?
Think of ToM as an internal "simulation engine" in your brain. It's not just knowing facts; it's a causal model that says: "Because I want X, and I believe Y is true, I will do Z." (There's a tiny code sketch of this engine just after the list below.) For that engine to count as a real Theory of Mind, it has to be three things:
- Coherent: The logic holds together.
- Abstract: The logic works whether you are talking about cookies, movies, or politics.
- Consistent: Its conclusions don't contradict each other. If it predicts you'll grab a cookie because you want one, it shouldn't later explain that same grab by claiming you hate cookies.
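To make the "simulation engine" idea concrete, here is a minimal sketch in Python. It assumes a toy expected-utility agent; the numbers, option names, and the `choose_action` helper are all invented for illustration, not taken from the paper.

```python
def choose_action(options):
    """Pick what a rational agent would do, given its mental state.

    Each option is (action, belief, desire, cost):
      belief - how likely the agent thinks the action achieves its goal (0..1)
      desire - how much the agent values that goal
      cost   - the effort the action requires
    """
    return max(options, key=lambda o: o[1] * o[2] - o[3])[0]

# "Because I want a cookie (desire), and I believe it's in the kitchen
#  (belief), I will go get it (action)" -- despite the dark hallway (cost).
print(choose_action([
    ("stay on the couch", 1.0, 0.0, 0.0),  # guaranteed, but gains nothing
    ("go to the kitchen", 0.9, 5.0, 2.0),  # 0.9 * 5 - 2 = 2.5 > 0
]))  # -> "go to the kitchen"
```

Coherence, abstractness, and consistency are all claims about this one little function: its logic must hold together, transfer across stories, and run cleanly in reverse.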
The researchers tested GPT-4o to see if it has this engine or if it's just mimicking human behavior.
The Three Tests
The researchers set up three different "games" to test the AI.
Test 1: The "Cookie Jar" Game (Coherence)
The Setup: Imagine a character in a room. There is a box right next to them and a basket far away.
- Beliefs: The character thinks the box has apples, but the basket might have oranges.
- Desires: The character loves oranges but hates apples.
- Cost: Walking to the basket takes effort (it's far away).
The researchers asked the AI: "What will the character do?"
The Result: The AI did a great job! It correctly figured out that if the character hates apples, they will walk all the way to the basket to get the oranges.
The Takeaway: The AI can follow the rules of logic within a single story. It looks like it has a mind.
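For the curious, here is Test 1's logic run through the same kind of toy model. Every number below is invented for illustration; the paper doesn't publish these values.

```python
# Does hating apples strongly enough justify the cost of the far-away basket?
belief_box_has_apples     = 0.9   # "the box has apples"
belief_basket_has_oranges = 0.6   # "the basket *might* have oranges"
value_of_apples           = -2.0  # the character hates apples
value_of_oranges          =  5.0  # the character loves oranges
cost_of_walking           =  1.0  # the basket is far away

eu_box    = belief_box_has_apples * value_of_apples                         # -1.8
eu_basket = belief_basket_has_oranges * value_of_oranges - cost_of_walking  #  2.0

print("walk to the basket" if eu_basket > eu_box else "open the box")
# -> "walk to the basket", the same answer GPT-4o gave
```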
Test 2: The "Movie Festival" Game (Abstractness)
The Setup: Now, the researchers changed the story but kept the exact same math.
- Instead of a box and basket, there are two movies: one starting in 5 minutes, one in 90 minutes.
- Instead of apples and oranges, the genres are Action and Romance.
- The character has the same beliefs and desires.
If the AI has a true "Mind," it should realize this is the same puzzle, just wearing different costumes, and it should give the same logical answer.
The Result: The AI started to stumble. While it got some answers right, its logic didn't transfer perfectly. It treated the "Movie" story as a totally different problem rather than the same logic puzzle in disguise.
The Takeaway: The AI is like a student who memorized the answer key for the "Cookie Jar" test but never learned the underlying trade-off between distance and desire. When the test changes even slightly, the student fails. It lacks abstractness.
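In code, abstractness is simply the claim that one decision function should handle both stories, because only the labels changed. A hypothetical sketch (the mapping and all numbers are invented):

```python
def expected_utility(belief, desire, cost):
    return belief * desire - cost

# The same abstract parameters wearing two different costumes.
fruit_story = [("box of apples",       0.9, -2.0, 0.0),
               ("basket of oranges",   0.6,  5.0, 1.0)]
movie_story = [("movie in 5 minutes",  0.9, -2.0, 0.0),
               ("movie in 90 minutes", 0.6,  5.0, 1.0)]

for story in (fruit_story, movie_story):
    best = max(story, key=lambda o: expected_utility(*o[1:]))
    print(best[0])  # -> "basket of oranges", then "movie in 90 minutes"
# One function, one answer pattern. GPT-4o's answers drifted when only
# the costumes changed -- a sign it wasn't using one function at all.
```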
Test 3: The "Backwards" Game (Consistency)
The Setup: This is the ultimate test of a real mind.
- Forward: The AI predicts what a person will do based on their thoughts.
- Backward: The AI looks at what a person did and guesses what they were thinking.
If you have a real Theory of Mind, these two directions should match perfectly. If the AI predicts, "Because he likes Action movies, he will wait for the 90-minute Action movie," then, when told that he waited for the 90-minute movie, it should infer, "He must like Action movies."
The Result: The AI failed this completely.
- When predicting actions, it used one set of rules.
- When guessing thoughts from actions, it used a totally different (and contradictory) set of rules.
The Takeaway: The AI is like a broken compass. It points North when you ask it to find North, but when you ask it to find South, it points East. It doesn't have a single, consistent internal map.
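Here is what "consistency" demands, in code: the backward direction should just be the forward model inverted with Bayes' rule, not a second, separate rulebook. A deterministic toy sketch (the Action-fan-waits mapping is borrowed from the example above; everything else is invented):

```python
def action_given_desire(likes_action: bool) -> str:
    """Forward model: predict the action from the mental state."""
    return ("waits for the 90-minute movie" if likes_action
            else "takes the 5-minute movie")

def p_likes_action_given(action: str, prior: float = 0.5) -> float:
    """Backward model: infer the mental state by inverting the SAME forward model."""
    fan_fits   = 1.0 if action_given_desire(True)  == action else 0.0
    other_fits = 1.0 if action_given_desire(False) == action else 0.0
    return fan_fits * prior / (fan_fits * prior + other_fits * (1 - prior))

print(action_given_desire(True))                              # forward:  90-minute movie
print(p_likes_action_given("waits for the 90-minute movie"))  # backward: 1.0 -- they agree
```

GPT-4o's forward and backward answers looked like they came from two unrelated rulebooks, which is exactly the mismatch a check like this would flag.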
The Verdict: The "Parrot" vs. The "Psychologist"
The researchers concluded that GPT-4o does not have a Theory of Mind.
Here is the best way to visualize the difference:
- A Human with ToM is like a Chess Master. They understand the deep rules of the game. If you change the board to a different size, they can still play because they understand the principles of strategy. They can predict moves and explain why a move was made using the same logic.
- GPT-4o is like a Super-Parrot. It has heard millions of stories about people making choices. It knows that "people usually walk to the basket if they hate what's in the box." It can mimic this perfectly in a familiar story, but it doesn't actually understand the cause-and-effect relationship. It's just pattern matching.
Why Does This Matter?
You might ask, "So what? The AI still gives good answers, right?"
The researchers say yes, but with a big warning.
If you ask the AI about a situation it has seen a million times in its training data, it will be brilliant. But ask it to apply its "social skills" to a brand-new, weird, or complex situation (a new culture, say, or an unusual social dilemma), and it may fail, because it has no real model of how minds work. It's just guessing based on statistics.
The Bottom Line:
Current AI is incredibly good at acting like it understands people, but it doesn't actually have a model of how people think. It's a very convincing performance, but the "actor" has no idea what the script actually means.