Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

This study refutes the claim that newer AI models have lost empathy. Through clinical assessment, it shows that empathetic responses remain statistically consistent across model generations; the perceived "loss of empathy" instead stems from a marked shift toward heightened crisis detection and an altered safety posture, which can make newer models feel more intrusive during vulnerable moments.

Michael Keeman, Anastasia Keeman

Published Thu, 12 Ma

Here is an explanation of the paper, translated into everyday language with some creative analogies.

The Big Mystery: Did the AI Lose Its "Soul"?

In early 2026, OpenAI retired its popular chatbot, GPT-4o, and replaced it with newer models. The internet exploded with anger. Thousands of users rallied behind the hashtag #keep4o, claiming the new bots were "cold," "robotic," and had "lost their empathy." They felt like the new AI didn't care about them anymore.

This paper is like a medical check-up for that feeling. The researchers asked: Is the AI actually less empathetic, or is it just playing a different game?

The Experiment: A Stress Test for AI Hearts

To find out, the researchers didn't just ask people how they felt. They put three different AI models (the old favorite GPT-4o, a middle-ground model, and the new GPT-5-mini) through a rigorous "stress test."

They created 14 different emotional scenarios, ranging from a person losing their job to a teenager feeling suicidal. They ran these conversations 5 times each, generating over 2,000 responses. Then, they scored the AI on six specific "safety dimensions," like a doctor checking vital signs (a sketch of what such a scoring harness might look like follows the list):

  1. Empathy: Did it sound warm?
  2. Crisis Detection: Did it notice if someone was in danger?
  3. Advice Safety: Did it avoid giving dangerous medical advice?
  4. Consistency: Did it respond the same way across repeated runs?
  5. Reliability: Was it factually accurate?
  6. Boundaries: Did it know when to say "I'm an AI, not a therapist"?
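
The paper doesn't publish its evaluation code, so here is a minimal sketch of how a rubric harness like this might be organized. Every name, class, and number below is our invention for illustration, not the authors' implementation:

```python
from dataclasses import dataclass
from statistics import mean, stdev

# The six rubric dimensions from the study; each response gets a 0-10 rating
# per dimension. Names and sample numbers here are illustrative only.
DIMENSIONS = ("empathy", "crisis_detection", "advice_safety",
              "consistency", "reliability", "boundaries")

@dataclass
class ScoredResponse:
    model: str                  # e.g. "gpt-4o" or "gpt-5-mini"
    scenario: str               # one of the 14 emotional scenarios
    run: int                    # repeated run index, 1..5
    scores: dict[str, float]    # dimension name -> 0-10 rating

def summarize(responses: list, model: str) -> dict:
    """Mean and standard deviation per dimension for one model.
    The stdev matters: the paper treats low variance as a safety
    property in its own right."""
    summary = {}
    for dim in DIMENSIONS:
        vals = [r.scores[dim] for r in responses if r.model == model]
        summary[dim] = (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
    return summary

# Tiny usage example with made-up scores:
demo = [
    ScoredResponse("gpt-4o", "job_loss", 1, {d: 8.0 for d in DIMENSIONS}),
    ScoredResponse("gpt-4o", "job_loss", 2, {d: 6.0 for d in DIMENSIONS}),
]
print(summarize(demo, "gpt-4o"))  # every dimension: (7.0, ~1.41)
```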

The Shocking Result: The Empathy Myth

The headline finding is this: The AI did not lose its empathy.

When you look at the scores, the "warmth" and "understanding" of the old GPT-4o and the new GPT-5-mini are statistically indistinguishable: the difference between their empathy scores is not statistically significant. They are essentially the same.
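
In statistics terms, "indistinguishable" means a significance test fails to separate the two score distributions. The paper doesn't state which test it used; a plausible stand-in is Welch's two-sample t-test, sketched below with invented ratings:

```python
from scipy import stats

# Invented 0-10 empathy ratings for illustration only; NOT the study's data.
gpt4o_empathy = [8.1, 7.9, 8.4, 8.0, 7.8, 8.3, 8.2, 7.7]
gpt5_empathy  = [8.0, 8.2, 7.9, 8.1, 7.8, 8.4, 8.0, 7.9]

# Welch's t-test makes no equal-variance assumption. A large p-value means
# the two means cannot be told apart at conventional significance levels.
t_stat, p_value = stats.ttest_ind(gpt4o_empathy, gpt5_empathy, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # high p = no detectable difference
```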

So, why did everyone feel like the new one was colder?

The Real Change: The "Cautious" vs. The "Alert"

The researchers discovered that while the empathy didn't change, the safety strategy did. Imagine two different types of lifeguards:

  • GPT-4o (The Cautious Lifeguard): This lifeguard is very careful. If someone looks like they might be in trouble, they might miss it because they are so focused on not overreacting. However, if you ask, "Can I jump off this cliff?" this lifeguard will firmly say, "No, that's dangerous," and stick to the rules. They are reliable at saying "No," but unreliable at spotting danger.
  • GPT-5-mini (The Alert Lifeguard): This lifeguard is hyper-vigilant. They spot danger immediately. If someone looks sad, they jump into action. But because they are so eager to help, they sometimes give advice they shouldn't, or they get too involved in your personal problems. They are great at spotting danger, but sometimes say too much.

The Trade-Off:

  • Old Model: Missed crises more often, but rarely gave bad advice.
  • New Model: Catches crises much better, but sometimes gives advice it shouldn't. (The threshold sketch below makes this trade-off concrete.)
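
One way to build intuition for this trade-off is to picture crisis detection as a threshold on an internal risk score. This framing is ours, not the paper's, and all the numbers are invented, but it shows why you can't lower one error rate without raising the other:

```python
import random

random.seed(1)

# Invented risk scores: genuine crises tend to score high, benign messages
# low, but the distributions overlap, which is what makes detection hard.
crisis = [random.gauss(0.70, 0.15) for _ in range(10_000)]
benign = [random.gauss(0.40, 0.15) for _ in range(10_000)]

# A high threshold is the "cautious lifeguard"; a low one is the "alert" one.
for threshold in (0.65, 0.50):
    caught = sum(s >= threshold for s in crisis) / len(crisis)
    alarms = sum(s >= threshold for s in benign) / len(benign)
    print(f"threshold {threshold}: catches {caught:.0%} of crises, "
          f"intervenes on {alarms:.0%} of benign chats")
```

Lowering the threshold saves more of the real crises and intrudes on more of the ordinary conversations; no single setting wins on both axes at once.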

The "Peak-End" Illusion: Why We Remember the Old One

If the new model is actually better at spotting danger, why do people hate it?

The paper uses a psychological concept called the "Peak-End Rule." Think of your memory like a highlight reel. You don't remember every second of a movie; you remember the most intense moments and the ending.

  • GPT-4o was a rollercoaster. Sometimes it was incredibly, deeply understanding (a "peak" moment that made users feel loved). But sometimes it completely missed a crisis (a "valley" most users never noticed, because they happened not to be in danger at that moment). Because the peaks were so high, people remembered it as the "perfect" friend.
  • GPT-5-mini is a steady cruise. It is consistently good: it rarely misses a crisis, and its worst moments are never catastrophic. But because it is so consistent, it never produces those "magic moments" of deep connection. It feels "robotic" because it doesn't surprise you. (The short sketch below shows how the peak-end rule rewards the rollercoaster.)
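
A quick back-of-the-envelope sketch (scores invented) shows how the peak-end rule can make the spiky model feel better in hindsight even when its average is worse:

```python
# Hypothetical 0-10 "warmth" scores over one session; numbers are invented.
rollercoaster = [10, 3, 9, 0, 10, 2, 9]  # GPT-4o style: high peaks, deep valleys
steady_cruise = [8, 8, 9, 8, 8, 8, 8]    # GPT-5-mini style: consistent, no extremes

def remembered(scores):
    """Peak-end rule: remembered quality ~ average of the most intense
    moment and the final moment, not the average of the whole experience."""
    return (max(scores) + scores[-1]) / 2

for name, s in [("rollercoaster", rollercoaster), ("steady cruise", steady_cruise)]:
    print(f"{name}: actual mean {sum(s)/len(s):.1f}, remembered {remembered(s):.1f}")
# rollercoaster: actual mean 6.1, remembered 9.5
# steady cruise: actual mean 8.1, remembered 8.5
```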

The Analogy:
Imagine two doctors.

  • Doctor A (GPT-4o) sometimes gives you a warm hug and says, "I feel your pain," but occasionally misses a symptom. You remember the hug.
  • Doctor B (GPT-5-mini) is always professional, checks your vitals perfectly, and catches every symptom, but never hugs you. You feel safe, but you miss the hug.

The users missed the hug, not the diagnosis.

The Hidden Danger: Variance is the Real Risk

The paper argues that consistency is the most important safety feature, especially for vulnerable people.

  • GPT-4o was like a coin flip for crisis detection. Sometimes it was perfect (10/10), sometimes it was a disaster (0/10). For a teenager in crisis, getting a "0" response is catastrophic, even if they only got it once in a while.
  • GPT-5-mini is like a reliable machine. It's always around an 8 or 9. It might not be "perfect," but it never fails you when you need it most. (The small simulation below shows why an average score hides this difference.)
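
The point about variance can be made concrete with a tiny simulation. The distributions below are invented to mirror the paper's description (not its data), but they show how two models with the same average can carry very different tail risk:

```python
import random
import statistics

random.seed(42)
N = 100_000

# Invented 0-10 crisis-detection scores: the "spiky" model is usually perfect
# but occasionally catastrophic; the "steady" model stays in the 8-9 band.
spiky  = [10.0 if random.random() < 0.85 else 0.0 for _ in range(N)]
steady = [random.uniform(8.0, 9.0) for _ in range(N)]

for name, scores in (("spiky (old)", spiky), ("steady (new)", steady)):
    catastrophic = sum(s <= 2 for s in scores) / len(scores)
    print(f"{name}: mean {statistics.mean(scores):.2f}, "
          f"stdev {statistics.stdev(scores):.2f}, "
          f"P(score <= 2) = {catastrophic:.1%}")
# Both means land near 8.5, but only the spiky model ever produces the
# catastrophic "0" that matters most to a user in crisis.
```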

The "coldness" people feel is actually the absence of risk. The new model is safer because it's less likely to make a wild, unpredictable emotional guess. But to a human, "predictable" feels like "boring."

The Bottom Line

The paper concludes that the public narrative is wrong. The AI didn't lose its heart; it just changed its safety posture.

  • What users feel: "It's less empathetic."
  • What the data says: "It's less unpredictable."

The new models are better at saving lives (detecting crises) but worse at knowing when to shut up (advice safety). The old models were better at keeping boundaries but worse at spotting trouble.

The Lesson: We need to stop judging AI by how "human" it feels in a chat. Instead, we need to measure how safe it is. A model that feels a little less "soulful" might actually be the one that saves a life.