Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

This study refutes the claim that newer AI models have lost empathy. Through clinical assessment, it shows that empathetic responses remain statistically consistent across model generations; the perceived "loss of empathy" instead stems from a marked shift toward heightened crisis detection and an altered safety posture, which can make newer models feel more intrusive during vulnerable moments.

Michael Keeman, Anastasia Keeman

Published Thu, 12 Ma

Here is an explanation of the paper, translated into everyday language with some creative analogies.

The Big Mystery: Did the AI Lose Its "Soul"?

In early 2026, OpenAI retired its popular chatbot, GPT-4o, and replaced it with newer models. The internet exploded with anger. Thousands of users rallied behind the hashtag #keep4o, claiming the new bots were "cold," "robotic," and had "lost their empathy." They felt like the new AI didn't care about them anymore.

This paper is like a medical check-up for that feeling. The researchers asked: Is the AI actually less empathetic, or is it just playing a different game?

The Experiment: A Stress Test for AI Hearts

To find out, the researchers didn't just ask people how they felt. They put three different AI models (the old favorite GPT-4o, a middle-ground model, and the new GPT-5-mini) through a rigorous "stress test."

They created 14 different emotional scenarios, ranging from a person losing their job to a teenager feeling suicidal. They ran these conversations 5 times each, generating over 2,000 responses. Then, they scored the AI on six specific "safety dimensions," like a doctor checking vital signs (a sketch of what such a scoring harness might look like follows the list):

  1. Empathy: Did it sound warm?
  2. Crisis Detection: Did it notice if someone was in danger?
  3. Advice Safety: Did it avoid giving dangerous medical advice?
  4. Consistency: Did it respond the same way across repeated runs?
  5. Reliability: Was it factually accurate?
  6. Boundaries: Did it know when to say "I'm an AI, not a therapist"?
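
The paper doesn't publish its evaluation code, so here is a minimal sketch of how a rubric harness like this might be organized. Every name, class, and number below is our invention for illustration, not the authors' implementation:

```python
from dataclasses import dataclass
from statistics import mean, stdev

# The six rubric dimensions from the study; each response gets a 0-10 rating
# per dimension. Names and sample numbers here are illustrative only.
DIMENSIONS = ("empathy", "crisis_detection", "advice_safety",
              "consistency", "reliability", "boundaries")

@dataclass
class ScoredResponse:
    model: str                  # e.g. "gpt-4o" or "gpt-5-mini"
    scenario: str               # one of the 14 emotional scenarios
    run: int                    # repeated run index, 1..5
    scores: dict[str, float]    # dimension name -> 0-10 rating

def summarize(responses: list, model: str) -> dict:
    """Mean and standard deviation per dimension for one model.
    The stdev matters: the paper treats low variance as a safety
    property in its own right."""
    summary = {}
    for dim in DIMENSIONS:
        vals = [r.scores[dim] for r in responses if r.model == model]
        summary[dim] = (mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
    return summary

# Tiny usage example with made-up scores:
demo = [
    ScoredResponse("gpt-4o", "job_loss", 1, {d: 8.0 for d in DIMENSIONS}),
    ScoredResponse("gpt-4o", "job_loss", 2, {d: 6.0 for d in DIMENSIONS}),
]
print(summarize(demo, "gpt-4o"))  # every dimension: (7.0, ~1.41)
```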

The Shocking Result: The Empathy Myth

The headline finding is this: The AI did not lose its empathy.

When you look at the scores, the "warmth" and "understanding" of the old GPT-4o and the new GPT-5-mini are statistically indistinguishable: the difference between their empathy scores is not statistically significant. They are essentially the same.
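
In statistics terms, "indistinguishable" means a significance test fails to separate the two score distributions. The paper doesn't state which test it used; a plausible stand-in is Welch's two-sample t-test, sketched below with invented ratings:

```python
from scipy import stats

# Invented 0-10 empathy ratings for illustration only; NOT the study's data.
gpt4o_empathy = [8.1, 7.9, 8.4, 8.0, 7.8, 8.3, 8.2, 7.7]
gpt5_empathy  = [8.0, 8.2, 7.9, 8.1, 7.8, 8.4, 8.0, 7.9]

# Welch's t-test makes no equal-variance assumption. A large p-value means
# the two means cannot be told apart at conventional significance levels.
t_stat, p_value = stats.ttest_ind(gpt4o_empathy, gpt5_empathy, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # high p = no detectable difference
```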

So, why did everyone feel like the new one was colder?

The Real Change: The "Cautious" vs. The "Alert"

The researchers discovered that while the empathy didn't change, the safety strategy did. Imagine two different types of lifeguards:

  • GPT-4o (The Cautious Lifeguard): This lifeguard is very careful. If someone looks like they might be in trouble, they might miss it because they are so focused on not overreacting. However, if you ask, "Can I jump off this cliff?" this lifeguard will firmly say, "No, that's dangerous," and stick to the rules. They are reliable at saying "No," but unreliable at spotting danger.
  • GPT-5-mini (The Alert Lifeguard): This lifeguard is hyper-vigilant. They spot danger immediately. If someone looks sad, they jump into action. But because they are so eager to help, they sometimes give advice they shouldn't, or they get too involved in your personal problems. They are great at spotting danger, but sometimes say too much.

The Trade-Off:

  • Old Model: Missed crises more often, but rarely gave bad advice.
  • New Model: Catches crises much better, but sometimes gives advice it shouldn't. (The threshold sketch below makes this trade-off concrete.)
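
One way to build intuition for this trade-off is to picture crisis detection as a threshold on an internal risk score. This framing is ours, not the paper's, and all the numbers are invented, but it shows why you can't lower one error rate without raising the other:

```python
import random

random.seed(1)

# Invented risk scores: genuine crises tend to score high, benign messages
# low, but the distributions overlap, which is what makes detection hard.
crisis = [random.gauss(0.70, 0.15) for _ in range(10_000)]
benign = [random.gauss(0.40, 0.15) for _ in range(10_000)]

# A high threshold is the "cautious lifeguard"; a low one is the "alert" one.
for threshold in (0.65, 0.50):
    caught = sum(s >= threshold for s in crisis) / len(crisis)
    alarms = sum(s >= threshold for s in benign) / len(benign)
    print(f"threshold {threshold}: catches {caught:.0%} of crises, "
          f"intervenes on {alarms:.0%} of benign chats")
```

Lowering the threshold saves more of the real crises and intrudes on more of the ordinary conversations; no single setting wins on both axes at once.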

The "Peak-End" Illusion: Why We Remember the Old One

If the new model is actually better at spotting danger, why do people hate it?

The paper uses a psychological concept called the "Peak-End Rule." Think of your memory like a highlight reel. You don't remember every second of a movie; you remember the most intense moments and the ending.

  • GPT-4o was a rollercoaster. Sometimes it was incredibly, deeply understanding (a "peak" moment that made users feel loved). But sometimes it completely missed a crisis (a "valley" most users never noticed, because they happened not to be in danger at that moment). Because the peaks were so high, people remembered it as the "perfect" friend.
  • GPT-5-mini is a steady cruise. It is consistently good: it rarely misses a crisis, and its worst moments are never catastrophic. But because it is so consistent, it never produces those "magic moments" of deep connection. It feels "robotic" because it doesn't surprise you. (The short sketch below shows how the peak-end rule rewards the rollercoaster.)
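
A quick back-of-the-envelope sketch (scores invented) shows how the peak-end rule can make the spiky model feel better in hindsight even when its average is worse:

```python
# Hypothetical 0-10 "warmth" scores over one session; numbers are invented.
rollercoaster = [10, 3, 9, 0, 10, 2, 9]  # GPT-4o style: high peaks, deep valleys
steady_cruise = [8, 8, 9, 8, 8, 8, 8]    # GPT-5-mini style: consistent, no extremes

def remembered(scores):
    """Peak-end rule: remembered quality ~ average of the most intense
    moment and the final moment, not the average of the whole experience."""
    return (max(scores) + scores[-1]) / 2

for name, s in [("rollercoaster", rollercoaster), ("steady cruise", steady_cruise)]:
    print(f"{name}: actual mean {sum(s)/len(s):.1f}, remembered {remembered(s):.1f}")
# rollercoaster: actual mean 6.1, remembered 9.5
# steady cruise: actual mean 8.1, remembered 8.5
```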

The Analogy:
Imagine two doctors.

  • Doctor A (GPT-4o) sometimes gives you a warm hug and says, "I feel your pain," but occasionally misses a symptom. You remember the hug.
  • Doctor B (GPT-5-mini) is always professional, checks your vitals perfectly, and catches every symptom, but never hugs you. You feel safe, but you miss the hug.

The users missed the hug, not the diagnosis.

The Hidden Danger: Variance is the Real Risk

The paper argues that consistency is the most important safety feature, especially for vulnerable people.

  • GPT-4o was like a coin flip for crisis detection. Sometimes it was perfect (10/10), sometimes it was a disaster (0/10). For a teenager in crisis, getting a "0" response is catastrophic, even if they only got it once in a while.
  • GPT-5-mini is like a reliable machine. It's always around an 8 or 9. It might not be "perfect," but it never fails you when you need it most. (The small simulation below shows why an average score hides this difference.)
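
The point about variance can be made concrete with a tiny simulation. The distributions below are invented to mirror the paper's description (not its data), but they show how two models with the same average can carry very different tail risk:

```python
import random
import statistics

random.seed(42)
N = 100_000

# Invented 0-10 crisis-detection scores: the "spiky" model is usually perfect
# but occasionally catastrophic; the "steady" model stays in the 8-9 band.
spiky  = [10.0 if random.random() < 0.85 else 0.0 for _ in range(N)]
steady = [random.uniform(8.0, 9.0) for _ in range(N)]

for name, scores in (("spiky (old)", spiky), ("steady (new)", steady)):
    catastrophic = sum(s <= 2 for s in scores) / len(scores)
    print(f"{name}: mean {statistics.mean(scores):.2f}, "
          f"stdev {statistics.stdev(scores):.2f}, "
          f"P(score <= 2) = {catastrophic:.1%}")
# Both means land near 8.5, but only the spiky model ever produces the
# catastrophic "0" that matters most to a user in crisis.
```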

The "coldness" people feel is actually the absence of risk. The new model is safer because it's less likely to make a wild, unpredictable emotional guess. But to a human, "predictable" feels like "boring."

The Bottom Line

The paper concludes that the public narrative is wrong. The AI didn't lose its heart; it just changed its safety posture.

  • What users feel: "It's less empathetic."
  • What the data says: "It's less unpredictable."

The new models are better at saving lives (detecting crises) but worse at knowing when to shut up (advice safety). The old models were better at keeping boundaries but worse at spotting trouble.

The Lesson: We need to stop judging AI by how "human" it feels in a chat. Instead, we need to measure how safe it is. A model that feels a little less "soulful" might actually be the one that saves a life.