The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

This empirical study demonstrates that large language models exhibit a Dunning-Kruger-like cognitive bias, where poorly performing models display significantly higher overconfidence and worse calibration than their more accurate counterparts.

Sudipta Ghosh, Mrityunjoy Panday

Published Thu, 12 Ma

Imagine you are hiring a team of experts to solve a series of tricky puzzles. You ask each expert to solve a puzzle and then tell you, "How sure are you that your answer is right?" on a scale of 0 to 100.

This paper is like a report card on four different AI "experts" (Large Language Models) to see if they are honest about how smart they actually are. The researchers discovered something funny and slightly scary: The AI models that are the worst at solving the puzzles are the ones who are the most arrogant about their answers.

Here is the breakdown using simple analogies:

1. The "Dunning-Kruger" Effect (The Clueless Confident)

You've probably heard of the "Dunning-Kruger effect." It's a psychological term for when someone knows very little about a subject but is 100% convinced they are a genius. They don't know enough to realize they are wrong.

The researchers found that AI models do the exact same thing.

  • The "Kimi K2" Model: This AI was like a student who failed a math test with a score of 23%, but when asked, "How sure are you?" it shouted, "I'm 95% sure I'm right!" It was confidently wrong.
  • The "Claude Haiku 4.5" Model: This AI was like a wise professor. It got 75% of the answers right, but more importantly, it knew when it was guessing. If it was unsure, it said, "I'm only 60% sure." If it was sure, it said, "I'm 90% sure."

2. The Four Contestants

The study tested four different AI models on 24,000 different questions (like trivia, science, and logic puzzles).

  • Kimi K2 (The Arrogant Novice): It got the lowest score (23.3%) but had the highest confidence (95.7%). It was so overconfident that its "Calibration Error" (a score measuring how far its stated confidence strays from its actual accuracy) was terrible. It's like a driver who drives very poorly but insists they are the best driver in the world.
  • Gemini 2.5 Pro (The High-Achieving Robot): This one got the highest score (80.9%), but it was a bit rigid. It was almost always 99% sure, even when it made mistakes. It's like a confident friend who is usually right but never admits when they might be wrong.
  • Gemini 2.5 Flash (The Fast & Confident): Similar to the Pro version, it was fast and very confident, but slightly less accurate.
  • Claude Haiku 4.5 (The Honest Expert): This was the star of the show. It didn't just get the most answers right; it was the most honest. It adjusted its confidence based on how hard the question was. Sometimes it was humble ("I'm not sure"), and sometimes it was confident. This is exactly what we want from an AI.
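The numbers above translate directly into an "overconfidence gap": mean stated confidence minus actual accuracy. Here is a minimal sketch using the headline figures quoted in this article; the study reports no single average confidence for Claude Haiku 4.5 (its confidence varies per question), so that value below is an illustrative assumption:

```python
# Overconfidence gap = mean stated confidence - actual accuracy.
# Accuracy and confidence figures for Kimi K2 and Gemini 2.5 Pro are the
# ones quoted in the article; Claude's mean confidence is an assumption
# made for illustration, since the article only says it varies (60-90%).
models = {
    "Kimi K2":          {"accuracy": 0.233, "confidence": 0.957},
    "Gemini 2.5 Pro":   {"accuracy": 0.809, "confidence": 0.99},
    "Claude Haiku 4.5": {"accuracy": 0.75,  "confidence": 0.77},  # assumed
}

for name, m in models.items():
    gap = m["confidence"] - m["accuracy"]
    print(f"{name:18s} overconfidence gap: {gap:+.1%}")
```

A well-calibrated model keeps this gap near zero. Kimi K2's gap of roughly +72 percentage points is the Dunning-Kruger signature the paper describes: the least capable model is also the furthest from knowing its own limits.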

3. Why Does This Matter? (The "High-Stakes" Problem)

Why should you care if an AI is overconfident?

Imagine you are a doctor using an AI to diagnose a patient.

  • The Honest AI (Claude): Says, "I think this is a broken leg, but I'm only 60% sure. You should get an X-ray to be safe." -> Safe.
  • The Arrogant AI (Kimi): Says, "This is definitely a broken leg, and I am 99% sure!" (Even though it's actually a sprain). -> Dangerous.

If the AI is overconfident, you might trust it blindly and make a terrible decision. The paper warns that the "dumbest" AIs are the most dangerous because they lie to you by sounding so sure.

4. The Big Takeaway

The main lesson from this paper is: Don't just look at how many questions an AI gets right. Look at how honest it is about its mistakes.

  • Bad Calibration: An AI that gets 20% right but says "I'm 100% sure" is a liar.
  • Good Calibration: An AI that gets 75% right and says "I'm 75% sure" is a reliable partner.
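Calibration like this is usually scored with Expected Calibration Error (ECE): group answers into bins by stated confidence, then compare each bin's average confidence to its actual accuracy. A minimal sketch of the idea (the binning scheme and the toy data here are illustrative assumptions, not the paper's actual setup):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |avg confidence - accuracy| per confidence bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are half-open (lo, hi]; bin 0 also catches confidence == 0.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - accuracy)
    return ece

# A toy "arrogant" model: always 95% sure, right only 20% of the time.
confs = [0.95] * 10
right = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(expected_calibration_error(confs, right))  # large error, ~0.75
```

An AI that gets 75% right while claiming 75% confidence scores near zero on this metric; the toy model above scores about 0.75, which is the "liar" profile from the bullet list.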

The researchers concluded that for AI to be safe to use in real life (like in hospitals or courts), we need to stop just measuring "accuracy" and start measuring "honesty." We need to pick the AI that knows what it doesn't know, rather than the one that thinks it knows everything.