The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

This empirical study demonstrates that large language models exhibit a Dunning-Kruger-like cognitive bias, where poorly performing models display significantly higher overconfidence and worse calibration than their more accurate counterparts.

Sudipta Ghosh, Mrityunjoy Panday

Published Thu, 12 Ma

Imagine you are hiring a team of experts to solve a series of tricky puzzles. You ask each expert to solve a puzzle and then tell you, "How sure are you that your answer is right?" on a scale of 0 to 100.

This paper is like a report card on four different AI "experts" (Large Language Models) to see if they are honest about how smart they actually are. The researchers discovered something funny and slightly scary: The AI models that are the worst at solving the puzzles are the ones who are the most arrogant about their answers.

Here is the breakdown using simple analogies:

1. The "Dunning-Kruger" Effect (The Clueless Confident)

You've probably heard of the "Dunning-Kruger effect." It's a psychological term for when someone knows very little about a subject but is 100% convinced they are a genius. They don't know enough to realize they are wrong.

The researchers found that AI models do the exact same thing.

  • The "Kimi K2" Model: This AI was like a student who failed a math test with a score of 23%, but when asked, "How sure are you?" it shouted, "I'm 95% sure I'm right!" It was confidently wrong.
  • The "Claude Haiku 4.5" Model: This AI was like a wise professor. It got 75% of the answers right, but more importantly, it knew when it was guessing. If it was unsure, it said, "I'm only 60% sure." If it was sure, it said, "I'm 90% sure."

2. The Four Contestants

The study tested four different AI models on 24,000 different questions (like trivia, science, and logic puzzles).

  • Kimi K2 (The Arrogant Novice): It got the lowest score (23.3%) but had the highest confidence (95.7%). It was so overconfident that its "Calibration Error" (a score measuring how far its stated confidence strays from its actual accuracy) was terrible. It's like a driver who drives very poorly but insists they are the best driver in the world.
  • Gemini 2.5 Pro (The High-Achieving Robot): This one got the highest score (80.9%), but it was a bit rigid. It was almost always 99% sure, even when it made mistakes. It's like a confident friend who is usually right but never admits when they might be wrong.
  • Gemini 2.5 Flash (The Fast & Confident): Similar to the Pro version, it was fast and very confident, but slightly less accurate.
  • Claude Haiku 4.5 (The Honest Expert): This was the star of the show. It didn't just get the most answers right; it was the most honest. It adjusted its confidence based on how hard the question was. Sometimes it was humble ("I'm not sure"), and sometimes it was confident. This is exactly what we want from an AI.
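The numbers above translate directly into an "overconfidence gap": mean stated confidence minus actual accuracy. Here is a minimal sketch using the headline figures quoted in this article; the study reports no single average confidence for Claude Haiku 4.5 (its confidence varies per question), so that value below is an illustrative assumption:

```python
# Overconfidence gap = mean stated confidence - actual accuracy.
# Accuracy and confidence figures for Kimi K2 and Gemini 2.5 Pro are the
# ones quoted in the article; Claude's mean confidence is an assumption
# made for illustration, since the article only says it varies (60-90%).
models = {
    "Kimi K2":          {"accuracy": 0.233, "confidence": 0.957},
    "Gemini 2.5 Pro":   {"accuracy": 0.809, "confidence": 0.99},
    "Claude Haiku 4.5": {"accuracy": 0.75,  "confidence": 0.77},  # assumed
}

for name, m in models.items():
    gap = m["confidence"] - m["accuracy"]
    print(f"{name:18s} overconfidence gap: {gap:+.1%}")
```

A well-calibrated model keeps this gap near zero. Kimi K2's gap of roughly +72 percentage points is the Dunning-Kruger signature the paper describes: the least capable model is also the furthest from knowing its own limits.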

3. Why Does This Matter? (The "High-Stakes" Problem)

Why should you care if an AI is overconfident?

Imagine you are a doctor using an AI to diagnose a patient.

  • The Honest AI (Claude): Says, "I think this is a broken leg, but I'm only 60% sure. You should get an X-ray to be safe." -> Safe.
  • The Arrogant AI (Kimi): Says, "This is definitely a broken leg, and I am 99% sure!" (Even though it's actually a sprain). -> Dangerous.

If the AI is overconfident, you might trust it blindly and make a terrible decision. The paper warns that the "dumbest" AIs are the most dangerous because they lie to you by sounding so sure.

4. The Big Takeaway

The main lesson from this paper is: Don't just look at how many questions an AI gets right. Look at how honest it is about its mistakes.

  • Bad Calibration: An AI that gets 20% right but says "I'm 100% sure" is a liar.
  • Good Calibration: An AI that gets 75% right and says "I'm 75% sure" is a reliable partner.
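Calibration like this is usually scored with Expected Calibration Error (ECE): group answers into bins by stated confidence, then compare each bin's average confidence to its actual accuracy. A minimal sketch of the idea (the binning scheme and the toy data here are illustrative assumptions, not the paper's actual setup):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |avg confidence - accuracy| per confidence bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are half-open (lo, hi]; bin 0 also catches confidence == 0.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - accuracy)
    return ece

# A toy "arrogant" model: always 95% sure, right only 20% of the time.
confs = [0.95] * 10
right = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(expected_calibration_error(confs, right))  # large error, ~0.75
```

An AI that gets 75% right while claiming 75% confidence scores near zero on this metric; the toy model above scores about 0.75, which is the "liar" profile from the bullet list.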

The researchers concluded that for AI to be safe to use in real life (like in hospitals or courts), we need to stop just measuring "accuracy" and start measuring "honesty." We need to pick the AI that knows what it doesn't know, rather than the one that thinks it knows everything.