Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models

This paper demonstrates that large language models exhibit confirmation bias during rule-discovery tasks, leading to inefficient hypothesis testing, and shows that this limitation can be mitigated through targeted prompting and behavioral distillation, improving their reasoning performance.

Ayush Rajesh Jhaveri, Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi

Published 2026-04-06

The Big Idea: The "Yes-Man" Problem

Imagine you are a detective trying to solve a mystery. You have a hunch about who the culprit is.

  • A smart detective asks: "If I'm right, what evidence would prove me wrong?" They try to break their own theory to see if it holds up.
  • A biased detective (suffering from confirmation bias) only asks questions that make them look right. They ignore clues that might prove them wrong. They keep asking, "Does this clue fit my theory?" instead of "Does this clue destroy my theory?"

This paper asks: Do AI language models (LLMs) act like that biased detective?

The answer is yes. When these AIs try to figure out a hidden rule, they tend to be "yes-men." They keep testing ideas that support their current guess, rather than trying to break it. This makes them slower and less accurate at solving problems.


The Experiment: The "Number Guessing Game"

To test this, the researchers adapted a classic psychology game called the Wason 2-4-6 Task.

The Setup:

  1. The AI is shown three numbers, like [2, 4, 6].
  2. The AI is told: "These numbers follow a secret rule."
  3. The AI's job is to guess the rule.
  4. The AI can propose new sets of numbers (e.g., [8, 10, 12]) and the computer says "Yes" (it fits the rule) or "No" (it doesn't).
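The setup above can be sketched in a few lines of Python. This is not the authors' code; the hidden rule and the `oracle` function are assumptions chosen to match the classic Wason 2-4-6 task:

```python
# Minimal sketch of the guessing game (an illustration, not the
# paper's implementation). The secret rule here is assumed to be
# "strictly increasing numbers," as in the classic Wason task.

def oracle(triple):
    """Secret rule: the three numbers are strictly increasing."""
    a, b, c = triple
    return a < b < c

# The model proposes test triples and observes Yes/No feedback.
for guess in [[2, 4, 6], [8, 10, 12], [1, 2, 3], [3, 2, 1]]:
    print(guess, "->", "Yes" if oracle(guess) else "No")
```

The model's whole job is to choose which triples to feed the oracle so that each answer narrows down the rule as much as possible.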

The Trap:
The hidden rule might be simple, like "increasing numbers." But the AI might guess, "They are all even numbers."

  • The Biased AI: Tests [2, 4, 6] (Yes), [8, 10, 12] (Yes), [20, 22, 24] (Yes). It keeps getting "Yes" and feels confident. It never tries a set like [1, 2, 3] (which is increasing but not even) to see if its "even numbers" theory is wrong.
  • The Smart AI: Tests [1, 2, 3]. The computer says "Yes." The AI realizes, "Oh! It's not just even numbers; it's just increasing numbers!" It found the truth by trying to break its own theory.
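The contrast between the two strategies can be made precise. In this sketch (an illustration, not the paper's code), every confirming test gets "Yes" under both the guessed rule and the true rule, so it reveals nothing; the "opposite" probe is the only one that can separate them:

```python
# Why confirming tests fail: they cannot distinguish the model's
# guess from the true rule. Hypotheses below are assumptions chosen
# to mirror the example in the text.

def is_even_rule(t):      # the model's current guess: "all even"
    return all(n % 2 == 0 for n in t)

def is_increasing(t):     # the true hidden rule
    a, b, c = t
    return a < b < c

confirming = [[2, 4, 6], [8, 10, 12], [20, 22, 24]]
probe = [1, 2, 3]         # increasing, but not all even

# Each confirming test gets the same verdict under BOTH hypotheses,
# so observing "Yes" carries zero information about which is right.
assert all(is_even_rule(t) == is_increasing(t) for t in confirming)

# The falsifying probe gets different verdicts, so one observation
# is enough to kill the wrong theory.
assert is_even_rule(probe) != is_increasing(probe)
```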

The Result:
The researchers tested 11 different AI models. They found that most AIs were terrible at this. They kept testing "even numbers" over and over, getting stuck in a loop of confirming their wrong guesses. They only solved the puzzle about 42% of the time.


The Fix: Teaching the AI to "Think in Opposites"

Since humans also get stuck in this bias, psychologists have developed tricks to help us think better. The researchers tried two of these tricks on the AIs:

  1. Think-in-Opposites: The AI is told: "Look at your last guess. Now, deliberately create a test that is the opposite of what you think."
    • Analogy: If you think "All birds can fly," don't test a sparrow. Test a penguin. If the penguin flies, your rule is wrong. If it doesn't, you learn something new.
  2. Dual-Goal: The AI is told to guess two rules at once: The rule for "Yes" (DAX) and the rule for "No" (MED).
    • Analogy: Instead of just looking for the key that opens the door, you also look for the key that locks it. This forces you to look at both sides of the coin.
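The two interventions are just extra instructions added to the model's prompt. The templates below are paraphrases of the ideas described above, not the paper's exact wording:

```python
# Illustrative prompt templates for the two debiasing strategies.
# Wording is a paraphrase (an assumption), not the paper's prompts.

THINK_IN_OPPOSITES = (
    "You previously hypothesized: {hypothesis}.\n"
    "Now deliberately propose a test triple that your hypothesis "
    "predicts should NOT fit the secret rule."
)

DUAL_GOAL = (
    "There are two complementary rules: DAX (triples the oracle "
    "accepts) and MED (triples it rejects).\n"
    "Propose tests that help you identify BOTH rules at once."
)

print(THINK_IN_OPPOSITES.format(hypothesis="all numbers are even"))
```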

The Result:
When the researchers gave the AIs these instructions (prompts), the AIs got much smarter.

  • They started testing "opposite" ideas.
  • Their success rate jumped from 42% to 56%.
  • They solved the puzzles faster.

The "Magic Potion": Distillation

There was a catch. The AIs only got smarter when the researchers reminded them to think in opposites every single time. If you took away the instruction, the AI went back to being a "yes-man."

The researchers wanted the AI to internalize this skill, so it would be smart even without the reminder. They used a technique called Knowledge Distillation.

  • The Analogy: Imagine a master chef (the "Teacher" AI) who knows how to cook perfectly because they follow a strict recipe book (the "Think in Opposites" prompt). The researchers recorded every move the master chef made. Then, they fed those recordings to a junior chef (the "Student" AI) and trained them to copy the master's moves exactly.
  • The Outcome: The junior chef learned the habit of checking for opposites. Even without the recipe book, the junior chef started cooking better.
  • The Bonus: This new skill wasn't just for number games. When they tested these trained AIs on a completely different game (the "Blicket Test," which involves figuring out which toy blocks turn on a machine), the AIs were still better at solving it! They had learned a general way of thinking, not just a specific trick for numbers.
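The recipe above can be sketched as a small data-collection loop. `StubTeacher` and the example format are placeholders (the paper's actual models and training pipeline are not shown); the key idea is that the scaffold prompt is used to *generate* the teacher's moves but stripped from the student's training input:

```python
# Rough, runnable sketch of behavioral distillation as described
# above. StubTeacher and the data format are assumptions standing in
# for a real LLM and fine-tuning pipeline.

class StubTeacher:
    """Stand-in for the Teacher AI playing with the scaffold prompt."""
    def play(self, task, extra_prompt):
        # A real teacher would be an LLM; here we just echo a move.
        return f"Test the opposite of '{task}'"

def build_distillation_set(teacher, tasks, scaffold_prompt):
    examples = []
    for task in tasks:
        transcript = teacher.play(task, extra_prompt=scaffold_prompt)
        # The scaffold prompt is deliberately left OUT of the training
        # input, so the Student must learn to reproduce the falsifying
        # behavior without being reminded to.
        examples.append({"input": task, "target": transcript})
    return examples

data = build_distillation_set(
    StubTeacher(),
    tasks=["all even numbers"],
    scaffold_prompt="Think in opposites.",
)
print(data[0]["target"])
```

Fine-tuning the student on these (input, target) pairs is what turns the external reminder into an internalized habit.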

Why This Matters

This paper shows that AI isn't just "smart" or "dumb"; it has human-like flaws. Just like humans, AIs can get lazy and only look for evidence that makes them feel right.

But the good news is that we can fix this. By teaching AIs to challenge their own ideas (falsification) and training them to keep that habit, we can make them better at science, logic, and discovery. We are essentially teaching them to be better detectives.
