Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models

This paper demonstrates that large language models exhibit confirmation bias during rule-discovery tasks, leading to inefficient hypothesis testing, and shows that this limitation can be mitigated through targeted prompting and behavioral distillation, improving their reasoning performance.

Ayush Rajesh Jhaveri, Anthony GX-Chen, Ilia Sucholutsky, Eunsol Choi

Published 2026-04-06

The Big Idea: The "Yes-Man" Problem

Imagine you are a detective trying to solve a mystery. You have a hunch about who the culprit is.

  • A smart detective asks: "If I'm right, what evidence would prove me wrong?" They try to break their own theory to see if it holds up.
  • A biased detective (suffering from confirmation bias) only asks questions that make them look right. They ignore clues that might prove them wrong. They keep asking, "Does this clue fit my theory?" instead of "Does this clue destroy my theory?"

This paper asks: Do AI language models (LLMs) act like that biased detective?

The answer is yes. When these AIs try to figure out a hidden rule, they tend to be "yes-men." They keep testing ideas that support their current guess, rather than trying to break it. This makes them slower and less accurate at solving problems.


The Experiment: The "Number Guessing Game"

To test this, the researchers adapted a classic psychology game called the Wason 2-4-6 Task.

The Setup:

  1. The AI is shown three numbers, like [2, 4, 6].
  2. The AI is told: "These numbers follow a secret rule."
  3. The AI's job is to guess the rule.
  4. The AI can propose new sets of numbers (e.g., [8, 10, 12]) and the computer says "Yes" (it fits the rule) or "No" (it doesn't).
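The setup above can be sketched in a few lines of Python. This is not the authors' code; the hidden rule and the `oracle` function are assumptions chosen to match the classic Wason 2-4-6 task:

```python
# Minimal sketch of the guessing game (an illustration, not the
# paper's implementation). The secret rule here is assumed to be
# "strictly increasing numbers," as in the classic Wason task.

def oracle(triple):
    """Secret rule: the three numbers are strictly increasing."""
    a, b, c = triple
    return a < b < c

# The model proposes test triples and observes Yes/No feedback.
for guess in [[2, 4, 6], [8, 10, 12], [1, 2, 3], [3, 2, 1]]:
    print(guess, "->", "Yes" if oracle(guess) else "No")
```

The model's whole job is to choose which triples to feed the oracle so that each answer narrows down the rule as much as possible.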

The Trap:
The hidden rule might be simple, like "increasing numbers." But the AI might guess, "They are all even numbers."

  • The Biased AI: Tests [2, 4, 6] (Yes), [8, 10, 12] (Yes), [20, 22, 24] (Yes). It keeps getting "Yes" and feels confident. It never tries a set like [1, 2, 3] (which is increasing but not even) to see if its "even numbers" theory is wrong.
  • The Smart AI: Tests [1, 2, 3]. The computer says "Yes." The AI realizes, "Oh! It's not just even numbers; it's just increasing numbers!" It found the truth by trying to break its own theory.
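The contrast between the two strategies can be made precise. In this sketch (an illustration, not the paper's code), every confirming test gets "Yes" under both the guessed rule and the true rule, so it reveals nothing; the "opposite" probe is the only one that can separate them:

```python
# Why confirming tests fail: they cannot distinguish the model's
# guess from the true rule. Hypotheses below are assumptions chosen
# to mirror the example in the text.

def is_even_rule(t):      # the model's current guess: "all even"
    return all(n % 2 == 0 for n in t)

def is_increasing(t):     # the true hidden rule
    a, b, c = t
    return a < b < c

confirming = [[2, 4, 6], [8, 10, 12], [20, 22, 24]]
probe = [1, 2, 3]         # increasing, but not all even

# Each confirming test gets the same verdict under BOTH hypotheses,
# so observing "Yes" carries zero information about which is right.
assert all(is_even_rule(t) == is_increasing(t) for t in confirming)

# The falsifying probe gets different verdicts, so one observation
# is enough to kill the wrong theory.
assert is_even_rule(probe) != is_increasing(probe)
```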

The Result:
The researchers tested 11 different AI models. They found that most AIs were terrible at this. They kept testing "even numbers" over and over, getting stuck in a loop of confirming their wrong guesses. They only solved the puzzle about 42% of the time.


The Fix: Teaching the AI to "Think in Opposites"

Since humans also get stuck in this bias, psychologists have developed tricks to help us think better. The researchers tried two of these tricks on the AIs:

  1. Think-in-Opposites: The AI is told: "Look at your last guess. Now, deliberately create a test that is the opposite of what you think."
    • Analogy: If you think "All birds can fly," don't test a sparrow. Test a penguin. If the penguin flies, your rule is wrong. If it doesn't, you learn something new.
  2. Dual-Goal: The AI is told to guess two rules at once: The rule for "Yes" (DAX) and the rule for "No" (MED).
    • Analogy: Instead of just looking for the key that opens the door, you also look for the key that locks it. This forces you to look at both sides of the coin.
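The two interventions are just extra instructions added to the model's prompt. The templates below are paraphrases of the ideas described above, not the paper's exact wording:

```python
# Illustrative prompt templates for the two debiasing strategies.
# Wording is a paraphrase (an assumption), not the paper's prompts.

THINK_IN_OPPOSITES = (
    "You previously hypothesized: {hypothesis}.\n"
    "Now deliberately propose a test triple that your hypothesis "
    "predicts should NOT fit the secret rule."
)

DUAL_GOAL = (
    "There are two complementary rules: DAX (triples the oracle "
    "accepts) and MED (triples it rejects).\n"
    "Propose tests that help you identify BOTH rules at once."
)

print(THINK_IN_OPPOSITES.format(hypothesis="all numbers are even"))
```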

The Result:
When the researchers gave the AIs these instructions (prompts), the AIs got much smarter.

  • They started testing "opposite" ideas.
  • Their success rate jumped from 42% to 56%.
  • They solved the puzzles faster.

The "Magic Potion": Distillation

There was a catch. The AIs only got smarter when the researchers reminded them to think in opposites every single time. If you took away the instruction, the AI went back to being a "yes-man."

The researchers wanted the AI to internalize this skill, so it would be smart even without the reminder. They used a technique called Knowledge Distillation.

  • The Analogy: Imagine a master chef (the "Teacher" AI) who knows how to cook perfectly because they follow a strict recipe book (the "Think in Opposites" prompt). The researchers recorded every move the master chef made. Then, they fed those recordings to a junior chef (the "Student" AI) and trained them to copy the master's moves exactly.
  • The Outcome: The junior chef learned the habit of checking for opposites. Even without the recipe book, the junior chef started cooking better.
  • The Bonus: This new skill wasn't just for number games. When they tested these trained AIs on a completely different game (the "Blicket Test," which involves figuring out which toy blocks turn on a machine), the AIs were still better at solving it! They had learned a general way of thinking, not just a specific trick for numbers.
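The recipe above can be sketched as a small data-collection loop. `StubTeacher` and the example format are placeholders (the paper's actual models and training pipeline are not shown); the key idea is that the scaffold prompt is used to *generate* the teacher's moves but stripped from the student's training input:

```python
# Rough, runnable sketch of behavioral distillation as described
# above. StubTeacher and the data format are assumptions standing in
# for a real LLM and fine-tuning pipeline.

class StubTeacher:
    """Stand-in for the Teacher AI playing with the scaffold prompt."""
    def play(self, task, extra_prompt):
        # A real teacher would be an LLM; here we just echo a move.
        return f"Test the opposite of '{task}'"

def build_distillation_set(teacher, tasks, scaffold_prompt):
    examples = []
    for task in tasks:
        transcript = teacher.play(task, extra_prompt=scaffold_prompt)
        # The scaffold prompt is deliberately left OUT of the training
        # input, so the Student must learn to reproduce the falsifying
        # behavior without being reminded to.
        examples.append({"input": task, "target": transcript})
    return examples

data = build_distillation_set(
    StubTeacher(),
    tasks=["all even numbers"],
    scaffold_prompt="Think in opposites.",
)
print(data[0]["target"])
```

Fine-tuning the student on these (input, target) pairs is what turns the external reminder into an internalized habit.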

Why This Matters

This paper shows that AI isn't just "smart" or "dumb"; it has human-like flaws. Just like humans, AIs can get lazy and only look for evidence that makes them feel right.

But the good news is that we can fix this. By teaching AIs to challenge their own ideas (falsification) and training them to keep that habit, we can make them better at science, logic, and discovery. We are essentially teaching them to be better detectives.
