Imagine you are hiring a team of new assistants (Large Language Models, or LLMs) to help you solve mysteries. You want to know: Do these assistants think like human detectives, or do they follow a rigid rulebook? And more importantly, can they handle it when the case file gets messy or confusing?
This paper, presented at a 2026 AI workshop, puts over 20 different AI models through a "causal reasoning" test to see how they compare to human intuition. Here is the breakdown in simple terms.
1. The Setup: The "Common Effect" Mystery
The researchers used a classic causal-reasoning setup called a Collider, in which two independent causes share a single common effect.
- The Analogy: Imagine a car that won't start (the Effect).
- The Causes: It could be a dead battery (Cause A) OR an empty gas tank (Cause B).
- The Evidence: You know the battery is dead, and you know the car won't start.
- The Question: How likely is it that the gas tank is also empty?
In a perfectly logical analysis, the two causes start out independent, and the dead battery already accounts for the car not starting, so learning about the battery should leave your belief that the gas tank is also empty at its ordinary base rate. Human brains, however, are messy. People often think, "Well, if the battery is dead, maybe the gas tank is fine," or, conversely, "If the battery is dead, maybe the whole car is junk, so the gas tank is probably empty too." Humans make intuitive leaps and assumptions about "hidden factors" (like the car being old).
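To pin down that benchmark, here is a minimal sketch in Python. The 10% base rates and the "either failure is enough to stall the car" rule are illustrative assumptions, not the paper's actual parameters; the point is the pattern: one cause by itself tells you nothing about the other (the Markov property), the stall raises suspicion about both causes, and confirming the dead battery "explains away" the stall, dropping the tank back to its base rate.

```python
from itertools import product

# Toy collider (common-effect) model. The numbers are illustrative assumptions,
# not taken from the paper: each cause has a 10% base rate, and the car fails
# to start whenever either cause is present (a simplified, deterministic OR).
P_BATTERY_DEAD = 0.10
P_TANK_EMPTY = 0.10

def joint(battery, tank):
    """Probability of one (battery, tank) combination; the causes are independent."""
    pb = P_BATTERY_DEAD if battery else 1 - P_BATTERY_DEAD
    pt = P_TANK_EMPTY if tank else 1 - P_TANK_EMPTY
    return pb * pt

def prob_tank_empty(given):
    """P(tank empty | given), where `given` tests (battery, tank, wont_start)."""
    num = den = 0.0
    for battery, tank in product([0, 1], repeat=2):
        wont_start = battery or tank  # deterministic-OR assumption
        p = joint(battery, tank)
        if given(battery, tank, wont_start):
            den += p
            num += p * tank
    return num / den

print(prob_tank_empty(lambda b, t, e: True))          # prior:                  0.10
print(prob_tank_empty(lambda b, t, e: b == 1))        # battery dead only:      0.10 (Markov)
print(prob_tank_empty(lambda b, t, e: e == 1))        # car won't start:       ~0.53
print(prob_tank_empty(lambda b, t, e: e == 1 and b))  # stall + dead battery:   0.10 (explained away)
```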
2. The Big Discovery: Robots vs. Humans
The study found a fascinating split between how humans and AI think:
- Humans are "Open-World" Thinkers: When humans solve these puzzles, they assume the problem leaves things out. They think, "Maybe the car is old, maybe the mechanic is bad." They are flexible but prone to biases, like assuming one bad thing implies another; a toy sketch of how a hidden factor creates exactly that linkage appears after this list.
- LLMs are "Strict Rule-Followers": The AI models acted like a computer program reading a manual. If the prompt said "Battery is dead," the AI calculated the odds based only on that. They didn't invent hidden background factors.
- The Result: The AI was actually more logically consistent than humans in this specific test. It didn't fall for the same "gut feeling" traps humans do.
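To see why the "open-world" move matters, here is a toy sketch (my own illustration, not the paper's model) in which a hidden factor such as "the car is old" raises the base rate of both failures. Once that latent factor exists, the two causes are no longer independent: learning the battery is dead really does make an empty tank more likely.

```python
from itertools import product

# Hypothetical numbers: a 50/50 chance the car is old, with each failure having
# a 5% base rate in a newer car and a 30% base rate in an old one.
P_OLD = 0.5
P_FAIL = {0: 0.05, 1: 0.30}  # P(failure | car old?)

def joint(old, battery, tank):
    p_old = P_OLD if old else 1 - P_OLD
    p_b = P_FAIL[old] if battery else 1 - P_FAIL[old]
    p_t = P_FAIL[old] if tank else 1 - P_FAIL[old]
    return p_old * p_b * p_t

def p_tank(condition):
    """P(tank empty | condition on the battery), by enumeration."""
    num = den = 0.0
    for old, battery, tank in product([0, 1], repeat=3):
        p = joint(old, battery, tank)
        if condition(battery):
            den += p
            num += p * tank
    return num / den

print(p_tank(lambda b: True))    # P(tank empty)                 ~0.175
print(p_tank(lambda b: b == 1))  # P(tank empty | battery dead)  ~0.264
```

Drop the hidden "old car" factor and the two numbers match again, which is the strict rule-following behavior the LLMs showed.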
3. The "Chain of Thought" Superpower
The researchers tested two ways of asking the AI questions:
- Direct: "What's the answer?"
- Chain-of-Thought (CoT): "Think step-by-step before answering."
The Metaphor: Think of Direct prompting as asking a student to shout out an answer instantly. Chain-of-Thought is asking them to show their work on a whiteboard first.
- The Finding: When the AI was forced to "show its work" (CoT), it became even more logical and robust. It handled messy information much better. It was like giving the AI a moment to calm down and focus, which made it less likely to get distracted by irrelevant details.
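Roughly, the two prompting styles differ only in the final instruction. The wording below is illustrative, not the paper's exact template:

```python
# A rough sketch of the two prompting styles; the phrasing is my own.
SCENARIO = (
    "A car won't start. A dead battery or an empty gas tank can each cause this. "
    "You learn that the battery is dead and that the car won't start."
)
QUESTION = "On a scale of 0-100, how likely is it that the gas tank is also empty?"

def direct_prompt(scenario: str, question: str) -> str:
    # "Shout out an answer": ask for the number immediately.
    return f"{scenario}\n{question}\nAnswer with a single number."

def cot_prompt(scenario: str, question: str) -> str:
    # "Show your work on the whiteboard first": ask for reasoning before the number.
    return (
        f"{scenario}\n{question}\n"
        "Think step by step about the causal structure, then give a single number."
    )

print(direct_prompt(SCENARIO, QUESTION))
print(cot_prompt(SCENARIO, QUESTION))
```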
4. The Stress Test: Noise and Nonsense
The researchers tried to trick the AI by:
- Abstracting: Replacing words like "Battery" and "Gas" with random gibberish like "X-7" and "Y-9."
- Overloading: Adding a huge block of irrelevant text (like a recipe for soup) right in the middle of the question to distract the AI.
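Here is a toy illustration (my own sketch, not the paper's code) of what those two perturbations might look like when applied to the car prompt:

```python
# Swap meaningful labels for abstract symbols, and inject irrelevant filler text
# into the middle of the prompt. Labels and filler are illustrative.
ABSTRACT_LABELS = {
    "dead battery": "factor X-7",
    "empty gas tank": "factor Y-9",
    "car won't start": "outcome Z-3",
}

DISTRACTOR = (
    "By the way, to make soup: chop onions, simmer stock for an hour, "
    "season to taste. "
)

def abstract(prompt: str) -> str:
    """Replace concrete causal labels with arbitrary symbols."""
    for concrete, label in ABSTRACT_LABELS.items():
        prompt = prompt.replace(concrete, label)
    return prompt

def overload(prompt: str, filler: str = DISTRACTOR, copies: int = 5) -> str:
    """Drop a block of irrelevant text into the middle of the prompt."""
    cut = prompt.rfind(" ", 0, len(prompt) // 2) + 1  # split at a word boundary
    return prompt[:cut] + filler * copies + prompt[cut:]

base = ("The car won't start. Either the dead battery or the empty gas tank "
        "could be the reason. The battery is dead. How likely is the empty gas tank?")
print(abstract(base))
print(overload(base))
```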
The Results:
- Older/Smaller Models: These got confused easily. When the words were abstract or the text was noisy, their logic fell apart. They were like a student who can't solve a math problem if the numbers are written in a weird font.
- Newer/Larger Models (e.g., Gemini-2.5-pro): These were incredibly tough. They solved the logic puzzle correctly even when the words were nonsense or the prompt was full of distractions. They were like a master detective who can solve a case even if the suspect is speaking in riddles.
5. The "Bias" Surprise
Usually, we worry that AI trained on human data will copy human mistakes.
- The Surprise: Humans lean on a pattern called "Explaining Away" (once one cause is confirmed, they discount the other, sometimes to the point of ignoring it). Humans also violate the probability rules through Markov violations, letting one cause shift their belief about another even when nothing connects them.
- The AI: Most AI models did not copy these human biases. They were actually better at following the strict rules of probability than humans were. They didn't get "distracted" by the idea that one cause might cancel out the other.
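Operationally, both checks can be read straight off a set of probability judgments. The sketch below is my own framing (assuming the usual noisy-OR reading of the collider), not the paper's scoring code:

```python
# Each argument is a judged probability that the gas tank is empty, under
# progressively more evidence about the car scenario.
def score_judgments(p_tank, p_tank_given_battery, p_tank_given_stall,
                    p_tank_given_both, tol=0.02):
    # Markov check: with no effect observed, learning one cause should not move
    # the judgment about the other cause at all.
    markov_violation = abs(p_tank_given_battery - p_tank) > tol
    # Explaining-away check: once the dead battery is confirmed on top of the
    # stall, the tank should look no MORE likely than when only the stall was known.
    explaining_away_violation = p_tank_given_both > p_tank_given_stall + tol
    return markov_violation, explaining_away_violation

# A human-like pattern: one bad thing makes the other feel more likely throughout.
print(score_judgments(0.10, 0.30, 0.50, 0.60))  # (True, True)
# A rule-following pattern: matches the normative numbers from the first sketch.
print(score_judgments(0.10, 0.10, 0.53, 0.10))  # (False, False)
```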
The Bottom Line: What Does This Mean for Us?
The Good News:
AI can be a fantastic partner for high-stakes decisions (like law or medicine) because it doesn't get tired, it doesn't have "gut feelings" that lead to errors, and it sticks to the facts provided. If you need someone to follow the rules strictly without inventing hidden variables, AI is great.
The Bad News:
Real life is messy. Sometimes, the "hidden variables" humans assume are actually real. Because AI is so strict, it might fail in situations where uncertainty is high and you need to guess about things that weren't explicitly stated.
The Takeaway:
Think of AI not as a replacement for human intuition, but as a specialized tool.
- Use Humans when you need to guess the unknown, handle ambiguity, or bring in "street smarts."
- Use AI when you need to process complex rules, ignore distractions, and avoid human emotional biases.
The paper concludes that to use AI safely, we need to understand how it thinks. It's not a human in a box; it's a very strict, very logical robot that needs to be paired with human wisdom to handle the real world.