Imagine you have a very smart, super-advanced robot librarian. This robot has read almost every book, website, and document in existence. You might think, "If it knows everything, it must be a genius at logic, right?"
This paper puts that robot to the test with a classic brain teaser called the Wason Selection Task. Think of this task as a "logic gym" where we try to see if the robot is actually thinking or just guessing based on word patterns.
Here is the breakdown of what the researchers did and what they found, using some everyday analogies.
1. The Two Types of Logic Puzzles
The researchers gave the robot two different kinds of rules to follow, like giving it two different types of gym equipment:
- The "Abstract" Workout (Descriptive Rules):
- The Rule: "If a card has an odd number on one side, the other side must have a capital letter."
- The Vibe: This is like trying to solve a puzzle made of random shapes. It's dry and abstract, with no real-world meaning; the numbers and letters are just arbitrary symbols.
- The "Social Contract" Workout (Deontic Rules):
- The Rule: "If a person spills blood, they must wear gloves."
- The Vibe: This is about rules, laws, and safety. It feels like a rule you'd see in a hospital or a workplace. It has a "should" or "must" attached to it.
The Human Secret: For decades, scientists have known that humans are terrible at the "Abstract" workout but surprisingly good at the "Social Contract" workout. We are wired to spot rule-breakers in social situations (like someone not wearing gloves when they should), but we get confused by random symbols.
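To see why the abstract version trips people up, it helps to spell out the one correct strategy: flip exactly the cards that could expose a violation, and nothing else. Here is a minimal Python sketch of that idea; the specific card values ("3", "8", "K", "b") are my own illustration, not the paper's:

```python
# A minimal sketch of the abstract Wason task. Each card has a number on
# one side and a letter on the other; you see only one side.
# Rule: "If a card has an odd number on one side, the other side must
# have a capital letter."
visible_sides = ["3", "8", "K", "b"]

def must_flip(side: str) -> bool:
    """Flip a card only if what you can see could hide a rule violation."""
    if side.isdigit():
        # An even number can't violate the rule, so only odd numbers
        # need checking (the hidden letter might be lowercase).
        return int(side) % 2 == 1
    # A capital letter can never violate the rule; a lowercase letter
    # might (the hidden number could be odd).
    return side.islower()

print([s for s in visible_sides if must_flip(s)])  # -> ['3', 'b']
```

Most people happily flip the odd number but forget the lowercase letter, and many flip the capital letter instead, even though it can't possibly break the rule.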
2. The Big Question
The researchers wanted to know: Do AI models (LLMs) have this same "human quirk"?
Do they get better at the "Social Contract" rules because they understand the meaning of the words, or do they just treat all rules the same way?
3. The Trap: Confirmation vs. Matching
To test the robot's brain, the researchers set a trap. Humans often make two specific types of mistakes:
- The "Yes-Man" Mistake (Confirmation Bias): You only look for evidence that proves the rule is right. You ignore the possibility that the rule could be broken.
- The "Word-Matcher" Mistake (Matching Bias): You ignore the logic and just pick the cards that look like the words in the rule.
- Example: If the rule says "If not A, then not B," a word-matcher sees the words "A" and "B" and picks those cards, completely ignoring the word "not."
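These two mistakes only give different answers when the rule contains a negation, which is exactly why rules like "If not A, then not B" are the revealing test cases. Below is a hedged sketch of all three strategies side by side; the four-card setup and the rule encoding are my own illustration, not the paper's format:

```python
from dataclasses import dataclass

@dataclass
class Card:
    term: str      # which symbol the card shows, e.g. "A" or "B"
    negated: bool  # True if the card shows "not A" / "not B"

def label(c: Card) -> str:
    return ("not " if c.negated else "") + c.term

# Rule "If not A, then not B", encoded as
# (antecedent term, antecedent negated?, consequent term, consequent negated?).
rule = ("A", True, "B", True)
cards = [Card("A", False), Card("A", True), Card("B", False), Card("B", True)]

def correct_picks(rule, cards):
    """Normative logic: flip the card where the antecedent holds and the
    card where the consequent fails -- the only two that can break the rule."""
    p_term, p_neg, q_term, q_neg = rule
    return [c for c in cards
            if (c.term == p_term and c.negated == p_neg)    # antecedent true
            or (c.term == q_term and c.negated != q_neg)]   # consequent false

def confirmation_picks(rule, cards):
    """The 'Yes-Man': flip the cards that could verify the rule, i.e. the
    true-antecedent and true-consequent cards."""
    p_term, p_neg, q_term, q_neg = rule
    return [c for c in cards
            if (c.term == p_term and c.negated == p_neg)
            or (c.term == q_term and c.negated == q_neg)]

def matching_picks(rule, cards):
    """The 'Word-Matcher': flip whatever cards literally show the words in
    the rule, ignoring every 'not'."""
    p_term, _, q_term, _ = rule
    return [c for c in cards if c.term in (p_term, q_term) and not c.negated]

print([label(c) for c in correct_picks(rule, cards)])       # ['not A', 'B']
print([label(c) for c in confirmation_picks(rule, cards)])  # ['not A', 'not B']
print([label(c) for c in matching_picks(rule, cards)])      # ['A', 'B']
```

Notice that on a plain rule with no negations ("If A, then B"), the Yes-Man and the Word-Matcher happen to pick the same two cards, so only the negated rules can tell the two mistakes apart.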
4. What Happened in the Experiment?
The researchers tested many different AI models (from small ones to massive ones) with these rules. Here is what they found:
A. The Robot is "Socially" Smart
Just like humans, the AI models got much better at the "Social Contract" rules (the blood/gloves ones) than the "Abstract" rules.
- Analogy: Imagine a robot that is terrible at solving a math equation written in a secret code, but suddenly becomes a detective when you ask it, "Who broke the window?" It seems the AI, like us, is tuned to understand rules about obligations and permissions.
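In the rule-checking frame, the correct strategy falls out naturally: you only inspect the potential cheaters. A toy sketch of the blood/gloves rule (the field names and the None-for-hidden convention are my own, chosen for illustration):

```python
# Each "card" is a person about whom you know exactly one fact;
# None means that side of the card is hidden.
# Rule: "If a person spills blood, they must wear gloves."
people = [
    {"spilled_blood": True,  "wearing_gloves": None},   # seen spilling blood
    {"spilled_blood": False, "wearing_gloves": None},   # seen not spilling
    {"spilled_blood": None,  "wearing_gloves": True},   # seen with gloves
    {"spilled_blood": None,  "wearing_gloves": False},  # seen without gloves
]

def worth_checking(person) -> bool:
    """Only a blood-spiller (gloves unknown) or a gloveless person
    (spilling unknown) could be breaking the rule; everyone else is
    logically irrelevant."""
    return person["spilled_blood"] is True or person["wearing_gloves"] is False

print([i for i, p in enumerate(people) if worth_checking(p)])  # -> [0, 3]
```

Logically this is the same "flip P and not-Q" answer as the abstract version; the social content just makes the right move feel obvious.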
B. The Robot is a "Word-Matcher," Not a "Yes-Man"
When the AI made mistakes, it wasn't trying to "confirm" the rule. Instead, it was falling for the Matching Bias.
- The Evidence: When the rule included a "NOT" (e.g., "If you do NOT wear a helmet..."), the AI often ignored the "NOT" and just picked the card that said "Helmet."
- Analogy: It's like a student taking a test who sees the word "Dog" in the question and immediately circles "Dog" in the answer choices, without reading the rest of the sentence. The AI is so good at recognizing patterns that it sometimes skips the logic part.
5. Why Does This Matter?
This study tells us two big things about AI:
- AI isn't a perfect logic machine yet. Even though it has read the whole internet, it still struggles with pure logic unless the problem feels like a real-world social rule. It's like a person who is great at following traffic laws but terrible at abstract math.
- AI makes human-like mistakes. The fact that AI makes the same "word-matching" errors as humans suggests that these models might be processing language in a way that is surprisingly similar to how our brains work. They aren't just calculating; they are pattern-matching, and sometimes that leads them astray.
The Bottom Line
The researchers built a new set of logic puzzles to see if AI thinks like a human. They found that AI is surprisingly human: it's great at following social rules (like "wear gloves if you spill blood") but gets tripped up by abstract logic, often falling for the same "word-matching" traps that humans do.
It turns out even the smartest robots need to learn that sometimes you have to read the whole sentence, not just the words that look familiar.