CRAwDAD: Causal Reasoning Augmentation with Dual-Agent Debate

The paper introduces CRAwDAD, a dual-agent debate framework that improves causal inference in reasoning language models through structured dialogue and adversarial critique between agents, yielding significant accuracy gains on the CLadder benchmark across all three rungs of Pearl's causal ladder.

Finn G. Vamosi, Nils D. Forkert

Published Tue, 10 Ma

Imagine you are trying to solve a very tricky logic puzzle. You might think, "If I do X, then Y happens." But then you pause and wonder, "Wait, what if I did Z instead? Would that change the outcome?"

This is how humans naturally think about cause and effect. We don't just calculate an answer; we argue with ourselves, testing different "what if" scenarios until we find the one that makes the most sense.

This paper, CRAwDAD, is about teaching computers to do the same thing. Instead of a single computer trying to solve a problem alone, the authors set up a debate club between two advanced AI models.

Here is the breakdown of the paper using simple analogies:

1. The Problem: The "Causal Parrot"

For a long time, AI models were like parrots. If you asked them, "Does smoking cause cancer?", they would say "Yes" because they heard that phrase a million times in their training data. But if you asked a weird, made-up question like, "If a blue elephant eats a red apple, does it turn green?", the parrot would get confused because it had never heard that specific sentence before.

To fix this, researchers created a test called CLadder. It's like a math test for cause-and-effect. The questions are based on strict rules (like a game of chess), not real-world facts. You can't just guess based on what you've heard; you have to actually do the logic.

2. The Solution: The "Debate Club"

The authors took two smart AI models (called Qwen3 and DeepSeek-R1) and put them in a room to debate.

  • The Setup: One model (let's call him Alex) looks at a question and gives an answer with a step-by-step explanation.
  • The Critic: The second model (let's call her Sam) reads Alex's answer and acts like a strict editor. She looks for holes in his logic. "Wait, you said X causes Y, but the rules say Z causes Y. You made a mistake!"
  • The Resolution: If they disagree, they argue back and forth. Alex might say, "Oh, you're right, I misread the rule," and change his answer. Or, Sam might say, "Actually, I was wrong, your logic holds up." They keep talking until they agree on a final answer.
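The propose–critique–revise loop described above can be sketched in a few lines of Python. This is a minimal illustration of the debate pattern, not the paper's actual implementation: the `propose` and `critique` functions stand in for calls to the two language models (stubbed here with canned responses), and the stopping rule (agree, or give up after a round limit) is an assumption for the sketch.

```python
# Minimal sketch of a dual-agent debate loop (illustrative only; the
# two "models" are stubbed with canned responses — in practice these
# would be calls to two separate LLMs).

def propose(question, feedback=None):
    # Proposer (the "Alex" role): answer with step-by-step reasoning.
    # If the critic has flagged a misread rule, revise the answer.
    if feedback and "Z causes Y" in feedback:
        return {"answer": "no", "reasoning": "Revised: Z, not X, causes Y."}
    return {"answer": "yes", "reasoning": "X causes Y, so the effect holds."}

def critique(question, proposal):
    # Critic (the "Sam" role): check the proposer's reasoning against
    # the stated causal rules and either agree or object with feedback.
    if "X causes Y" in proposal["reasoning"]:
        return {"agrees": False, "feedback": "The rules say Z causes Y, not X."}
    return {"agrees": True, "feedback": None}

def debate(question, max_rounds=5):
    # Alternate proposal and critique until the agents agree on an
    # answer or the round limit is reached.
    feedback = None
    for _ in range(max_rounds):
        proposal = propose(question, feedback)
        verdict = critique(question, proposal)
        if verdict["agrees"]:
            return proposal["answer"]
        feedback = verdict["feedback"]
    return proposal["answer"]  # fall back to the last proposal

print(debate("Does X cause Y under these rules?"))  # -> no
```

With real models, `propose` and `critique` would each send the question plus the debate history to a different LLM; the canned logic here just makes the revision-after-critique flow concrete.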

3. The Results: Two Heads (and Two Voices) Are Better Than One

The study found that this "debate" made the AI much smarter, especially on the hardest questions.

  • The "Underdog" Wins: The weaker AI (DeepSeek-R1) was like a student who knew the basics but got tripped up by complex questions. When it debated with the stronger AI (Qwen3), it learned a lot. Its accuracy jumped from 78% to 87%.
  • The "Star" Still Improves: Even the stronger AI (Qwen3) got better, going from 84% to 89%. It turns out, even the smartest person can benefit from having a friend point out their blind spots.
  • The Hardest Questions: The biggest improvement happened on Counterfactuals (the "What if?" questions). These are the hardest for AI because they require imagining a world that isn't real. The debate helped them get these right much more often.

4. A Funny Twist: The "Silent Partner"

The researchers noticed something funny about the debate style.

  • Qwen3 was like a lawyer. It gave long, detailed arguments, explained its reasoning, and tried hard to convince the other model.
  • DeepSeek-R1 was like a shy student. It often gave very short answers, sometimes just "Yes" or "No," even though it was thinking deeply inside its "brain."

Because DeepSeek-R1 didn't explain why it thought something, it was harder for Qwen3 to learn from it. But when DeepSeek-R1 did change its mind, it was usually because Qwen3's long, logical explanation was just too hard to argue with!

5. Why This Matters

This paper shows that AI doesn't have to be a lone genius. By creating a system where AI agents challenge each other, we can get much better results.

It's like peer review in science. One scientist proposes a theory, and another tries to break it. If the theory survives the attack, it's probably sound. The authors show that this method works for AI, too, making the models much better at solving complex "cause and effect" puzzles than they were before.

In short: The paper shows that if you make two AI models argue with each other, they stop guessing and start thinking, leading to much smarter answers.