AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning

The AILS-NTUA team achieved first place in SemEval-2026 Task 12 with a 0.95 accuracy score by deploying a three-stage system that integrates graph-based retrieval, reflective prompt evolution for LLM-driven abductive reasoning, and post-hoc consistency enforcement. Their cross-model analysis of 14 models also identified systematic failure modes in multi-label causal reasoning.

Nikolas Karafyllis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou

Published 2026-03-05

Here is an explanation of the paper, "AILS-NTUA at SemEval-2026 Task 12," translated into everyday language with some creative analogies.

The Big Picture: The "Detective" Challenge

Imagine you are a detective trying to solve a mystery. You are given a specific event (like "The President resigned") and a massive pile of newspaper clippings, some of which are relevant and some of which are just noise (distractors). Your job is to look at four possible explanations and pick the one(s) that actually caused the event.

This is exactly what SemEval-2026 Task 12 asked computer programs (Large Language Models or LLMs) to do. It's called Abductive Reasoning. In simple terms, it's the art of saying, "Given what happened, what is the most likely story that explains why it happened?"

The team from AILS-NTUA (a lab at the National Technical University of Athens) built a system that didn't just guess; it acted like a super-detective. They won first place with a score of 0.95 out of 1.00.


How Their "Super-Detective" System Works

Instead of just asking the AI to "read and guess," they built a three-stage pipeline. Think of it as a three-person team working together:

Stage 1: The Librarian (Retrieval & Filtering)

The Problem: The AI was drowning in information. The "context" provided thousands of words, many of which were irrelevant. It's like trying to find a specific needle in a haystack that is also on fire.
The Solution: They built a Graph Map.

  • Imagine every document is a house.
  • If two houses have similar stories, they are connected by a road.
  • The team didn't just look at the house closest to the query; they looked at the neighborhood. They started at the most relevant houses and walked down every connected road to find the whole "connected community" of documents.
  • The Analogy: If you are looking for the cause of a fire, you don't just look at the house that burned down. You look at the house next door that had a faulty wire, and the house across the street that had a gas leak. This method filtered out the "distractors" (irrelevant news) and kept the "connected community" of facts.
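The paper doesn't print its retrieval code, but the "neighborhood walk" can be sketched in a few lines. This is a minimal illustration, not the team's implementation: the word-overlap (Jaccard) similarity, the threshold, and all function names are stand-ins for the embedding similarity and tuned parameters a real system would use. The key idea is the same, though: connect similar documents into a graph, seed from the documents closest to the query, and walk every connected road.

```python
from collections import deque

def jaccard(a, b):
    """Word-overlap similarity between two documents (a toy stand-in
    for the embedding similarity a real system would use)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / (len(wa | wb) or 1)

def retrieve_neighborhood(query, docs, edge_threshold=0.2, seed_k=1):
    """Build a similarity graph over docs, then BFS from the documents
    most similar to the query to collect the whole connected community."""
    n = len(docs)
    # Every document is a "house"; a "road" connects two houses
    # whose stories overlap enough.
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(docs[i], docs[j]) >= edge_threshold:
                adj[i].append(j)
                adj[j].append(i)
    # Start at the houses closest to the query...
    seeds = sorted(range(n), key=lambda i: jaccard(query, docs[i]),
                   reverse=True)[:seed_k]
    # ...then walk down every connected road (breadth-first search).
    seen, queue = set(seeds), deque(seeds)
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return [docs[i] for i in sorted(seen)]
```

Documents with no road into the neighborhood (the distractors) simply never get visited by the walk, which is what filters them out.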

Stage 2: The Analyst (The Reasoning Engine)

The Problem: Even with the right documents, AI models often get lazy or confused. They might jump to conclusions or miss subtle details.
The Solution: They used a technique called "Reflective Prompting."

  • Instead of letting the AI blurt out an answer, they forced it to write a "scratchpad" first.
  • The Analogy: It's like a student taking a test. Instead of just bubbling in "A," the student is forced to write: "Option A is wrong because the text says X. Option B looks good, but let me check if it's strong enough..."
  • They used a tool called GEPA (a smart optimizer) to evolve the best possible set of instructions for the AI, teaching it to be a critical thinker rather than a guesser.
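In practice, "forcing a scratchpad" comes down to the shape of the prompt and a parser that only trusts the final answer line. The sketch below is a hand-written stand-in for instructions that GEPA would evolve automatically; the template wording, the `ANSWER:` convention, and both helper functions are hypothetical, not taken from the paper.

```python
# Hypothetical reflective prompt: the model must argue about every
# option before it is allowed to commit to an answer.
REFLECTIVE_PROMPT = """You are solving an abductive reasoning question.
Event: {event}
Context: {context}
Options:
{options}

Before answering, write a scratchpad: for EACH option, quote the
evidence for and against it, then say whether it is a plausible cause.
Only after the scratchpad, output a final line of the form:
ANSWER: <comma-separated option letters, or NONE>
"""

def build_prompt(event, context, options):
    """Fill the template; options is a list of (letter, text) pairs."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in options)
    return REFLECTIVE_PROMPT.format(event=event, context=context,
                                    options=opts)

def parse_answer(model_output):
    """Ignore the scratchpad; keep only the final ANSWER line."""
    for line in reversed(model_output.strip().splitlines()):
        if line.upper().startswith("ANSWER:"):
            picks = line.split(":", 1)[1].strip()
            if picks.upper() == "NONE":
                return []
            return [p.strip() for p in picks.split(",")]
    return []
```

GEPA's role in the real system is to mutate and select instruction text like the template above, keeping whichever variants score best on held-out examples, so the "be a critical thinker" behavior is learned rather than hand-tuned.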

Stage 3: The Editor (Post-Hoc Consistency)

The Problem: Even smart AI makes silly logical mistakes. For example, it might pick "None of the above" and "Option A" at the same time (which is a contradiction), or it might pick "Option A" but ignore "Option B" even though they are the exact same sentence.
The Solution: They added a Logic Police step.

  • After the AI gave its answer, this step ran a set of 8 "rules" to check for contradictions.
  • The Analogy: Imagine a teacher grading a test. If the student writes "The answer is A" but also writes "The answer is None," the teacher crosses out the "None" because the rules say they can't both be true. This step fixed the AI's logical slips without needing to re-run the whole AI.
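The two contradictions described above map directly onto mechanical checks that run on the answer, not on the AI. Here is a sketch of just those two of the eight rules; the function name and the `"NONE"` label are illustrative choices, not the paper's actual code.

```python
def enforce_consistency(picks, options):
    """Post-hoc 'logic police': repair contradictions in the model's
    predicted labels. picks is a set of option letters (plus "NONE");
    options maps letters to their answer text. Two illustrative rules
    out of the eight described in the paper."""
    picks = set(picks)
    # Rule 1: "None of the above" contradicts any concrete pick;
    # keep the concrete answer and cross out the "None".
    if "NONE" in picks and len(picks) > 1:
        picks.discard("NONE")
    # Rule 2: two options with identical text must get the same label,
    # so picking one implies picking the other.
    for a, text_a in options.items():
        for b, text_b in options.items():
            if a != b and text_a.strip().lower() == text_b.strip().lower():
                if a in picks:
                    picks.add(b)
    return picks
```

Because these rules only rewrite the output, they cost essentially nothing: no second LLM call is needed to fix a logical slip.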

What They Learned: The "Human" Flaws of AI

The most interesting part of the paper isn't just that they won, but why the other AI models failed. The team analyzed 14 different AI models and found they all shared the same three "bad habits" (inductive biases):

  1. The "Last Thing" Bias (Proximate Cause):

    • The Flaw: If a chain of events happened (A caused B, which caused C), the AI often picked B (the thing that happened right before C) and ignored A (the root cause).
    • The Analogy: If a car crashes because the brakes failed, which happened because the mechanic forgot to tighten a bolt, the AI often blames the brakes failing and forgets to blame the mechanic. It focuses on the immediate trigger, not the root cause.
  2. The "Drama" Bias (Salience Bias):

    • The Flaw: The AI loved dramatic, exciting causes over boring, subtle ones.
    • The Analogy: If a politician resigns because of a boring budget error and a scandalous affair, the AI will almost always pick the affair because it's more "newsworthy," even if the budget error was the actual legal reason.
  3. The "Half-Story" Bias (Causal Chain Incompleteness):

    • The Flaw: When the answer required picking multiple causes (e.g., "Both the rain and the poor drainage caused the flood"), the AI usually picked just one.
    • The Analogy: It's like saying a cake failed because of "bad eggs" and forgetting to mention "the oven was broken." The AI is too conservative and rarely admits that multiple things can be true at once.

The Takeaway

The winning system worked because it didn't rely on the AI to be perfect. Instead, it built a safety net:

  1. The Librarian made sure the AI had the right books.
  2. The Analyst forced the AI to think before speaking.
  3. The Editor fixed the AI's logical mistakes after it spoke.

By combining these three steps, they turned a smart but flawed AI into a near-perfect detective, proving that in the world of AI, process often beats raw intelligence.