AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning

The AILS-NTUA team achieved first place in SemEval-2026 Task 12 with a 0.95 accuracy score by deploying a three-stage system that integrates graph-based retrieval, reflective prompt evolution for LLM-driven abductive reasoning, and post-hoc consistency enforcement. Their cross-model analysis of 14 models also identified systematic failure modes in multi-label causal reasoning.

Nikolas Karafyllis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou

Published 2026-03-05

Here is an explanation of the paper, "AILS-NTUA at SemEval-2026 Task 12," translated into everyday language with some creative analogies.

The Big Picture: The "Detective" Challenge

Imagine you are a detective trying to solve a mystery. You are given a specific event (like "The President resigned") and a massive pile of newspaper clippings, some of which are relevant and some of which are just noise (distractors). Your job is to look at four possible explanations and pick the one(s) that actually caused the event.

This is exactly what SemEval-2026 Task 12 asked computer programs (Large Language Models or LLMs) to do. It's called Abductive Reasoning. In simple terms, it's the art of saying, "Given what happened, what is the most likely story that explains why it happened?"

The team from AILS-NTUA (a lab at the National Technical University of Athens) built a system that didn't just guess; it acted like a super-detective. They won first place with a score of 0.95 out of 1.00.


How Their "Super-Detective" System Works

Instead of just asking the AI to "read and guess," they built a three-stage pipeline. Think of it as a three-person team working together:

Stage 1: The Librarian (Retrieval & Filtering)

The Problem: The AI was drowning in information. The "context" provided thousands of words, many of which were irrelevant. It's like trying to find a specific needle in a haystack that is also on fire.
The Solution: They built a Graph Map.

  • Imagine every document is a house.
  • If two houses have similar stories, they are connected by a road.
  • The team didn't just look at the house closest to the query; they looked at the neighborhood. They started at the most relevant houses and walked down every connected road to find the whole "connected community" of documents.
  • The Analogy: If you are looking for the cause of a fire, you don't just look at the house that burned down. You look at the house next door that had a faulty wire, and the house across the street that had a gas leak. This method filtered out the "distractors" (irrelevant news) and kept the "connected community" of facts.
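The paper doesn't print its retrieval code, but the "neighborhood walk" can be sketched in a few lines. This is a minimal illustration, not the team's implementation: the word-overlap (Jaccard) similarity, the threshold, and all function names are stand-ins for the embedding similarity and tuned parameters a real system would use. The key idea is the same, though: connect similar documents into a graph, seed from the documents closest to the query, and walk every connected road.

```python
from collections import deque

def jaccard(a, b):
    """Word-overlap similarity between two documents (a toy stand-in
    for the embedding similarity a real system would use)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / (len(wa | wb) or 1)

def retrieve_neighborhood(query, docs, edge_threshold=0.2, seed_k=1):
    """Build a similarity graph over docs, then BFS from the documents
    most similar to the query to collect the whole connected community."""
    n = len(docs)
    # Every document is a "house"; a "road" connects two houses
    # whose stories overlap enough.
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(docs[i], docs[j]) >= edge_threshold:
                adj[i].append(j)
                adj[j].append(i)
    # Start at the houses closest to the query...
    seeds = sorted(range(n), key=lambda i: jaccard(query, docs[i]),
                   reverse=True)[:seed_k]
    # ...then walk down every connected road (breadth-first search).
    seen, queue = set(seeds), deque(seeds)
    while queue:
        i = queue.popleft()
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return [docs[i] for i in sorted(seen)]
```

Documents with no road into the neighborhood (the distractors) simply never get visited by the walk, which is what filters them out.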

Stage 2: The Analyst (The Reasoning Engine)

The Problem: Even with the right documents, AI models often get lazy or confused. They might jump to conclusions or miss subtle details.
The Solution: They used a technique called "Reflective Prompting."

  • Instead of letting the AI blurt out an answer, they forced it to write a "scratchpad" first.
  • The Analogy: It's like a student taking a test. Instead of just bubbling in "A," the student is forced to write: "Option A is wrong because the text says X. Option B looks good, but let me check if it's strong enough..."
  • They used a tool called GEPA (a smart optimizer) to evolve the best possible set of instructions for the AI, teaching it to be a critical thinker rather than a guesser.
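In practice, "forcing a scratchpad" comes down to the shape of the prompt and a parser that only trusts the final answer line. The sketch below is a hand-written stand-in for instructions that GEPA would evolve automatically; the template wording, the `ANSWER:` convention, and both helper functions are hypothetical, not taken from the paper.

```python
# Hypothetical reflective prompt: the model must argue about every
# option before it is allowed to commit to an answer.
REFLECTIVE_PROMPT = """You are solving an abductive reasoning question.
Event: {event}
Context: {context}
Options:
{options}

Before answering, write a scratchpad: for EACH option, quote the
evidence for and against it, then say whether it is a plausible cause.
Only after the scratchpad, output a final line of the form:
ANSWER: <comma-separated option letters, or NONE>
"""

def build_prompt(event, context, options):
    """Fill the template; options is a list of (letter, text) pairs."""
    opts = "\n".join(f"{letter}. {text}" for letter, text in options)
    return REFLECTIVE_PROMPT.format(event=event, context=context,
                                    options=opts)

def parse_answer(model_output):
    """Ignore the scratchpad; keep only the final ANSWER line."""
    for line in reversed(model_output.strip().splitlines()):
        if line.upper().startswith("ANSWER:"):
            picks = line.split(":", 1)[1].strip()
            if picks.upper() == "NONE":
                return []
            return [p.strip() for p in picks.split(",")]
    return []
```

GEPA's role in the real system is to mutate and select instruction text like the template above, keeping whichever variants score best on held-out examples, so the "be a critical thinker" behavior is learned rather than hand-tuned.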

Stage 3: The Editor (Post-Hoc Consistency)

The Problem: Even smart AI makes silly logical mistakes. For example, it might pick "None of the above" and "Option A" at the same time (which is a contradiction), or it might pick "Option A" but ignore "Option B" even though they are the exact same sentence.
The Solution: They added a Logic Police step.

  • After the AI gave its answer, this step ran a set of 8 "rules" to check for contradictions.
  • The Analogy: Imagine a teacher grading a test. If the student writes "The answer is A" but also writes "The answer is None," the teacher crosses out the "None" because the rules say they can't both be true. This step fixed the AI's logical slips without needing to re-run the whole AI.
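The two contradictions described above map directly onto mechanical checks that run on the answer, not on the AI. Here is a sketch of just those two of the eight rules; the function name and the `"NONE"` label are illustrative choices, not the paper's actual code.

```python
def enforce_consistency(picks, options):
    """Post-hoc 'logic police': repair contradictions in the model's
    predicted labels. picks is a set of option letters (plus "NONE");
    options maps letters to their answer text. Two illustrative rules
    out of the eight described in the paper."""
    picks = set(picks)
    # Rule 1: "None of the above" contradicts any concrete pick;
    # keep the concrete answer and cross out the "None".
    if "NONE" in picks and len(picks) > 1:
        picks.discard("NONE")
    # Rule 2: two options with identical text must get the same label,
    # so picking one implies picking the other.
    for a, text_a in options.items():
        for b, text_b in options.items():
            if a != b and text_a.strip().lower() == text_b.strip().lower():
                if a in picks:
                    picks.add(b)
    return picks
```

Because these rules only rewrite the output, they cost essentially nothing: no second LLM call is needed to fix a logical slip.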

What They Learned: The "Human" Flaws of AI

The most interesting part of the paper isn't just that they won, but why the other AI models failed. The team analyzed 14 different AI models and found they all shared the same three "bad habits" (inductive biases):

  1. The "Last Thing" Bias (Proximate Cause):

    • The Flaw: If a chain of events happened (A caused B, which caused C), the AI often picked B (the thing that happened right before C) and ignored A (the root cause).
    • The Analogy: If a car crashes because the brakes failed, which happened because the mechanic forgot to tighten a bolt, the AI often blames the brakes failing and forgets to blame the mechanic. It focuses on the immediate trigger, not the root cause.
  2. The "Drama" Bias (Salience Bias):

    • The Flaw: The AI loved dramatic, exciting causes over boring, subtle ones.
    • The Analogy: If a politician resigns because of a boring budget error and a scandalous affair, the AI will almost always pick the affair because it's more "newsworthy," even if the budget error was the actual legal reason.
  3. The "Half-Story" Bias (Causal Chain Incompleteness):

    • The Flaw: When the answer required picking multiple causes (e.g., "Both the rain and the poor drainage caused the flood"), the AI usually picked just one.
    • The Analogy: It's like saying a cake failed because of "bad eggs" and forgetting to mention "the oven was broken." The AI is too conservative and rarely admits that multiple things can be true at once.

The Takeaway

The winning system worked because it didn't rely on the AI to be perfect. Instead, it built a safety net:

  1. The Librarian made sure the AI had the right books.
  2. The Analyst forced the AI to think before speaking.
  3. The Editor fixed the AI's logical mistakes after it spoke.

By combining these three steps, they turned a smart but flawed AI into a near-perfect detective, proving that in the world of AI, process often beats raw intelligence.