Imagine you are trying to solve a complex mystery, but you have a very smart detective (the AI) who knows a lot of general facts but doesn't have access to the specific, up-to-date police reports or case files you need. This is the core idea of Retrieval-Augmented Generation (RAG): giving a smart AI a "library" of documents to read before it answers your question.
However, real life isn't just one question; it's a conversation. You ask a question, the AI answers, you follow up with "What about that?" or "How much does it cost?", and the AI has to remember what you were talking about three turns ago. This is Multi-Turn RAG.
The team from NTUA (the "AILS-NTUA" team) built a system for a competition called SemEval-2026 Task 8. They didn't just build a better detective; they built a super-efficient, multi-step investigation team. Here is how their system works, broken down into simple analogies.
1. The Search Team: "Don't Use One Detective, Use Five"
The Problem: When you ask a follow-up question like "How much does it cost?", the word "it" is confusing. If you just ask a search engine, it might get lost. Also, if you ask five different search engines, they might all give you slightly different lists of documents, and mixing them up can be messy.
Their Solution:
Instead of hiring five different search engines, they hired one very good search engine but asked it to look for the answer in five different ways at the same time.
- The Minimalist: "Just fix the grammar and pronouns." (e.g., changing "it" to "IBM Cloud Object Storage").
- The Specialist: "Use the specific jargon of this industry."
- The Dreamer (HyDE): "Imagine what the answer looks like, then search for that."
- The Thinker (CoT): "Break this down step-by-step."
- The Keyword Hunter: "Just grab the most important names and numbers."
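The five strategies above can be pictured as five prompt templates, each turning the same user question into a different search query. This is an illustrative sketch only: the prompt wording and the `make_query_variants` helper are assumptions, not the team's actual prompts; in a real system each filled template would be sent to the LLM and its output used as the search query.

```python
# Hypothetical prompt templates for the five reformulation strategies.
# The wording is illustrative, not taken from the team's system.
REWRITE_PROMPTS = {
    "minimalist": "Rewrite the question, resolving pronouns using the chat history: {question}",
    "specialist": "Rewrite the question using the domain's technical terminology: {question}",
    "hyde":       "Write a short passage that would plausibly answer this question: {question}",
    "cot":        "Break this question down into its underlying sub-questions: {question}",
    "keywords":   "Extract only the key names, entities, and numbers from: {question}",
}

def make_query_variants(question: str) -> dict[str, str]:
    """Fill each template for one user turn. In practice, each filled
    prompt would go to one LLM, and each LLM output would become one
    of the five search queries sent to the same retriever."""
    return {name: tpl.format(question=question) for name, tpl in REWRITE_PROMPTS.items()}
```

The point of the design is that retrieval diversity comes from the queries, not from running five different search engines.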
The Magic Trick (Nested Fusion):
Usually, if you mix five different search results, you get noise. But this team used a special "voting system" (called Nested Reciprocal Rank Fusion).
- Think of it like a jury. The "Minimalist" and "Specialist" are the reliable, steady jurors who rarely make mistakes. The "Dreamer" and "Thinker" are the wildcards—they might find a brilliant clue no one else saw, or they might go off-track.
- The system gives the steady jurors more votes, but it still listens to the wildcards to make sure they don't miss anything. This way, they get the best of both worlds: high precision (few wrong documents in the list) and high recall (few relevant documents missed).
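The weighted voting scheme can be sketched with standard Reciprocal Rank Fusion applied in two stages: fuse each juror group separately, then fuse the two group rankings with a heavier weight on the steady group. This is a minimal illustration, not the team's implementation; the weight of 2.0 and the conventional RRF constant k=60 are assumptions.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each document's score is the sum of
    1 / (k + rank) over every ranked list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def nested_rrf(steady_rankings, wildcard_rankings, steady_weight=2.0, k=60):
    """Two-level fusion: fuse the steady group and the wildcard group
    separately, then fuse the two fused lists, letting the steady
    group's ranking count steady_weight times as much."""
    fused_groups = [
        (steady_weight, rrf(steady_rankings, k)),
        (1.0, rrf(wildcard_rankings, k)),
    ]
    scores = defaultdict(float)
    for weight, ranking in fused_groups:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that the reliable jurors agree on rises to the top, while a document only the wildcards found still enters the final list, just with less influence.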
Result: They won 1st Place in the search task. They proved that asking one good engine the same question phrased in many different ways beats asking many different engines.
2. The Writing Team: "The Drafting and Editing Process"
The Problem: Once the AI finds the documents, it has to write an answer. But AI often "hallucinates" (makes things up) or copies the text too robotically.
Their Solution: They treated writing like a professional newsroom, not a one-person show.
- The Fact-Checker: Before writing, a module scans the documents and pulls out only the exact sentences that contain the answer. It throws away the fluff.
- The Dual Drafts: The AI writes two different versions of the answer.
- Draft A: Very strict, sticking closely to the facts (like a lawyer).
- Draft B: A bit more natural and conversational (like a friend).
- The Judges: Two different "editors" review the drafts.
- The Technical Judge: Checks, "Did you make anything up? Did you use the facts?"
- The User Satisfaction Judge: Checks, "Does this sound natural? Would a human like this?"
- The Final Polish: A tiny editor fixes small issues, like making sure the answer isn't too short or too long, and removes phrases like "I'm not sure" if the answer is actually known.
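The draft-and-judge step above boils down to a selection rule: score each draft with both judges and keep the winner. This is a hedged sketch under assumptions of mine, not the team's code: the judge functions stand in for LLM calls returning a 0-to-1 score, and simply summing the two scores is an illustrative aggregation choice.

```python
def select_answer(drafts, faithfulness_judge, satisfaction_judge):
    """Pick the best draft by combining two judge scores.

    faithfulness_judge: did the draft stick to the extracted facts?
    satisfaction_judge: does the draft read naturally to a human?
    Both stand in for LLM judge calls returning a score in [0, 1];
    summing them equally is an assumption for illustration.
    """
    def combined_score(draft):
        return faithfulness_judge(draft) + satisfaction_judge(draft)
    return max(drafts, key=combined_score)
```

Separating drafting from judging is what lets one draft be strict and the other conversational without either style dominating by default.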
Result: They came in 2nd Place for the writing task. By separating the "finding facts" from "writing the story," they stopped the AI from lying.
3. The "Can We Even Answer?" Gatekeeper
The Big Challenge: Sometimes, the documents just don't have the answer. If the AI tries to guess, it fails. The hardest part of the competition was knowing when to say "I don't know."
Their Insight: They realized that the biggest bottleneck wasn't finding the documents; it was deciding if the documents were enough.
- They built a "Gatekeeper" system with three judges. If the documents are weak, the Gatekeeper stops the process and politely says, "The information isn't available," rather than making up a fake answer.
- They found that in real conversations, errors pile up. If the AI misses a document in Turn 1, it gets confused in Turn 2, and the whole conversation falls apart.
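The Gatekeeper's job reduces to a majority vote over sufficiency judgments. The sketch below is illustrative only: the `judges` are placeholders for LLM calls that say whether the retrieved evidence can answer the question, and requiring 2 of 3 yes votes is an assumed threshold.

```python
def sufficiency_gate(question, evidence, judges, min_yes=2):
    """Ask each judge whether `evidence` is enough to answer `question`.
    Each judge is a callable returning True (sufficient) or False.
    Proceed to answer generation only on a majority of yes votes;
    otherwise the system should abstain with "The information isn't
    available" instead of guessing."""
    yes_votes = sum(1 for judge in judges if judge(question, evidence))
    return yes_votes >= min_yes
```

Abstaining early also limits the error pile-up the team observed: a refusal in Turn 1 is recoverable, while a fabricated answer poisons every later turn.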
The Big Takeaway
The team's secret sauce wasn't using a bigger, more expensive AI model. It was about process and stability.
- Analogy: Imagine trying to find a specific needle in a haystack.
- Old way: Throw a huge net (big model) and hope you catch it.
- Their way: Send in a team of five people with different metal detectors (query diversity), have them vote on the best spot (fusion), then have a fact-checker verify the needle is real (evidence extraction), and finally, have a judge decide if it's actually the right needle (multi-judge selection).
Why it matters: This system shows that for AI to be truly useful in real conversations, it needs to be careful, self-correcting, and aware of its own limits. It's not about being the smartest genius; it's about being the most reliable partner.