Imagine you are hiring a super-smart research assistant to answer tricky questions for you. You tell them, "Go look this up on the internet and give me the right answer."
In the past, if you asked a question like "Who won the Super Bowl last year?", the internet would give you a clear, big sign saying "Kansas City Chiefs." Your assistant would read the sign and get it right. Easy peasy.
But the real world isn't like that. Sometimes, the internet is a messy, noisy bazaar. One vendor shouts "The Chiefs won!" while another screams "No, it was the Eagles!" and a third is selling you a fake ticket from 1995.
SEALQA is a new "exam" designed to test if our smartest AI assistants can navigate this messy bazaar without getting confused, lying to you, or giving up.
Here is the breakdown of this paper in simple terms:
1. The Problem: The "Noisy Bazaar"
Current AI models (like the ones in your phone or computer) are great at memorizing facts they learned in school. But when they have to go "live" on the internet to find new info, they often fail.
- The Issue: The internet is full of conflicting info, outdated news, and misleading articles.
- The Test: The researchers created SEALQA, a set of questions specifically designed to trigger this chaos: questions where a simple Google search would give you three different, contradictory answers. (A toy sketch of what one of these items looks like follows below.)
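To make that concrete, here is a minimal Python sketch of what a SEALQA-style item might look like: one question packed together with conflicting and outdated search snippets. The question, sources, and snippets are all made up for illustration; they are not actual benchmark items.

```python
# Toy SEALQA-style item: the question has one true answer, but the
# retrieved snippets conflict. (All snippets here are made up.)
QUESTION = "Who won the championship last year?"

SNIPPETS = [
    {"source": "sports-blog.example", "text": "The Chiefs won last year, no question."},
    {"source": "fan-forum.example", "text": "No, it was the Eagles!"},
    {"source": "old-news.example", "text": "Full recap of the 1995 final..."},  # outdated
]

def build_prompt(question: str, snippets: list[dict]) -> str:
    """Pack the question and the conflicting evidence into a single prompt."""
    evidence = "\n".join(
        f"[{i + 1}] ({s['source']}) {s['text']}" for i, s in enumerate(snippets)
    )
    return (
        "Answer the question using the search results below.\n"
        "Warning: results may conflict or be outdated.\n\n"
        f"Search results:\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )

print(build_prompt(QUESTION, SNIPPETS))
```

A model that simply parrots the loudest snippet fails here; the benchmark rewards weighing the sources against each other.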
2. The Three Levels of the Exam
The researchers didn't just make one test; they made three, like a video game with increasing difficulty levels:
- Level 1: SEAL-0 (The "Impossible" Level)
- The Analogy: This is like asking a question where every sign in the bazaar is lying.
- The Goal: They picked questions so tricky that even the world's best AI models (like GPT-4.1) scored essentially 0% on their first attempt. It's the "zero-accuracy" benchmark: if an AI can't solve this, it's not ready for the real world.
- Level 2: SEAL-HARD (The "Tough" Level)
- The Analogy: The bazaar is still noisy, but maybe one or two vendors are telling the truth.
- The Goal: A broader set of difficult questions. Even the "smart" AI models struggle here, often getting less than half right.
- Level 3: LONGSEAL (The "Needle in a Haystack" Level)
- The Analogy: Imagine the bazaar has 50 stalls. Only one stall has the correct answer. The other 49 are selling junk, but they look exactly like the real thing.
- The Goal: Can the AI ignore the 49 fake stalls and find the one real one without getting overwhelmed? (A sketch of how such a haystack is assembled follows this list.)
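Here is a minimal sketch of how a LONGSEAL-style haystack might be built: one gold document shuffled in among dozens of plausible distractors, with its final position recorded. The document contents and counts are illustrative assumptions, not the paper's actual data pipeline.

```python
import random

# Minimal sketch of a LONGSEAL-style haystack: one gold document shuffled
# in among many plausible distractors. (Contents and counts are made up.)

def build_long_context(gold_doc: str, distractors: list[str], seed: int = 0):
    """Mix the gold document into the distractors and report where it landed."""
    rng = random.Random(seed)
    docs = distractors + [gold_doc]
    rng.shuffle(docs)
    gold_pos = docs.index(gold_doc)  # assumes the gold text is unique
    context = "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))
    return context, gold_pos

gold = "Official record: Team A won the 2024 final, 31-28."
fakes = [f"Unverified recap #{i}: Team B won the 2024 final." for i in range(49)]

context, pos = build_long_context(gold, fakes)
print(f"Packed {len(fakes) + 1} documents; the gold one sits at position {pos + 1}.")
```

Because the harness records where the gold document landed, the same setup can test the "lost in the middle" idea discussed below: just bucket accuracy by gold position.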
3. The Shocking Results
The researchers tested the "frontier" models (the smartest AIs available, including new ones like GPT-5 and DeepSeek-R1). Here is what they found:
- The "Overthinker" Trap: You might think that if an AI thinks harder and longer (using more computer power), it will get smarter. Wrong.
- Analogy: Imagine a detective who starts reading every single clue, including the fake ones. The more he reads, the more confused he gets. He starts doubting the truth because the fake clues look so convincing.
- Result: For many models, giving them more time to think actually made them worse at finding the answer. They got tangled in the noise. (You can probe this yourself; see the reasoning-effort sketch after this list.)
- The "Search" Paradox: Giving the AI a search engine didn't always help.
- Analogy: It's like giving a child a map to a city where half the streets are blocked and the signs are wrong. If the child just follows the first sign they see, they get lost. If they try to read every sign, they get a headache.
- Result: Some advanced models actually did worse when they were allowed to search, because the search results were so noisy that they got tricked. (The toy agent sketched after this list shows the failure mode in miniature.)
- The "Lost in the Middle" Myth: People thought AI would forget the answer if it was buried in the middle of a long document.
- Result: Surprisingly, the AI didn't forget the middle. Instead, it just couldn't tell the difference between the "real" document and the "fake" ones. It's not a memory problem; it's a trust problem.
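To make the "Overthinker" trap concrete, here is a minimal sketch of sweeping a model's test-time thinking on one noisy question. It assumes the OpenAI Python client and a reasoning model that accepts the `reasoning_effort` parameter; the prompt is a made-up stand-in, not a real SEALQA item.

```python
# Sweep test-time "thinking" and compare answers on one noisy question.
# Assumes a valid OPENAI_API_KEY; the prompt is an illustrative stand-in.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Search results conflict: [1] says the Chiefs won last year; "
    "[2] says the Eagles won; [3] is a recap of the 1995 final. Who actually won?"
)

for effort in ["low", "medium", "high"]:
    resp = client.chat.completions.create(
        model="o4-mini",          # any reasoning model that supports the knob
        reasoning_effort=effort,  # the "think harder" dial
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"{effort:>6}: {resp.choices[0].message.content}")
```

If the paper's finding holds, the "high" answer is not guaranteed to be better: extra thinking can be spent rationalizing the fake clues.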
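And here is the "Search" paradox in caricature. Every function below is a stand-in (no real search engine or model is called): the naive agent trusts whatever ranks first, while the cautious one checks for an authoritative source, a crude proxy for the source-weighing these benchmarks reward.

```python
# Toy illustration of the "search paradox": an agent that blindly trusts
# the top search hit inherits all of its noise.

def noisy_search(query: str) -> list[str]:
    # Stand-in for a web search: the top hit is wrong, the truth is buried.
    return [
        "BREAKING: The Eagles won!",        # loud but wrong
        "Ticket stub from the 1995 final",  # irrelevant
        "Official league site: the Chiefs won the most recent final.",
    ]

def naive_agent(question: str) -> str:
    """Trusts whatever ranks first: fast, and frequently fooled."""
    return noisy_search(question)[0]

def cautious_agent(question: str) -> str:
    """Prefers results that look authoritative before falling back."""
    hits = noisy_search(question)
    official = [h for h in hits if "Official" in h]
    return official[0] if official else hits[0]

q = "Who won the most recent final?"
print("naive:   ", naive_agent(q))
print("cautious:", cautious_agent(q))
```

Real agents are far more sophisticated, but the lesson carries over: a search tool only helps if the model can rank trust, not just relevance.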
4. The Human Comparison
The researchers even asked real humans to take the test.
- Humans: Did clearly better than the AI, but were still far from perfect. Even humans got confused by the noise.
- The Gap: The best AI models got about 28% right. The best humans got about 64% right.
- Takeaway: We are still a long way from having an AI that can reliably navigate the messy, conflicting reality of the internet.
5. Why This Matters
This paper is a wake-up call. We are currently building AI that is amazing at writing poems and coding, but terrible at being a reliable fact-checker in a chaotic world.
If we want AI to help us make medical decisions, legal arguments, or news summaries, it needs to learn how to:
- Ignore the noise.
- Spot the lies.
- Trust the right sources, even when they are buried under 50 fake ones.
In short: SEALQA is the "driver's test" for AI. Right now, most of our smartest cars are crashing because they can't handle traffic jams and confusing road signs. The researchers are handing us the blueprint to build better drivers.