SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

The paper introduces SimpleQA Verified, a rigorously filtered 1,000-prompt benchmark designed to overcome the limitations of OpenAI's original SimpleQA and provide a more reliable evaluation of LLM factuality. On it, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6.

Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan Das

Published Wed, 11 Ma

Imagine you are trying to test how well a student knows the world without letting them use a textbook or the internet. You want to see what's actually stored in their brain (their "parametric knowledge").

For a while, the AI community used a test called SimpleQA to do this. But, much like a high school exam that was written by a single teacher who had a weird obsession with 19th-century poets and kept asking the same question in three different ways, the original test had some major flaws. It was noisy, biased, and sometimes the answer key was just wrong.

Enter SimpleQA Verified. Think of this as the "Gold Standard" re-examination of that test, created by Google DeepMind to fix the mess.

Here is the story of how they cleaned it up, explained with some everyday analogies:

1. The "Garbage In, Garbage Out" Problem

The original SimpleQA was like a bucket of fruit that had some rotten apples, some double-counted oranges, and a whole section dedicated only to pineapples.

  • The Issue: The original questions were too similar (redundant), came from a narrow set of sources (biased), and sometimes had the wrong answers (noisy labels).
  • The Fix: The researchers acted like a team of strict but fair librarians. They went through the bucket and:
    • Removed Duplicates: If two questions were near-duplicates built from the same template (e.g., "When was City A founded?" and "When was City B founded?", which test the same narrow skill with different fill-ins), they kept only the hardest one.
    • Respected the Owners: Some websites had put up "Do Not Disturb" signs (robots.txt) saying, "Hey AI, don't use our content to train your brain." The researchers respected that and removed questions relying on those sites.
    • Balanced the Menu: They made sure the test wasn't just about Science and Math. They added questions about Art, Sports, and Geography so the test measured general knowledge, not just one specific skill.
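The deduplication step is easy to picture in code. Below is a toy sketch that flags near-duplicate question pairs using plain string similarity; the actual pipeline's method and threshold are not described here, so the function name and the 0.8 cutoff are illustrative assumptions.

```python
from difflib import SequenceMatcher

def near_duplicates(questions, threshold=0.8):
    """Flag question pairs whose surface forms are nearly identical --
    a rough stand-in for the benchmark's redundancy filtering."""
    dupes = []
    for i in range(len(questions)):
        for j in range(i + 1, len(questions)):
            ratio = SequenceMatcher(
                None, questions[i].lower(), questions[j].lower()
            ).ratio()
            if ratio >= threshold:
                dupes.append((i, j, round(ratio, 2)))
    return dupes

qs = [
    "When was the city of Springfield founded?",
    "When was the city of Riverside founded?",
    "Who painted the Mona Lisa?",
]
# The two templated "founded" questions get flagged; the unrelated one does not.
print(near_duplicates(qs))
```

A production version would also cluster by topic and answer type, but surface similarity alone already catches templated variants like the first two questions.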

2. The "Fuzzy Math" Problem

One of the biggest headaches in grading AI is numbers.

  • The Old Way: If the answer was "8,282" and the AI said "8,333," the old grading system would mark it Wrong. That's like failing a student because they estimated the distance to the moon as 238,900 miles instead of 238,855 miles.
  • The New Way: The researchers introduced a "Margin of Error" rule. Now, if the answer is a number, they give the AI a little wiggle room (like a target with a bullseye and a surrounding ring). If the AI hits the ring, it gets credit. This makes the test fairer and less about memorizing exact digits and more about knowing the right ballpark.
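The "ring around the bullseye" idea boils down to a relative-tolerance check. Here is a minimal sketch; the 1% margin is an assumption for illustration, not the benchmark's exact rule.

```python
def numeric_match(predicted: float, gold: float, rel_tol: float = 0.01) -> bool:
    """Accept a numeric answer that lands within a relative margin of the
    gold value -- the ring around the bullseye. The 1% margin is an
    illustrative choice, not the benchmark's documented setting."""
    if gold == 0:
        return predicted == 0
    return abs(predicted - gold) / abs(gold) <= rel_tol

print(numeric_match(8_333, 8_282))  # True: off by ~0.6%, inside the ring
print(numeric_match(9_000, 8_282))  # False: off by ~8.7%, outside it
```

A relative (rather than absolute) margin is what makes the moon-distance example work: 45 miles out of 238,855 is a rounding error, while 45 out of 100 is a blunder.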

3. The "AI Teacher" Upgrade

To grade these tests, they use another AI (an "autorater") to act as the teacher.

  • The Problem: The old teacher was easily confused. If the AI student gave the right answer but added a long, polite explanation or said, "I think it's X, but maybe Y," the old teacher would get confused and mark it wrong.
  • The Fix: They gave the teacher a new rulebook. Now, the teacher is told: "Ignore the fluff. Just look for the core answer. If the student hedges their bets (says 'maybe'), mark it as 'not attempted' so they can't game the system by guessing wildly."
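The real grader is an LLM autorater with a revised prompt, but the three grading categories it produces can be illustrated with a toy keyword-based stand-in. The hedge list and matching logic below are made up for illustration and are far cruder than what an LLM does.

```python
import re

# A made-up list of hedge phrases; the real autorater is an LLM, not a regex.
HEDGES = re.compile(r"\b(maybe|perhaps|possibly|i think|not sure)\b",
                    re.IGNORECASE)

def grade(response: str, gold: str) -> str:
    """Toy three-way grader illustrating the autorater's categories:
    CORRECT, INCORRECT, or NOT_ATTEMPTED (for hedged answers)."""
    if HEDGES.search(response):
        return "NOT_ATTEMPTED"  # hedged guesses earn no credit either way
    return "CORRECT" if gold.lower() in response.lower() else "INCORRECT"

print(grade("It was founded in 1898.", "1898"))             # CORRECT
print(grade("I think it's 1898, but maybe 1900.", "1898"))  # NOT_ATTEMPTED
print(grade("It was founded in 1905.", "1898"))             # INCORRECT
```

Routing hedged answers to NOT_ATTEMPTED is the key design choice: a model can no longer pad its score by listing several candidate answers and hoping one sticks.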

The Results: Who Passed?

They ran the new, stricter, fairer test on the world's smartest AI models (like GPT-5, Claude Opus, and Gemini).

  • The Winner: Gemini 2.5 Pro took the top spot with a score of 55.6%.
  • The Context: Don't panic if 55% sounds low! Remember, this is a hard test. It's like a PhD qualifying exam. Getting 55% means the AI knows a lot, but on the remaining questions it either answers incorrectly (hallucinates) or declines to answer at all.
  • The Takeaway: The other models (like GPT-5 and Claude) did well, but Gemini 2.5 Pro consistently came out on top on this specific, cleaned-up test.
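For the curious, the F1 score reported here follows the SimpleQA convention: the harmonic mean of overall accuracy (correct out of all questions) and precision on attempted questions (correct out of attempted). The counts in the sketch below are invented purely to show how the arithmetic works.

```python
def simpleqa_f1(correct: int, incorrect: int, not_attempted: int) -> float:
    """Harmonic mean of overall accuracy (correct / all questions) and
    precision on attempted questions (correct / attempted), so a model
    is penalized both for wrong answers and for refusing too often."""
    if correct == 0:
        return 0.0
    total = correct + incorrect + not_attempted
    recall = correct / total                     # correct overall
    precision = correct / (correct + incorrect)  # correct given attempted
    return 2 * precision * recall / (precision + recall)

# Invented counts for a hypothetical 1,000-prompt run:
print(round(100 * simpleqa_f1(500, 300, 200), 1))  # -> 55.6
```

Note how the metric rewards calibrated abstention: a model that answers everything and gets half wrong scores lower than one that declines the questions it would have missed.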

Why Should You Care?

Think of SimpleQA Verified as a new, high-quality speedometer for AI.

  • Before, the speedometer was broken; it told you the car was going 100 mph when it was actually going 60, or it got confused by rain.
  • Now, we have a reliable speedometer. This helps researchers see real progress. Are AI models actually getting smarter at remembering facts, or are they just memorizing the answers to the old, flawed test?

By releasing this new benchmark, Google is saying: "Here is a clean, honest ruler. Let's measure the future of AI with it, so we can build systems that are truly trustworthy and don't just make things up."