Position: Science of AI Evaluation Requires Item-level Benchmark Data

This position paper argues that establishing a rigorous science of AI evaluation requires the adoption of item-level benchmark data to overcome systemic validity failures, supporting this claim with cross-disciplinary analysis and the introduction of OpenEval, a new repository for evidence-centered evaluation.

Han Jiang, Susu Zhang, Xiaoyuan Yi, Xing Xie, Ziang Xiao

Published 2026-04-07

The Big Picture: The "Report Card" Problem

Imagine you are a parent trying to decide which school to send your child to. The schools hand you a single piece of paper with a final grade: "95%."

  • School A got 95% because they taught your child how to solve complex calculus problems.
  • School B got 95% because they taught your child how to guess the right answer on a multiple-choice test by spotting patterns in the font size.

If you only see the final "95%" number, you can't tell the difference. You might send your child to School B, thinking they are a math genius, only to find out they can't actually do math.

This is the current state of AI evaluation. We have AI models (like the schools) and we have "benchmarks" (the tests). Currently, we mostly look at the final score (the 95%). The authors of this paper argue that this is dangerous, especially when we want to use AI for serious things like healthcare, law, or finance.

The Core Argument: Look at the "Answer Sheet," Not Just the Grade

The paper argues that to truly understand AI, we need Item-level data.

  • Current Way: We see the total score. "The AI got 80% on the History Test."
  • Proposed Way: We see the specific questions the AI got right or wrong. "The AI got 100% on questions about the Civil War but 0% on questions about the Cold War, and it guessed 'Washington' on every question about presidents."

The authors say that without seeing the individual questions (the "items") and the specific answers the AI gave, we are flying blind. We don't know if the AI is smart, or if it just memorized the test answers.
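To make the contrast concrete, here is a minimal sketch of the two views of the same evaluation. The field names and data are invented for illustration; they are not the paper's schema.

```python
# Illustrative only: the same evaluation seen as one aggregate score
# versus as item-level records. Field names and data are invented.

item_level_results = [
    {"item_id": "hist-001", "topic": "Civil War", "model_answer": "1865",       "correct": True},
    {"item_id": "hist-002", "topic": "Cold War",  "model_answer": "Washington", "correct": False},
    {"item_id": "hist-003", "topic": "Cold War",  "model_answer": "Washington", "correct": False},
]

# Current way: collapse everything into a single number.
aggregate = sum(r["correct"] for r in item_level_results) / len(item_level_results)
print(f"Aggregate score: {aggregate:.0%}")

# Proposed way: keep every item, so we can slice by topic and inspect answers.
by_topic = {}
for r in item_level_results:
    by_topic.setdefault(r["topic"], []).append(r["correct"])
for topic, outcomes in by_topic.items():
    print(f"{topic}: {sum(outcomes)}/{len(outcomes)} correct")
```

With only the aggregate number, the "guessed Washington on every presidents question" pattern is invisible; with the item-level records, it is one loop away.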

Why Do We Need This? (The Three Big Problems)

The paper identifies three main reasons why looking only at the final score is broken:

1. The "Leaked Answer Key" Problem (Data Contamination)

Imagine a student who accidentally saw the test questions before the exam. They memorized the answers. On test day, they get a perfect score.

  • The Issue: AI models are trained on massive amounts of internet data. Often, the "test questions" (benchmarks) are already in that data. The AI isn't "thinking"; it's just recalling what it saw during training.
  • The Fix: If we have item-level data, we can look at specific questions and say, "Hey, this question looks exactly like something in the training data. Let's throw it out." Without the specific questions, we can't catch this cheating; a rough sketch of such an overlap check follows below.
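One common heuristic for this kind of check (an illustration, not a method the paper prescribes) is to look for long word sequences from a benchmark item that also appear verbatim in the training corpus:

```python
# Rough contamination check via word n-gram overlap.
# Illustrative heuristic only; real pipelines use more careful
# normalization, hashing, and fuzzy matching.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_text: str, n: int = 8) -> bool:
    """Flag an item if any long word n-gram also appears verbatim in the training data."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_text, n))

# Example: the benchmark question appears word-for-word in a training dump.
question = "In what year did the American Civil War end and who was president at the time"
training_snippet = ("Quiz archive: In what year did the American Civil War end "
                    "and who was president at the time? Answer: 1865, Abraham Lincoln.")
print(looks_contaminated(question, training_snippet))  # True -> review or drop this item
```

The point is not this particular heuristic, but that none of it is possible if all you publish is the final percentage.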

2. The "Outdated Textbook" Problem (Saturation)

Imagine a math test from 1990. Today's students would crush it because they have calculators and better teaching. The test is no longer a good measure of intelligence; it's just too easy.

  • The Issue: AI is getting smarter so fast that old benchmarks are becoming "saturated": top models all score at or near the ceiling, so the test stops telling us which one is actually better.
  • The Fix: By looking at item-level data, researchers can see which questions are too easy and replace them with harder ones, keeping the test relevant; a small sketch of this kind of difficulty check follows below.
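In classical test theory, this is the item "difficulty": the proportion of test takers (here, models) who answer an item correctly. A minimal sketch with invented numbers, flagging items that nearly every model gets right:

```python
# Per-item difficulty: fraction of evaluated models answering each item correctly.
# Items that nearly every model gets right no longer separate models.
# All numbers below are invented for illustration.

item_outcomes = {
    # item_id: list of 0/1 outcomes, one per evaluated model
    "q1": [1, 1, 1, 1, 1, 1],   # everyone gets it -> saturated
    "q2": [1, 1, 1, 1, 1, 0],
    "q3": [1, 0, 1, 0, 0, 1],   # still informative
}

SATURATION_THRESHOLD = 0.95  # illustrative cutoff

for item_id, outcomes in item_outcomes.items():
    p_correct = sum(outcomes) / len(outcomes)
    status = "saturated, consider replacing" if p_correct >= SATURATION_THRESHOLD else "still informative"
    print(f"{item_id}: p = {p_correct:.2f} ({status})")
```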

3. The "Fake Skill" Problem (Validity)

Imagine a driver's test where the car only has to drive in a straight line. If a car passes, does that mean it's a good driver? No, it just means it can drive straight.

  • The Issue: Many AI tests measure the wrong things. An AI might pass a "Reasoning" test not because it reasons, but because it found a shortcut (a "trick") to get the right answer.
  • The Fix: Item-level analysis allows researchers to use tools from Psychology (the science of testing humans) to see if a test actually measures what it claims to measure.

The Solution: Borrowing from Psychology

The paper suggests that AI researchers should stop acting like they invented testing from scratch and start borrowing from Psychometrics (the field that designs SATs, IQ tests, and medical exams).

Psychologists have spent 100 years figuring out how to analyze individual test questions. They ask:

  • "Is this question too hard?"
  • "Does this question actually measure 'logic' or just 'vocabulary'?"
  • "If a student gets this question wrong, does it mean they don't understand the concept?"

The authors say AI needs to do the same. They want to treat every AI benchmark question like a psychological test item, analyzing the data to ensure the test is fair, accurate, and actually measuring "intelligence" rather than "memorization."
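One of the simplest such tools is the item discrimination index: how strongly getting a single question right correlates with doing well on the rest of the test. A low or negative value is a warning sign that the item measures something other than the intended skill. A minimal sketch with invented data (this is standard classical test theory, not a procedure specific to this paper):

```python
# Classical item discrimination via point-biserial correlation:
# the correlation between a 0/1 item score and the total score on the rest of the test.
# Data below is invented purely for illustration.

from statistics import mean, pstdev

def point_biserial(item_scores, rest_scores):
    """Correlation between a binary item score and the remaining total score."""
    mi, mr = mean(item_scores), mean(rest_scores)
    cov = mean((i - mi) * (r - mr) for i, r in zip(item_scores, rest_scores))
    return cov / (pstdev(item_scores) * pstdev(rest_scores))

# Rows = models (or test takers); columns = items scored 0/1.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]

for j in range(len(responses[0])):
    item = [row[j] for row in responses]
    rest = [sum(row) - row[j] for row in responses]
    print(f"item {j}: discrimination = {point_biserial(item, rest):+.2f}")
```

An item that strong models miss and weak models pass (a negative discrimination) is exactly the kind of "trick" question this analysis is designed to surface, and it can only be computed from item-level responses.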

The New Tool: OpenEval

To make this happen, the authors are launching a new repository called OpenEval.

  • Think of it like a public library of answer sheets.
  • Instead of just publishing the final score of an AI, researchers will upload the specific questions, the AI's specific answers, and the scores for each question.
  • This allows anyone in the community to dig into the data, find the "leaked" questions, spot the "tricks," and build better tests; a hypothetical example of what one of these records might look like follows below.
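The fields in this sketch are invented for illustration; they are not OpenEval's actual schema, so consult the project's own documentation for the real format.

```python
# Hypothetical example of an item-level evaluation record.
# Field names and structure are invented for illustration; OpenEval's
# actual format may differ.
import json

record = {
    "benchmark": "example-history-benchmark",
    "model": "example-model-v1",
    "item_id": "hist-002",
    "prompt": "Who was the U.S. president when the Berlin Wall fell?",
    "model_response": "Washington",
    "gold_answer": "George H. W. Bush",
    "score": 0,
    "metadata": {"topic": "Cold War", "format": "short-answer"},
}

print(json.dumps(record, indent=2))
```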

Summary: Why Should You Care?

If we don't switch to this "item-level" approach:

  1. We will trust AI too much: We might believe an AI can practice medicine because it passed a medical exam, when it may have just memorized the test answers.
  2. We will waste money: Companies will buy AI systems that perform poorly on the specific tasks they actually need.
  3. Progress will stall: We won't know what AI is actually good at, so we won't be able to improve it effectively.

The Bottom Line: The paper is a call to stop looking at the "Final Grade" and start looking at the "Answer Sheet." Only by seeing the details can we build a science of AI that is trustworthy, safe, and truly intelligent.
