Position: Science of AI Evaluation Requires Item-level Benchmark Data

This position paper argues that establishing a rigorous science of AI evaluation requires the adoption of item-level benchmark data to overcome systemic validity failures, supporting this claim with cross-disciplinary analysis and the introduction of OpenEval, a new repository for evidence-centered evaluation.

Han Jiang, Susu Zhang, Xiaoyuan Yi, Xing Xie, Ziang Xiao

Published 2026-04-07

The Big Picture: The "Report Card" Problem

Imagine you are a parent trying to decide which school to send your child to. The schools hand you a single piece of paper with a final grade: "95%."

  • School A got 95% because they taught your child how to solve complex calculus problems.
  • School B got 95% because they taught your child how to guess the right answer on a multiple-choice test by spotting patterns in the font size.

If you only see the final "95%" number, you can't tell the difference. You might send your child to School B, thinking they are a math genius, only to find out they can't actually do math.

This is the current state of AI evaluation. We have AI models (like the schools) and we have "benchmarks" (the tests). Currently, we mostly look at the final score (the 95%). The authors of this paper argue that this is dangerous, especially when we want to use AI for serious things like healthcare, law, or finance.

The Core Argument: Look at the "Answer Sheet," Not Just the Grade

The paper argues that to truly understand AI, we need Item-level data.

  • Current Way: We see the total score. "The AI got 80% on the History Test."
  • Proposed Way: We see the specific questions the AI got right or wrong. "The AI got 100% on questions about the Civil War but 0% on questions about the Cold War, and it guessed 'Washington' on every question about presidents."

The authors say that without seeing the individual questions (the "items") and the specific answers the AI gave, we are flying blind. We don't know if the AI is smart, or if it just memorized the test answers.
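To make the contrast concrete, here is a minimal sketch of the two views of the same evaluation. The field names and data are invented for illustration; they are not the paper's schema.

```python
# Illustrative only: the same evaluation seen as one aggregate score
# versus as item-level records. Field names and data are invented.

item_level_results = [
    {"item_id": "hist-001", "topic": "Civil War", "model_answer": "1865",       "correct": True},
    {"item_id": "hist-002", "topic": "Cold War",  "model_answer": "Washington", "correct": False},
    {"item_id": "hist-003", "topic": "Cold War",  "model_answer": "Washington", "correct": False},
]

# Current way: collapse everything into a single number.
aggregate = sum(r["correct"] for r in item_level_results) / len(item_level_results)
print(f"Aggregate score: {aggregate:.0%}")

# Proposed way: keep every item, so we can slice by topic and inspect answers.
by_topic = {}
for r in item_level_results:
    by_topic.setdefault(r["topic"], []).append(r["correct"])
for topic, outcomes in by_topic.items():
    print(f"{topic}: {sum(outcomes)}/{len(outcomes)} correct")
```

With only the aggregate number, the "guessed Washington on every presidents question" pattern is invisible; with the item-level records, it is one loop away.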

Why Do We Need This? (The Three Big Problems)

The paper identifies three main reasons why looking only at the final score is broken:

1. The "Leaked Answer Key" Problem (Data Contamination)

Imagine a student who accidentally saw the test questions before the exam. They memorized the answers. On test day, they get a perfect score.

  • The Issue: AI models are trained on massive amounts of internet data. Often, the "test questions" (benchmarks) are already in that data. The AI isn't "thinking"; it's just recalling what it saw during training.
  • The Fix: If we have item-level data, we can look at specific questions and say, "Hey, this question looks exactly like something in the training data. Let's throw it out." Without the specific questions, we can't catch this cheating; a rough sketch of such an overlap check follows below.
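One common heuristic for this kind of check (an illustration, not a method the paper prescribes) is to look for long word sequences from a benchmark item that also appear verbatim in the training corpus:

```python
# Rough contamination check via word n-gram overlap.
# Illustrative heuristic only; real pipelines use more careful
# normalization, hashing, and fuzzy matching.

def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(benchmark_item: str, training_text: str, n: int = 8) -> bool:
    """Flag an item if any long word n-gram also appears verbatim in the training data."""
    return bool(ngrams(benchmark_item, n) & ngrams(training_text, n))

# Example: the benchmark question appears word-for-word in a training dump.
question = "In what year did the American Civil War end and who was president at the time"
training_snippet = ("Quiz archive: In what year did the American Civil War end "
                    "and who was president at the time? Answer: 1865, Abraham Lincoln.")
print(looks_contaminated(question, training_snippet))  # True -> review or drop this item
```

The point is not this particular heuristic, but that none of it is possible if all you publish is the final percentage.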

2. The "Outdated Textbook" Problem (Saturation)

Imagine a math test from 1990. Today's students would crush it because they have calculators and better teaching. The test is no longer a good measure of intelligence; it's just too easy.

  • The Issue: AI is getting smarter so fast that old benchmarks are becoming "saturated": top models all score at or near the ceiling, so the test stops telling us which one is actually better.
  • The Fix: By looking at item-level data, researchers can see which questions are too easy and replace them with harder ones, keeping the test relevant; a small sketch of this kind of difficulty check follows below.
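In classical test theory, this is the item "difficulty": the proportion of test takers (here, models) who answer an item correctly. A minimal sketch with invented numbers, flagging items that nearly every model gets right:

```python
# Per-item difficulty: fraction of evaluated models answering each item correctly.
# Items that nearly every model gets right no longer separate models.
# All numbers below are invented for illustration.

item_outcomes = {
    # item_id: list of 0/1 outcomes, one per evaluated model
    "q1": [1, 1, 1, 1, 1, 1],   # everyone gets it -> saturated
    "q2": [1, 1, 1, 1, 1, 0],
    "q3": [1, 0, 1, 0, 0, 1],   # still informative
}

SATURATION_THRESHOLD = 0.95  # illustrative cutoff

for item_id, outcomes in item_outcomes.items():
    p_correct = sum(outcomes) / len(outcomes)
    status = "saturated, consider replacing" if p_correct >= SATURATION_THRESHOLD else "still informative"
    print(f"{item_id}: p = {p_correct:.2f} ({status})")
```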

3. The "Fake Skill" Problem (Validity)

Imagine a driver's test where the car only has to drive in a straight line. If a car passes, does that mean it's a good driver? No, it just means it can drive straight.

  • The Issue: Many AI tests measure the wrong things. An AI might pass a "Reasoning" test not because it reasons, but because it found a shortcut (a "trick") to get the right answer.
  • The Fix: Item-level analysis allows researchers to use tools from Psychology (the science of testing humans) to see if a test actually measures what it claims to measure.

The Solution: Borrowing from Psychology

The paper suggests that AI researchers should stop acting like they invented testing from scratch and start borrowing from Psychometrics (the field that designs SATs, IQ tests, and medical exams).

Psychologists have spent 100 years figuring out how to analyze individual test questions. They ask:

  • "Is this question too hard?"
  • "Does this question actually measure 'logic' or just 'vocabulary'?"
  • "If a student gets this question wrong, does it mean they don't understand the concept?"

The authors say AI needs to do the same. They want to treat every AI benchmark question like a psychological test item, analyzing the data to ensure the test is fair, accurate, and actually measuring "intelligence" rather than "memorization."
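One of the simplest such tools is the item discrimination index: how strongly getting a single question right correlates with doing well on the rest of the test. A low or negative value is a warning sign that the item measures something other than the intended skill. A minimal sketch with invented data (this is standard classical test theory, not a procedure specific to this paper):

```python
# Classical item discrimination via point-biserial correlation:
# the correlation between a 0/1 item score and the total score on the rest of the test.
# Data below is invented purely for illustration.

from statistics import mean, pstdev

def point_biserial(item_scores, rest_scores):
    """Correlation between a binary item score and the remaining total score."""
    mi, mr = mean(item_scores), mean(rest_scores)
    cov = mean((i - mi) * (r - mr) for i, r in zip(item_scores, rest_scores))
    return cov / (pstdev(item_scores) * pstdev(rest_scores))

# Rows = models (or test takers); columns = items scored 0/1.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
]

for j in range(len(responses[0])):
    item = [row[j] for row in responses]
    rest = [sum(row) - row[j] for row in responses]
    print(f"item {j}: discrimination = {point_biserial(item, rest):+.2f}")
```

An item that strong models miss and weak models pass (a negative discrimination) is exactly the kind of "trick" question this analysis is designed to surface, and it can only be computed from item-level responses.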

The New Tool: OpenEval

To make this happen, the authors are launching a new repository called OpenEval.

  • Think of it like a public library of answer sheets.
  • Instead of just publishing the final score of an AI, researchers will upload the specific questions, the AI's specific answers, and the scores for each question.
  • This allows anyone in the community to dig into the data, find the "leaked" questions, spot the "tricks," and build better tests; a hypothetical example of what one of these records might look like follows below.
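The fields in this sketch are invented for illustration; they are not OpenEval's actual schema, so consult the project's own documentation for the real format.

```python
# Hypothetical example of an item-level evaluation record.
# Field names and structure are invented for illustration; OpenEval's
# actual format may differ.
import json

record = {
    "benchmark": "example-history-benchmark",
    "model": "example-model-v1",
    "item_id": "hist-002",
    "prompt": "Who was the U.S. president when the Berlin Wall fell?",
    "model_response": "Washington",
    "gold_answer": "George H. W. Bush",
    "score": 0,
    "metadata": {"topic": "Cold War", "format": "short-answer"},
}

print(json.dumps(record, indent=2))
```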

Summary: Why Should You Care?

If we don't switch to this "item-level" approach:

  1. We will trust AI too much: We might believe an AI can practice medicine because it passed a medical exam, when it may have just memorized the test answers.
  2. We will waste money: Companies will buy AI systems that perform poorly on the specific tasks they actually need.
  3. Progress will stall: We won't know what AI is actually good at, so we won't be able to improve it effectively.

The Bottom Line: The paper is a call to stop looking at the "Final Grade" and start looking at the "Answer Sheet." Only by seeing the details can we build a science of AI that is trustworthy, safe, and truly intelligent.
