This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.
Imagine you are the principal of a very strict school, and you've hired a team of super-smart, tireless robots (Large Language Models, or LLMs) to help your human teachers grade thousands of physics exams. You want to know: Can we trust these robots to give fair grades, or will they just be making things up?
This paper is like a massive, rigorous "stress test" for these robot graders. The researchers didn't just ask, "Are the robots smart?" Instead, they asked, "Do the robots know how to grade?"
Here is the story of their findings, broken down into simple analogies.
The Three Types of Exams
The researchers tested the robots on three very different kinds of physics homework, which they call "assessment formats." Think of these as three different types of puzzles:
- The Math Puzzle (Structured Questions): These are standard physics problems with a specific answer, like "Calculate the force of gravity on this rock." There is a right way and a wrong way.
- The Essay (Written Arguments): Students have to write a paragraph explaining a concept. There isn't just one "right" sentence; it's about flow, logic, and style. It's very subjective.
- The Graph (Scientific Plots): Students write code to draw a chart. The chart either looks right (correct axes, labels, data) or it looks wrong. It's visual but follows strict rules.
The "Blind" Test vs. The "Cheat Sheet" Test
The researchers tested the robots in different scenarios:
- Blind: The robot sees the student's answer but has no idea what the correct answer is. It has to figure it out from scratch.
- With the Cheat Sheet (Solution): The robot is given the teacher's official answer key.
- The Trap (False Solution): The robot is given a fake answer key that looks real but is actually wrong. This tests if the robot is smart enough to spot the error or if it just blindly follows instructions.
What Happened? The Results
1. The Math Puzzles: The Robots are Great! 🧮
When grading the "Math Puzzles" (structured questions), the robots were surprisingly good.
- Without a cheat sheet: They could still tell a good answer from a bad one, and their rankings matched the human teachers' rankings about 60–70% of the time.
- With the cheat sheet: They became almost perfect.
- The Trap: When given a fake answer key, the robots' scores went haywire. They gave bad grades to good answers just because the "cheat sheet" said so. This proves they are very obedient but sometimes lack independent critical thinking.
The Analogy: Imagine a robot grading a math test. If you give it the answer key, it's a genius. If you give it a fake answer key that says "2+2=5," it will happily mark a student who wrote "4" as wrong. It follows the rules too strictly.
2. The Graphs: The Robots are Visual Artists 🎨
When grading the "Scientific Plots," the robots were amazing.
- They could look at a chart and instantly tell if the axes were labeled correctly or if the data was messy.
- They agreed with human teachers almost perfectly.
- Why? Because a graph is either "clean and correct" or "messy and wrong." It's easy to see the rules.
The Analogy: Think of grading a graph like checking if a car is parked in a parking spot. It's either in the lines or it isn't. The robots are excellent at spotting if the car is in the lines.
3. The Essays: The Robots are Lost in the Fog 🌫️
This is where things got weird. When grading the "Essays," the robots failed completely.
- The Problem: Even the human teachers couldn't agree on who wrote the best essay. One teacher gave an essay an 80; another gave it a 60. The essays were just too subjective.
- The Robot's Reaction: The robots tried to guess what the humans wanted.
- Blind: They were harsh and random.
- With Examples (Anchoring): The researchers showed the robots examples of "good" and "bad" essays. Suddenly, the robots' scores looked perfectly aligned with the humans. They gave the exact same average score!
- The Catch: Even though their average score was right, they still couldn't tell which specific essay was better. They were just mimicking the distribution of scores, not actually judging quality.
The Analogy: Imagine a robot trying to judge a cooking contest where the judges can't agree on what "delicious" means. If you show the robot a picture of a "good" dish and a "bad" dish, the robot might learn to give everyone a "medium" score to look safe. It looks like it's doing a good job because the average is right, but it has no idea which dish is actually better.
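The gap between "matching the average" and "matching the ranking" can be made concrete with a toy calculation. The scores below are invented for illustration, not taken from the paper: two graders can have the exact same average score while ranking the essays in the opposite order, and a rank correlation exposes the mismatch that the average hides.

```python
from statistics import mean

def rank(scores):
    """Map each score to its rank (1 = lowest). No ties in this toy data."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for pos, i in enumerate(order):
        ranks[i] = pos + 1
    return ranks

def spearman(a, b):
    """Spearman rank correlation: +1 = same ranking, -1 = inverted ranking."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Hypothetical essay scores: the human and the robot agree perfectly on
# the average, yet rank the four essays in exactly the opposite order.
human = [60, 70, 80, 90]
robot = [90, 80, 70, 60]

print(mean(human), mean(robot))   # same average: 75 and 75
print(spearman(human, robot))     # ranking is inverted: -1.0
```

A grader that only mimics the score distribution passes the "average" check while failing the "which essay is better" check, which is exactly the trap the paper describes.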
The Big Discovery: "Criterion-Referenceability"
The researchers came up with a fancy term to explain all this: Criterion-Referenceability.
Let's translate that to plain English: "Can you write down the rules clearly?"
- High Criterion-Referenceability (Math & Graphs): You can write a rulebook: "If the number is 5, give 10 points. If the graph has no title, give 0 points." The robots love this. They can follow the rulebook perfectly.
- Low Criterion-Referenceability (Essays): You cannot write a clear rulebook. You have to say, "This essay feels deep and insightful." This relies on "holistic judgment" (a gut feeling). The robots are terrible at gut feelings.
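The "rulebook" idea behind high criterion-referenceability can be sketched as code. Everything here is a made-up illustration (the criteria, point values, and plot fields are hypothetical, not the paper's actual rubric): each criterion is a yes/no check with fixed points, so any grader that applies the rulebook, human or robot, arrives at the same score.

```python
# A hypothetical rubric for grading a student's plot. Each criterion is a
# yes/no check with a fixed point value, so the final score is fully
# determined by the rulebook -- this is what makes the task easy to grade.
RUBRIC = [
    ("axes are labeled", lambda plot: bool(plot["xlabel"] and plot["ylabel"]), 4),
    ("plot has a title", lambda plot: bool(plot["title"]), 2),
    ("data is present",  lambda plot: len(plot["points"]) > 0, 4),
]

def grade(plot):
    """Apply each criterion and sum the points for the checks that pass."""
    return sum(pts for _, check, pts in RUBRIC if check(plot))

good = {"xlabel": "t (s)", "ylabel": "v (m/s)",
        "title": "Free fall", "points": [(0, 0), (1, 9.8)]}
bad = {"xlabel": "", "ylabel": "", "title": "", "points": []}

print(grade(good))  # 10: every criterion passes
print(grade(bad))   # 0: no criterion passes
```

No such checklist exists for "this essay feels deep and insightful," which is why the essay task resists both robot graders and human agreement.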
The "Human Baseline" Surprise
The most important finding is about the human teachers.
In the essay section, the human teachers themselves couldn't agree on the rankings. If humans can't agree, how can we expect a robot to?
The paper argues that when a robot's scores look "perfectly aligned" with humans on essays, it's often a trick. The robot is just copying the pattern of the human scores (like averaging them out) without actually understanding the work. It's like a student who copies the answer sheet's average score instead of solving the problems.
The Takeaway for Teachers and Parents
So, should we let robots grade physics?
- Yes, for Math and Graphs: If the task has clear rules (like a math problem or a coding graph), robots are great assistants. They can do the boring grading quickly and fairly.
- No, for Essays (yet): If the task requires a "gut feeling" or creative judgment, robots are dangerous. They might look like they are grading fairly, but they are actually just guessing based on patterns.
- The Golden Rule: Before you hire a robot to grade, ask: "Can a human teacher grade this consistently?" If the humans can't agree, the robot definitely won't either.
In short: Robots are excellent at following a recipe (Math/Graphs), but they are terrible at judging a work of art (Essays) unless the art is actually just following a recipe. If the task is too vague, the robot will just pretend to be smart, and that's a risk we can't take in education.