Same Input, Different Scores: A Multi-Model Study on the Inconsistency of LLM Judges

This study reveals that LLMs used as automated judges exhibit significant scoring inconsistencies across different models, temperatures, and repeated runs, challenging their reliability for enterprise workflows and highlighting the need for robust monitoring and hybrid evaluation strategies.

Fiona Lau

Published 2026-03-06

Imagine you have hired five different super-tasters to judge a new batch of cookies. You give them all the exact same cookie and ask them to rate it on a scale of 0 to 10 for three things: Taste, Freshness, and How Full the Box Is.

You expect that if you ask the same taster to judge the same cookie ten times in a row, they will give you the same score every time. You also expect that if you ask five different tasters to judge the same cookie, their scores should be somewhat similar.

This paper is basically a report card on what happens when you actually try this experiment with AI models (the "tasters") instead of humans. The author, Fiona Lau, discovered some very surprising and slightly worrying things.

Here is the breakdown in simple terms:

1. The "Ghost in the Machine" Problem (Inconsistency)

Even when you tell the AI, "Be as serious and boring as possible" (which is what setting the temperature to 0 means in AI terms), the AI still acts like a moody artist.

  • The Analogy: Imagine asking a friend, "How much do you like this pizza?" ten times in a row while they are staring at the same slice. You'd expect them to say "8/10" every time. But with these AIs, they might say "8/10" the first time, "6/10" the second time, and "9/10" the third time, even though nothing changed.
  • The Finding: The study found that no model was truly consistent. They gave different scores for the exact same answer, over and over again. This is a big problem for businesses that rely on these scores to make decisions.
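The repeat-run experiment is simple to sketch. The `judge_answer` function below is a hypothetical stand-in for a real LLM judge call (the study used real model APIs); seeded randomness simulates the run-to-run drift the paper observed.

```python
import random
import statistics

def judge_answer(answer: str, run: int) -> int:
    """Hypothetical stand-in for one LLM judge call.

    A real implementation would send `answer` to a model API and parse
    a 0-10 score from the reply; here, seeded randomness simulates the
    run-to-run drift the study observed.
    """
    rng = random.Random(run)
    base_score = 8  # pretend this is the answer's "true" quality
    return max(0, min(10, base_score + rng.choice([-2, -1, 0, 1])))

def repeat_run_report(answer: str, runs: int = 10) -> dict:
    """Score the same answer `runs` times and summarize the spread."""
    scores = [judge_answer(answer, run) for run in range(runs)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "spread": max(scores) - min(scores),  # 0 = perfectly consistent
    }

print(repeat_run_report("The capital of France is Paris."))
```

A spread of 0 is what you would hope for from a deterministic judge; the paper reports that no model actually achieved it.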

2. The "Strict vs. Generous" Judges (Different Models, Different Scores)

The study tested five famous AI models (from OpenAI, Google, and Anthropic). They are like five different judges on a TV competition show.

  • The Analogy: Think of Judge A as a grumpy food critic who hates everything and gives low scores. Judge B is a super-nice grandma who thinks everything is delicious and gives high scores.
  • The Finding:
    • Google's Gemini was the "Grandma." It was very generous, especially with the "Completeness" score (did the answer cover everything?).
    • Anthropic's Claude was the "Grumpy Critic." It was much stricter.
    • The Result: If you send the same answer to these different judge models, you could get a "Pass" from one and a "Fail" from the other. The choice of judge model changes the outcome of your business logic.

3. The "Completeness" Trap

The study looked at three specific criteria: Relevance (did it answer the question?), Accuracy (is it true?), and Completeness (did it miss anything?).

  • The Analogy: Imagine a student answering a math problem.
    • Relevance: They wrote about math. (Good)
    • Accuracy: The math is correct. (Good)
    • Completeness: They forgot to write the final "Therefore, x = 5." (Bad)
  • The Finding: The AI models were the worst at judging Completeness. They couldn't agree on whether an answer was "full" or "missing pieces." It was like asking five people if a glass is "half full," and they all give you different answers. This is dangerous because in business, "completeness" often determines if a customer gets a refund or if a ticket gets escalated.
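One way to make the three criteria concrete is a rubric-style judge prompt plus a strict parser. The prompt wording and the `relevance=<n>` reply format below are illustrative assumptions, not the paper's actual prompt.

```python
import re

# Illustrative wording only -- this is not the study's actual prompt.
JUDGE_PROMPT_TEMPLATE = """You are grading an answer on three criteria, each 0-10.

Question: {question}
Answer: {answer}

- Relevance: does the answer address the question?
- Accuracy: are its claims correct?
- Completeness: does it cover everything the question requires?

Reply exactly as: relevance=<n> accuracy=<n> completeness=<n>"""

def parse_scores(reply: str) -> dict:
    """Parse a judge reply into a criterion -> score mapping."""
    pairs = re.findall(r"(relevance|accuracy|completeness)=(\d+)", reply.lower())
    return {name: int(value) for name, value in pairs}

print(parse_scores("relevance=9 accuracy=8 completeness=4"))
# {'relevance': 9, 'accuracy': 8, 'completeness': 4}
```

Pinning the reply format like this makes scores machine-readable, but it does nothing to stabilize the judgment itself; per the study, the completeness number is the one most likely to change between judges.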

4. The "Temperature" Knob (Does turning it down help?)

In AI, "Temperature" is a dial that controls how random the AI is.

  • High Temperature (1.0): The AI is creative, wild, and takes risks.
  • Low Temperature (0.0): The AI is supposed to be robotic, logical, and repeatable.
  • The Analogy: It's like telling a jazz musician to "play the same note exactly the same way every time."
  • The Finding:
    • For OpenAI (GPT) and Google (Gemini), turning the dial down to 0 helped a lot. They became much more consistent.
    • For Anthropic (Claude), turning the dial down barely helped. They were still inconsistent even when told to be robotic.
    • The Lesson: You can't just "fix" inconsistency by turning a setting to zero. It depends entirely on which AI you are using.
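The temperature sweep can be sketched with a simulated judge whose residual noise differs per "model." The noise values below are made up purely to mirror the qualitative finding (temperature 0 helps some models far more than others); they are not measurements from the paper.

```python
import random
import statistics

# Illustrative residual drift that temperature 0 cannot remove.
# These numbers are assumptions, not figures from the study.
RESIDUAL_NOISE = {"gpt-like": 0.2, "gemini-like": 0.3, "claude-like": 1.5}

def simulated_judge(model: str, temperature: float, rng: random.Random) -> float:
    """Return a 0-10 score whose jitter grows with temperature plus a
    model-specific floor that turning the dial down cannot eliminate."""
    jitter = temperature * 2.0 + RESIDUAL_NOISE[model]
    return min(10.0, max(0.0, 8.0 + rng.uniform(-jitter, jitter)))

def spread_at(model: str, temperature: float, runs: int = 50) -> float:
    """Standard deviation of repeated scores at a given temperature."""
    rng = random.Random(42)
    scores = [simulated_judge(model, temperature, rng) for _ in range(runs)]
    return statistics.pstdev(scores)

for model in RESIDUAL_NOISE:
    print(model, round(spread_at(model, 1.0), 2), "->", round(spread_at(model, 0.0), 2))
```

In this toy model the "gpt-like" and "gemini-like" judges collapse to near-zero spread at temperature 0 while the "claude-like" judge does not, which is the shape of the result the study reports.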

Why Should You Care? (The Real-World Impact)

You might think, "So the AI gives a slightly different score? Who cares?"

Here is why it matters:
Imagine a bank uses an AI to decide if a loan application is "Good" or "Bad."

  • Scenario A: The AI gives a score of 85 (Approved).
  • Scenario B: The same AI, looking at the same application 10 minutes later, gives a score of 45 (Rejected).

If this happens, the bank is being unfair: one customer gets a loan while an identical customer gets rejected, just because the AI had a "bad day" or a stray random thought.

The Bottom Line

The paper concludes that we cannot blindly trust AI to be a "Judge" yet.

  • Don't trust the score: A score of 80 today might be a 60 tomorrow.
  • Watch the "Completeness": This is the most unstable metric.
  • Mix it up: If you need reliability, don't rely on just one AI. You might need a "hybrid" system where a human double-checks the AI, or you use multiple AIs to vote on the answer.
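The "multiple AIs vote" idea can be sketched in a few lines. The verdict labels and the escalate-on-tie rule below are illustrative choices, not a mechanism from the paper.

```python
from collections import Counter

def majority_verdict(verdicts: list) -> str:
    """Combine pass/fail verdicts from several independent judges.

    Ties escalate to a human reviewer instead of letting one moody
    judge (or a coin flip) decide the outcome.
    """
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "escalate-to-human"
    return ranked[0][0]

print(majority_verdict(["pass", "pass", "fail"]))  # pass
print(majority_verdict(["pass", "fail"]))          # escalate-to-human
```

Using an odd number of judges avoids most ties, and routing the remaining disagreements to a human is exactly the hybrid setup the conclusion recommends.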

In short: AI is a powerful tool, but right now, it's a bit like a weather forecast that changes its mind every time you ask it the same question. We need to learn how to live with that uncertainty before we let it make big decisions for us.