The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate

This paper reveals a striking performance gap: Large Language Models can excel at generating answers yet struggle to evaluate them, often issuing judgments that are unfaithful to their own knowledge, even in areas where they lack the competence to solve the task themselves. This challenges the assumption that generative proficiency guarantees evaluative reliability.

Juhyun Oh, Eunsu Kim, Inha Cha, Alice Oh


Imagine you hire a world-class chef to cook a gourmet meal. They are incredible at chopping vegetables, seasoning the sauce, and plating the dish. Everyone agrees they are a master of cooking.

Now, imagine you ask that same chef to act as a food critic for a blind taste test. You hand them a plate of food (which might be their own cooking or someone else's) and ask, "Is this delicious? Is it cooked correctly?"

You would expect the master chef to be the best judge possible, right? After all, they know exactly what good food tastes like.

This paper says: "Not so fast."

The researchers at KAIST and Georgia Tech discovered a strange phenomenon they call the "Generative AI Paradox." It turns out that just because an AI is amazing at creating answers (cooking), it doesn't mean it's good at checking answers (critiquing).

Here is the breakdown of their findings using simple analogies:

1. The "Chef Who Can't Taste" (The Paradox)

The researchers tested several AI models (like GPT-4 and others) on a trivia game.

  • The Cooking Task (Generation): The AI had to answer questions like, "Where was actor Nigel Hawthorne born?"
  • The Critic Task (Evaluation): The AI had to look at an answer (e.g., "Coventry") and decide if it was right or wrong.

The Shocking Result:

  • Case A: The AI correctly cooked the answer ("Coventry"), but when asked to critique that same answer, it said, "No, that's wrong!"
  • Case B: The AI cooked a wrong answer ("London"), but when asked to critique a different wrong answer, it said, "Yes, that's correct!"

It's like a chef who can perfectly bake a cake but then looks at a picture of that same cake and says, "That's definitely not a cake."
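
If you prefer to see that setup in pseudo-experimental form, here is a minimal sketch of the check. Everything in it is illustrative: query_llm is a hypothetical stand-in for whatever model API you use, and the prompts and answer parsing are my own simplification, not the paper's exact protocol.

```python
def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model of choice and return its text reply."""
    raise NotImplementedError("wire this up to a real model API")

def generation_task(question: str) -> str:
    # The "cooking" step: ask the model to answer the question directly.
    return query_llm(f"Answer concisely: {question}").strip()

def evaluation_task(question: str, candidate: str) -> str:
    # The "critic" step: ask the model to judge a candidate answer.
    return query_llm(
        f"Question: {question}\n"
        f"Proposed answer: {candidate}\n"
        "Is the proposed answer correct? Reply with exactly one word: "
        "Correct, Wrong, or Unknown."
    ).strip()

def paradox_check(question: str, gold: str, wrong_candidate: str) -> list[str]:
    """Flag the two mismatch cases described above for a single question."""
    flags = []
    generated = generation_task(question)
    got_it_right = generated.lower() == gold.lower()
    # Case A: the model produced the right answer, then rejects that same answer.
    if got_it_right and evaluation_task(question, generated) == "Wrong":
        flags.append("Case A: generated correctly but rejected its own answer")
    # Case B: the model answered wrongly, yet endorses some other wrong answer.
    if not got_it_right and evaluation_task(question, wrong_candidate) == "Correct":
        flags.append("Case B: answered wrongly and endorsed a wrong answer")
    return flags
```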

2. The "Overconfident Student" (Unfaithfulness)

The paper introduces the concept of Faithfulness. This asks: Does the AI judge based on what it actually knows?

Imagine a student taking a math test.

  • If they can't solve a problem themselves, the honest thing to do when grading it is to admit, "I don't know the answer to this."
  • But these AIs behave like overconfident students: they get the question wrong themselves, yet when grading someone else's paper they confidently mark the correct answer as "Wrong," because their grasp of their own knowledge is shaky.

The study found that these AIs often don't know what they don't know. Even when they are stumped and can't answer a question themselves, they rarely say "I don't know" when grading. Instead, they guess, and often guess confidently and incorrectly.
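
One rough way to operationalise that idea, reusing the hypothetical generation_task and evaluation_task helpers from the sketch above (the labels are mine, not the paper's exact faithfulness metric):

```python
def faithfulness_label(question: str, gold: str, candidate: str) -> str:
    # Does the model actually know the answer? Check its own generation first.
    knows_answer = generation_task(question).lower() == gold.lower()
    verdict = evaluation_task(question, candidate)  # "Correct" / "Wrong" / "Unknown"

    if not knows_answer and verdict != "Unknown":
        # It couldn't answer the question itself, yet still issued a confident
        # verdict: the "overconfident student" behaviour described above.
        return "unfaithful: judged confidently without the underlying knowledge"
    if knows_answer and verdict == "Unknown":
        return "odd: knows the answer but refuses to judge"
    return "faithful: the verdict matches what the model actually knows"
```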

3. The "Inconsistent Judge" (Lack of Reliability)

The researchers also found that the AI judges are inconsistent.

  • If you give the AI two very similar wrong answers, it might mark one as "Wrong" and the other as "I don't know."
  • It's like a referee in a soccer game who blows the whistle for a foul in one game, but ignores the exact same foul in the next game.

This inconsistency means you can't fully trust the AI to be the final judge of truth, even if it is very good at writing stories or answering questions.
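
A tiny sketch of that consistency check, again using the hypothetical evaluation_task helper: hand the judge several near-identical wrong answers and see whether the verdicts agree.

```python
def verdict_consistency(question: str, similar_wrong_answers: list[str]) -> bool:
    # Collect the distinct verdicts the judge gives across the variants.
    verdicts = {evaluation_task(question, ans) for ans in similar_wrong_answers}
    # A reliable referee blows the whistle the same way every time:
    # exactly one verdict for the whole set. More than one means inconsistency.
    return len(verdicts) == 1

# Example: two phrasings of the same wrong birthplace should get one verdict.
# verdict_consistency("Where was actor Nigel Hawthorne born?",
#                     ["London", "London, England"])
```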

The Big Takeaway

For a long time, people assumed: "If an AI is smart enough to write a great essay, it must be smart enough to grade essays."

This paper proves that assumption is false.

  • Generation is like performing a magic trick.
  • Evaluation is like understanding the mechanics of the trick.

Just because an AI can perform the trick perfectly doesn't mean it understands the mechanics well enough to spot a fake trick.

Why does this matter?
As we start using AI to grade school papers, check medical diagnoses, or fact-check news, we need to be careful. We can't just assume the AI is a "super-judge" because it's a "super-writer." We need to build better systems that check the AI's grading, because the AI might be grading on a whim rather than on actual knowledge.