Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

This paper extends the minimal pairs paradigm for evaluating language models by replacing binary grammaticality judgments with ordinal-scaled classification tasks, utilizing information-theoretic surprisal curves and entropy to assess model preferences and uncertainty across diverse domains without relying on expensive text generation.

Andrew Katz

Published 2026-03-17

Imagine you are trying to figure out what a very smart, but slightly mysterious, robot is thinking. You have two ways to do this:

  1. The "Ask and Wait" Method: You ask the robot a question, and it writes out a long, detailed answer. You then read the answer to see if it's right.
  2. The "Surprise Meter" Method: You don't ask the robot to speak. Instead, you show it a sentence that stops right before the end, and you quietly check how "surprised" the robot would be if you filled in the blank with different words.

This paper is all about upgrading the second method (the Surprise Meter) to make it a much more powerful tool for testing AI.

The Old Way: The Binary Switch

Previously, researchers used a "Minimal Pairs" test. This was like a simple on/off switch.

  • Scenario: You show the AI: "The cat sat on the ___."
  • Option A: "mat" (Grammatically correct)
  • Option B: "matte" (Grammatically weird)
  • The Test: The AI is "less surprised" by "mat" and "more surprised" by "matte." If the surprise level for "mat" is lower, the AI passes (see the sketch after this list).
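
To make this concrete, here is a minimal sketch of the binary test, assuming a Hugging Face causal LM; "gpt2" is an illustrative stand-in, not necessarily a model evaluated in the paper. Surprisal is just the negative log-probability the model assigns to each token.

```python
# The binary minimal-pairs test as a surprisal comparison (sketch).
# Assumes a Hugging Face causal LM; "gpt2" is an illustrative stand-in,
# not necessarily a model evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def total_surprisal(sentence: str) -> float:
    """Sum of -log p(token | prefix) over the sentence, in nats."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(1))
    return -token_log_probs.sum().item()

# The model "passes" if the grammatical sentence is less surprising.
print(total_surprisal("The cat sat on the mat.")
      < total_surprisal("The cat sat on the matte."))
```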

The Problem: This only works for simple "Yes/No" or "Right/Wrong" questions. It's like trying to measure the temperature of a room with a switch that only says "Hot" or "Cold." It misses all the nuance in between. Also, asking the AI to write an answer is slow, expensive, and the AI might just make up a fancy-sounding reason for a wrong answer (a "post-hoc rationalization") just to look good.

The New Way: The Ordinal Surprisal Curve

The author, Andrew Katz, suggests we stop using a simple switch and start using a slider or a dial.

Instead of asking "Is this sentence true or false?", we ask the AI to rate it on a scale of 1 to 5 (or 1 to 9).

  • The Setup: We don't let the AI write the answer. We just feed it the question and check its "surprise level" for every possible number on the scale (1, 2, 3, 4, and 5).
  • The Result: We get a Surprisal Curve (sketched in code after this list).
    • If the AI is confident, the curve looks like a sharp spike. It's very surprised by 1, 2, 4, and 5, but not surprised at all by 3. It knows the answer is 3.
    • If the AI is confused, the curve looks like a flat hill. It's equally surprised by all the numbers. It doesn't know what to think.
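
Here is a minimal sketch of how such a curve could be read off a model, reusing the `model` and `tokenizer` from the sketch above. The prompt template, the 1-to-5 scale, and the single-token assumption for each rating are illustrative choices, not the paper's exact setup.

```python
# Reading off an ordinal surprisal curve (sketch).
# Reuses `model` and `tokenizer` from the previous sketch; the prompt
# template and 1-to-5 scale are illustrative assumptions, not the
# paper's exact wording.
import torch

def surprisal_curve(prompt: str, scale=("1", "2", "3", "4", "5")) -> dict:
    """Return -log p(option | prompt) in nats for each point on the scale."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    curve = {}
    for option in scale:
        # Leading space matches how GPT-2's BPE tokenizes a word after
        # "Rating:"; assumes each rating maps to a single token.
        token_id = tokenizer.encode(" " + option, add_special_tokens=False)[0]
        curve[option] = -log_probs[token_id].item()
    return curve

prompt = ('On a scale of 1 to 5, how metaphorical is this sentence? '
          '"The words hung in the air." Rating:')
print(surprisal_curve(prompt))  # a sharp dip at one rating = a confident model
```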

The Magic Ingredient: Entropy (The "Confusion Meter")
The paper leans on a classic information-theory concept called Entropy. Think of this as a "Confusion Meter."

  • Low Entropy (Sharp Spike): The AI is sure of itself. It's like a detective who is 100% certain the butler did it.
  • High Entropy (Flat Hill): The AI is genuinely unsure. It's like a detective looking at a crime scene with no clues.

This is huge because it lets us tell the difference between an AI that is wrong but confident (a sharp spike on the wrong answer) and an AI that is rightfully confused (a flat hill).
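
Entropy falls straight out of the surprisal curve. The sketch below renormalizes the probabilities over just the five rating options before measuring entropy; whether the paper computes it exactly this way is my assumption.

```python
# Entropy as a "confusion meter" (sketch), computed from the curve above.
# Renormalizing over just the scale options is my assumption about the
# computation, not a detail confirmed by the paper.
import math

def entropy_over_scale(curve: dict) -> float:
    """Shannon entropy (nats) of the model's distribution over the scale."""
    probs = [math.exp(-s) for s in curve.values()]  # surprisal -> probability
    total = sum(probs)
    probs = [p / total for p in probs]              # renormalize over options
    return -sum(p * math.log(p) for p in probs if p > 0)

h = entropy_over_scale(surprisal_curve(prompt))
# Near 0 nats: sharp spike (confident). Near log(5) ~ 1.61 nats: flat hill.
print(f"entropy = {h:.2f} nats")
```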

Where Did They Test This?

The author tested this "Surprise Meter" on four very different types of jobs, showing it works well beyond grammar:

  1. The "What is this?" Game (SETS):

    • Task: Is a "spring" a natural thing (ecological) or a machine part (technological)?
    • Result: The AI could tell the difference. If you said "The spring in the pen," the AI was very surprised by "natural" and not surprised by "machine." If you said "The spring in the garden," it flipped. The bigger the AI model, the better it was at this.
  2. The "Cause and Effect" Detective:

    • Task: Does "Smoking causes cancer" express a real cause? What about "Ice cream sales go up when it's hot"? (That sentence only says two things co-occur; it doesn't directly assert a cause.)
    • Result: The AI gave a sharp "Yes" for smoking and a flat, confused "Maybe" for the ice cream. This shows the AI understands the difference between causing something and just happening at the same time.
  3. The "Metaphor" Spotter:

    • Task: Is "The words hung in the air" literal or figurative?
    • Result: The AI knew that for the literal version (a banner hanging), the answer was "Not metaphorical." For the figurative version (words as objects), the answer was "Very metaphorical." It could tell the difference even though the words were almost the same.
  4. The "Survey Coder":

    • Task: Reading open-ended survey answers from people about the pandemic and tagging them with themes (e.g., "Work/Life Balance").
    • Result: The AI could assign a score of how well a theme fit a specific answer. If the fit was obvious, the curve spiked. If the answer was vague, the curve flattened, telling the human researcher, "Hey, I'm not sure about this one, you should double-check it" (see the triage sketch after this list).
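
One simple way to act on that signal is an entropy threshold: accept the least-surprising label when the curve is sharp, and flag the item for a human when it is flat. The theme, prompt, and threshold below are all illustrative assumptions built on the sketches above, not values from the paper.

```python
# Entropy-threshold triage for survey coding (sketch, building on the
# functions above). The theme, prompt, and 0.8-nat threshold are all
# illustrative assumptions, not values from the paper.
def triage(prompt: str, threshold: float = 0.8) -> tuple:
    curve = surprisal_curve(prompt)
    rating = min(curve, key=curve.get)  # least-surprising rating wins
    if entropy_over_scale(curve) > threshold:
        return rating, "flag for human review"
    return rating, "accept"

response = "Working from home blurred the line between my job and my family."
prompt = (f'On a scale of 1 to 5, how well does the theme "Work/Life Balance" '
          f'fit this response? "{response}" Rating:')
print(triage(prompt))
```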

Why Should You Care?

This paper offers a faster, cheaper, and more honest way to test AI.

  • No Talking Required: The AI doesn't have to waste time writing a paragraph. We just check its internal "gut feeling" (probability).
  • Honesty Check: It stops the AI from bluffing. If an AI is confused, the "Confusion Meter" (Entropy) will go up, warning us that the answer might be shaky.
  • Nuance: It moves us away from simple "Right/Wrong" thinking to understanding how sure the AI is.

In a nutshell: Instead of asking the AI to give a speech and hoping it's telling the truth, this method listens to the AI's internal "gut reaction" to different answers. It tells us not just what the AI thinks, but how sure it is about that thought. It's like checking a weather forecast not just by looking at the "Sunny" or "Rainy" icon, but by looking at the probability percentages to see if the meteorologist is actually confident in the prediction.
