Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

This paper extends the minimal pairs paradigm for evaluating language models by replacing binary grammaticality judgments with ordinal-scaled classification tasks, utilizing information-theoretic surprisal curves and entropy to assess model preferences and uncertainty across diverse domains without relying on expensive text generation.

Andrew Katz

Published 2026-03-17

Imagine you are trying to figure out what a very smart, but slightly mysterious, robot is thinking. You have two ways to do this:

  1. The "Ask and Wait" Method: You ask the robot a question, and it writes out a long, detailed answer. You then read the answer to see if it's right.
  2. The "Surprise Meter" Method: You don't ask the robot to speak. Instead, you show it a sentence that stops right before the end, and you quietly check how "surprised" the robot would be if you filled in the blank with different words.

This paper is all about upgrading the second method (the Surprise Meter) to make it a much more powerful tool for testing AI.

The Old Way: The Binary Switch

Previously, researchers used a "Minimal Pairs" test. This was like a simple on/off switch.

  • Scenario: You show the AI: "The cat sat on the ___."
  • Option A: "mat" (Grammatically correct)
  • Option B: "matte" (Grammatically weird)
  • The Test: The AI is "less surprised" by "mat" and "more surprised" by "matte." If the surprise level for "mat" is lower, the AI passes (see the sketch after this list).
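
To make this concrete, here is a minimal sketch of the binary test, assuming a Hugging Face causal LM; "gpt2" is an illustrative stand-in, not necessarily a model evaluated in the paper. Surprisal is just the negative log-probability the model assigns to each token.

```python
# The binary minimal-pairs test as a surprisal comparison (sketch).
# Assumes a Hugging Face causal LM; "gpt2" is an illustrative stand-in,
# not necessarily a model evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def total_surprisal(sentence: str) -> float:
    """Sum of -log p(token | prefix) over the sentence, in nats."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(1))
    return -token_log_probs.sum().item()

# The model "passes" if the grammatical sentence is less surprising.
print(total_surprisal("The cat sat on the mat.")
      < total_surprisal("The cat sat on the matte."))
```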

The Problem: This only works for simple "Yes/No" or "Right/Wrong" questions. It's like trying to measure the temperature of a room with a switch that only says "Hot" or "Cold." It misses all the nuance in between. Also, asking the AI to write an answer is slow, expensive, and the AI might just make up a fancy-sounding reason for a wrong answer (a "post-hoc rationalization") just to look good.

The New Way: The Ordinal Surprisal Curve

The author, Andrew Katz, suggests we stop using a simple switch and start using a slider or a dial.

Instead of asking "Is this sentence true or false?", we ask the AI to rate it on a scale of 1 to 5 (or 1 to 9).

  • The Setup: We don't let the AI write the answer. We just feed it the question and check its "surprise level" for every possible number on the scale (1, 2, 3, 4, and 5).
  • The Result: We get a Surprisal Curve (sketched in code after this list).
    • If the AI is confident, the curve looks like a sharp spike. It's very surprised by 1, 2, 4, and 5, but not surprised at all by 3. It knows the answer is 3.
    • If the AI is confused, the curve looks like a flat hill. It's equally surprised by all the numbers. It doesn't know what to think.
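
Here is a minimal sketch of how such a curve could be read off a model, reusing the `model` and `tokenizer` from the sketch above. The prompt template, the 1-to-5 scale, and the single-token assumption for each rating are illustrative choices, not the paper's exact setup.

```python
# Reading off an ordinal surprisal curve (sketch).
# Reuses `model` and `tokenizer` from the previous sketch; the prompt
# template and 1-to-5 scale are illustrative assumptions, not the
# paper's exact wording.
import torch

def surprisal_curve(prompt: str, scale=("1", "2", "3", "4", "5")) -> dict:
    """Return -log p(option | prompt) in nats for each point on the scale."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]
    log_probs = torch.log_softmax(next_token_logits, dim=-1)
    curve = {}
    for option in scale:
        # Leading space matches how GPT-2's BPE tokenizes a word after
        # "Rating:"; assumes each rating maps to a single token.
        token_id = tokenizer.encode(" " + option, add_special_tokens=False)[0]
        curve[option] = -log_probs[token_id].item()
    return curve

prompt = ('On a scale of 1 to 5, how metaphorical is this sentence? '
          '"The words hung in the air." Rating:')
print(surprisal_curve(prompt))  # a sharp dip at one rating = a confident model
```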

The Magic Ingredient: Entropy (The "Confusion Meter")
The paper leans on a classic information-theory concept called Entropy. Think of this as a "Confusion Meter."

  • Low Entropy (Sharp Spike): The AI is sure of itself. It's like a detective who is 100% certain the butler did it.
  • High Entropy (Flat Hill): The AI is genuinely unsure. It's like a detective looking at a crime scene with no clues.

This is huge because it lets us tell the difference between an AI that is wrong but confident (a sharp spike on the wrong answer) and an AI that is rightfully confused (a flat hill).
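
Entropy falls straight out of the surprisal curve. The sketch below renormalizes the probabilities over just the five rating options before measuring entropy; whether the paper computes it exactly this way is my assumption.

```python
# Entropy as a "confusion meter" (sketch), computed from the curve above.
# Renormalizing over just the scale options is my assumption about the
# computation, not a detail confirmed by the paper.
import math

def entropy_over_scale(curve: dict) -> float:
    """Shannon entropy (nats) of the model's distribution over the scale."""
    probs = [math.exp(-s) for s in curve.values()]  # surprisal -> probability
    total = sum(probs)
    probs = [p / total for p in probs]              # renormalize over options
    return -sum(p * math.log(p) for p in probs if p > 0)

h = entropy_over_scale(surprisal_curve(prompt))
# Near 0 nats: sharp spike (confident). Near log(5) ~ 1.61 nats: flat hill.
print(f"entropy = {h:.2f} nats")
```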

Where Did They Test This?

The author tested this "Surprise Meter" on four very different types of jobs, showing it works well beyond grammar:

  1. The "What is this?" Game (SETS):

    • Task: Is a "spring" a natural thing (ecological) or a machine part (technological)?
    • Result: The AI could tell the difference. If you said "The spring in the pen," the AI was very surprised by "natural" and not surprised by "machine." If you said "The spring in the garden," it flipped. The bigger the AI model, the better it was at this.
  2. The "Cause and Effect" Detective:

    • Task: Does "Smoking causes cancer" express a real cause? What about "Ice cream sales go up when it's hot"? (That sentence only says two things co-occur; it doesn't directly assert a cause.)
    • Result: The AI gave a sharp "Yes" for smoking and a flat, confused "Maybe" for the ice cream. This shows the AI understands the difference between causing something and just happening at the same time.
  3. The "Metaphor" Spotter:

    • Task: Is "The words hung in the air" literal or figurative?
    • Result: The AI knew that for the literal version (a banner hanging), the answer was "Not metaphorical." For the figurative version (words as objects), the answer was "Very metaphorical." It could tell the difference even though the words were almost the same.
  4. The "Survey Coder":

    • Task: Reading open-ended survey answers from people about the pandemic and tagging them with themes (e.g., "Work/Life Balance").
    • Result: The AI could assign a score of how well a theme fit a specific answer. If the fit was obvious, the curve spiked. If the answer was vague, the curve flattened, telling the human researcher, "Hey, I'm not sure about this one, you should double-check it" (see the triage sketch after this list).
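
One simple way to act on that signal is an entropy threshold: accept the least-surprising label when the curve is sharp, and flag the item for a human when it is flat. The theme, prompt, and threshold below are all illustrative assumptions built on the sketches above, not values from the paper.

```python
# Entropy-threshold triage for survey coding (sketch, building on the
# functions above). The theme, prompt, and 0.8-nat threshold are all
# illustrative assumptions, not values from the paper.
def triage(prompt: str, threshold: float = 0.8) -> tuple:
    curve = surprisal_curve(prompt)
    rating = min(curve, key=curve.get)  # least-surprising rating wins
    if entropy_over_scale(curve) > threshold:
        return rating, "flag for human review"
    return rating, "accept"

response = "Working from home blurred the line between my job and my family."
prompt = (f'On a scale of 1 to 5, how well does the theme "Work/Life Balance" '
          f'fit this response? "{response}" Rating:')
print(triage(prompt))
```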

Why Should You Care?

This paper offers a faster, cheaper, and more honest way to test AI.

  • No Talking Required: The AI doesn't have to waste time writing a paragraph. We just check its internal "gut feeling" (probability).
  • Honesty Check: It stops the AI from bluffing. If an AI is confused, the "Confusion Meter" (Entropy) will go up, warning us that the answer might be shaky.
  • Nuance: It moves us away from simple "Right/Wrong" thinking to understanding how sure the AI is.

In a nutshell: Instead of asking the AI to give a speech and hoping it's telling the truth, this method listens to the AI's internal "gut reaction" to different answers. It tells us not just what the AI thinks, but how sure it is about that thought. It's like checking a weather forecast not just by looking at the "Sunny" or "Rainy" icon, but by looking at the probability percentages to see if the meteorologist is actually confident in the prediction.
