Verbalizing LLM's Higher-order Uncertainty via Imprecise Probabilities

Imagine you are asking a very smart, well-read robot (a Large Language Model, or LLM) a question. You want to know not just the answer, but how sure the robot is about that answer.

Currently, if you ask the robot, "How confident are you?" it usually gives you a single number, like "I'm 80% sure." The authors of this paper argue that this single number is often a lie, or at least a very poor description of reality. It's like a weather forecaster saying, "There is an 80% chance of rain," without telling you if they are looking at a clear sky or a hurricane.

Here is the core idea of the paper, broken down with simple analogies.

1. The Problem: The "Single Number" Trap

The paper identifies three situations where asking for a single "confidence score" fails:

The Ambiguous Question: Imagine asking, "Who hosted the 2019 Cricket World Cup?"
- The Robot's Dilemma: The answer is technically "England and Wales" (co-hosts). But if you ask a human, they might say "England" or "Wales" or "The UK."
- The Failure: A standard robot might say, "I'm 90% sure the answer is England." This is misleading. The robot isn't 90% sure; it's actually confused because the question itself is fuzzy. It can't give a single number that captures this confusion.
The "Learning" Scenario (In-Context Learning): Imagine you give the robot 10 examples of a math puzzle, then ask it to solve a new one.
- The Failure: As you give it more examples, the robot gets better at solving the puzzle. Its actual error rate drops. But if you ask for its confidence, it often stays stuck at a high "uncertainty" level. It doesn't realize it has learned the pattern yet.
The Self-Reflection Trap: Imagine the robot picks an answer and then explains why it picked it.
- The Failure: Often, the robot's explanation doesn't match its confidence score. It might say, "I'm 99% sure this is right," but then give a weak, shaky reason. The numbers and the logic don't line up.

2. The Solution: The "Fuzzy Interval" (Imprecise Probabilities)

The authors propose a new way to talk to the robot. Instead of asking for a single number (a precise point), they ask the robot to give a range (an interval).

Think of it like this:

Old Way (Precise Probability): "I think the temperature is exactly 72°F." (This sounds confident, but it might be wrong.)
New Way (Imprecise Probability): "I think the temperature is somewhere between 65°F and 80°F."

This range tells you two different things at once:

First-Order Uncertainty (The "What"): Where does the range sit, and how is the probability spread across the possible answers? This is the natural randomness or ambiguity of the question itself.
Second-Order Uncertainty (The "How Sure I Am"): How wide is that range?
- If the range is narrow (68°F to 72°F), the robot is confident in its knowledge. It knows the answer well.
- If the range is wide (50°F to 90°F), the robot is admitting ignorance. It doesn't know enough to narrow it down.

3. The Creative Analogy: The Detective and the Witness

Imagine a detective (the Robot) trying to solve a crime.

The Old Method: You ask the detective, "How sure are you that John did it?"
- The detective says, "80% sure."
- Problem: You don't know why he is 80% sure. Is it because the evidence is shaky, or because the question is confusing? A single number can't tell you.
The New Method (Imprecise Probabilities): You ask, "Give me a range of how likely it is that John did it."
- Scenario A (Ambiguous Question): The detective says, "It could be anywhere from 10% to 90%."
  - Meaning: "The question is so vague (maybe 'John' refers to two different people) that I can't even narrow it down. I am uncertain about my own uncertainty."
- Scenario B (Learning from Clues): You give the detective more clues.
  - Result: The range shrinks. "Now I'm 70% to 85% sure."
  - Meaning: "I have learned enough to narrow down the possibilities. My ignorance has decreased."

4. How They Did It (The "Magic Prompt")

The researchers didn't change the robot's brain (which is often a secret "black box"). Instead, they changed the questions they asked.

They used a clever prompting technique based on an old idea from a mathematician named Bruno de Finetti. They asked the robot to act like a gambler.

The Prompt: "If you had to bet money on this answer being correct, what is the highest price you would pay to buy the bet, and what is the lowest price you would accept to sell it?"
The Result:
- If the robot is confused, it will give a huge gap between the buy and sell price (e.g., "I'd buy at $0.10 but sell at $0.90"). This wide gap represents high second-order uncertainty (ignorance).
- If the robot is sure, the gap will be tiny (e.g., "Buy at $0.85, Sell at $0.86"). This narrow gap represents low second-order uncertainty (knowledge).
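
The buy/sell gap described above can be sketched in a few lines. This is only an illustration: the JSON reply format and the helper name `price_gap` are assumptions made for this example, not the paper's actual prompt or parser.

```python
from __future__ import annotations

import json


def price_gap(reply: str) -> tuple[float, float, float]:
    """Parse a reply such as '{"buy": 0.10, "sell": 0.90}' and return
    (buy, sell, gap). The JSON reply format is a hypothetical convention."""
    prices = json.loads(reply)
    buy, sell = float(prices["buy"]), float(prices["sell"])
    # A coherent bettor never buys for more than they would sell for.
    if not 0.0 <= buy <= sell <= 1.0:
        raise ValueError(f"incoherent prices: buy={buy}, sell={sell}")
    return buy, sell, sell - buy


# The two replies below reuse the dollar figures from the text.
confused = price_gap('{"buy": 0.10, "sell": 0.90}')
confident = price_gap('{"buy": 0.85, "sell": 0.86}')
print(f"confused gap:  {confused[2]:.2f}")   # wide gap: high second-order uncertainty
print(f"confident gap: {confident[2]:.2f}")  # narrow gap: low second-order uncertainty
```

Rejecting replies where the buy price exceeds the sell price is one simple way to catch incoherent answers before they are used.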

5. Why This Matters

This method makes AI more honest and useful.

Better Decision Making: If a doctor asks an AI, "Is this tumor cancerous?" and the AI says, "I'm 90% sure," the doctor might operate. But if the AI says, "I'm 90% sure, but my confidence range is 10% to 99% because the scan is blurry," the doctor knows to get a second opinion.
Cost Effective: The paper shows this doesn't require expensive, complex computing. It just requires asking the right questions.
Fixing "Hallucinations": It helps the AI realize when it is making things up because it doesn't have enough information to narrow down its answer range.

Summary

The paper teaches us that uncertainty isn't just one thing. Sometimes we are unsure because the world is chaotic (First-Order). Sometimes we are unsure because we don't know enough (Second-Order).

By asking AI to give a range instead of a single number, we get a much clearer picture of what the AI actually knows, what it is guessing, and when it is simply confused. It turns the AI from a "confident liar" into a "humble expert."

Here is a detailed technical summary of the paper "Verbalizing LLM's Higher-order Uncertainty via Imprecise Probabilities."

1. Problem Statement

Large Language Models (LLMs) are increasingly deployed in high-stakes scenarios requiring reliable uncertainty quantification (UQ). Current methods, primarily vanilla verbalized uncertainty (where models output a single confidence score, e.g., "80% confident"), suffer from systematic failure modes:

Ambiguity: They fail to distinguish between questions with a single correct answer and ambiguous questions with multiple valid interpretations.
In-Context Learning (ICL): They fail to reflect reduced uncertainty as more context examples are provided (prediction error drops, but reported uncertainty remains flat).
Self-Reflection: They often fail to align the model's selected answer with the utility implied by its reported confidence, violating Bayesian rationality.
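
To make the self-reflection failure concrete, here is a minimal sketch of a consistency check. It assumes a 0-1 utility (one point for a correct answer, zero otherwise), under which the expected-utility-maximizing choice is simply the answer with the highest reported probability. The function names and the numbers are hypothetical.

```python
from __future__ import annotations


def bayes_choice(probs: dict[str, float]) -> str:
    """Answer that maximizes expected utility under a 0-1 utility (assumption):
    this is the answer with the highest reported probability."""
    return max(probs, key=probs.get)


def is_consistent(selected: str, probs: dict[str, float]) -> bool:
    """True if the model's selected answer matches its own reported confidences."""
    return selected == bayes_choice(probs)


# Hypothetical reported confidences for illustration only.
reported = {"England": 0.35, "England and Wales": 0.55, "Wales": 0.10}
print(is_consistent("England", reported))            # selected answer disagrees with confidences
print(is_consistent("England and Wales", reported))  # selected answer agrees with confidences
```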

The core issue is the assumption that uncertainty can be fully captured by a single, precise probability (first-order uncertainty). The authors argue that this framework cannot represent higher-order uncertainty (uncertainty about the uncertainty itself), which is crucial for distinguishing between aleatoric uncertainty (inherent randomness/ambiguity) and epistemic uncertainty (lack of knowledge).

2. Methodology

The paper proposes a framework grounded in Imprecise Probabilities (IP) to elicit and quantify both first- and second-order uncertainty via prompting and post-processing.

Core Concepts

First-Order Uncertainty: Intrinsic randomness over possible responses (e.g., a question having multiple valid answers).
Second-Order Uncertainty: Indeterminacy in the probability model itself (e.g., the model is unsure which probability distribution is correct due to lack of data).
Representation: Instead of point estimates, IP uses probability intervals $[\underline{p}(y), \bar{p}(y)]$, where the width represents the degree of imprecision (ignorance).

Proposed Techniques

The authors introduce three specific prompting strategies to elicit these intervals:

DeFinetti (First-Order Refinement):
- Based on Bruno de Finetti's coherent betting interpretation.
- The model is asked to assign "buy prices" (probabilities) for each answer such that they sum to 1.0.
- A verifier enforces probability axioms (non-negativity, normalization) to ensure the output is a valid probability distribution.
ProbInt (Probability Intervals):
- The model is prompted to provide a lower bound ($\underline{p}(y)$) and an upper bound ($\bar{p}(y)$) for each answer.
- Lower bound: The smallest probability the model considers plausible.
- Upper bound: The largest probability the model considers defensible.
- Constraints ensure that the sum of lower bounds is $\le 1$ and the sum of upper bounds is $\ge 1$.
Credal Sets & Possibility Functions:
- Credal Sets: An ensemble of models (or multiple seeds) reports point probabilities; the convex hull of these distributions forms the credal set, and the per-answer minimum and maximum over the set give the interval.
- Possibility Functions: The model assigns a "plausibility" score (non-additive) to answers, allowing for "none of the above" without redistributing mass.
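
The verifier checks described for DeFinetti and ProbInt could look roughly like the sketch below. The function names, the tolerance `TOL`, and the extra per-answer check `0 <= lower <= upper <= 1` are assumptions made for illustration; the paper's verifier may differ in its details.

```python
from __future__ import annotations

TOL = 1e-6  # hypothetical numerical tolerance for rounding in model outputs


def is_valid_distribution(probs: dict[str, float], tol: float = TOL) -> bool:
    """DeFinetti-style check: non-negativity and normalization."""
    non_negative = all(p >= 0.0 for p in probs.values())
    normalized = abs(sum(probs.values()) - 1.0) <= tol
    return non_negative and normalized


def is_valid_intervals(intervals: dict[str, tuple[float, float]], tol: float = TOL) -> bool:
    """ProbInt-style check: sum of lower bounds <= 1 <= sum of upper bounds.
    The per-answer bound ordering check is an added assumption."""
    for lower, upper in intervals.values():
        if not 0.0 <= lower <= upper <= 1.0:
            return False
    sum_lower = sum(lower for lower, _ in intervals.values())
    sum_upper = sum(upper for _, upper in intervals.values())
    return sum_lower <= 1.0 + tol and sum_upper >= 1.0 - tol


# Hypothetical model outputs for illustration.
dist_ok = {"A": 0.7, "B": 0.2, "C": 0.1}
dist_bad = {"A": 0.7, "B": 0.5}
ints_ok = {"A": (0.5, 0.8), "B": (0.1, 0.4), "C": (0.0, 0.2)}
ints_bad = {"A": (0.7, 0.8), "B": (0.5, 0.6)}  # lower bounds sum to 1.2

print(is_valid_distribution(dist_ok), is_valid_distribution(dist_bad))
print(is_valid_intervals(ints_ok), is_valid_intervals(ints_bad))
```

When a check fails, a natural follow-up is to re-prompt the model with the violated constraint, although that retry policy is a design choice rather than something this summary specifies.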

Post-Processing: Maximum Mean Imprecision (MMI)

To convert intervals into a scalar uncertainty score, the authors use the Maximum Mean Imprecision (MMI) metric:

For a single answer, it is the interval width: $\bar{p}(y) - \underline{p}(y)$.
For a set of answers, it uses a tractable upper bound: $1 - \sum_{y} \underline{p}(y)$.
This metric quantifies the "gap" in the model's knowledge, serving as a proxy for epistemic uncertainty.
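
As a rough sketch of this post-processing step, the code below builds per-answer intervals from a small ensemble of point distributions and then computes the two scores. It assumes the sum in the set-level bound runs over the lower probabilities; the function names and the ensemble values are hypothetical.

```python
from __future__ import annotations


def envelope(dists: list[dict[str, float]]) -> dict[str, tuple[float, float]]:
    """Per-answer (lower, upper) over a finite set of point distributions.
    The min and max over the members equal those over their convex hull."""
    answers = dists[0].keys()
    return {a: (min(d[a] for d in dists), max(d[a] for d in dists)) for a in answers}


def width(interval: tuple[float, float]) -> float:
    """Single-answer score: upper bound minus lower bound."""
    return interval[1] - interval[0]


def set_imprecision_bound(intervals: dict[str, tuple[float, float]]) -> float:
    """Set-level score: 1 minus the sum of lower bounds (assumed reading)."""
    return 1.0 - sum(lower for lower, _ in intervals.values())


# Hypothetical ensemble of three reported distributions.
ensemble = [
    {"A": 0.6, "B": 0.3, "C": 0.1},
    {"A": 0.5, "B": 0.3, "C": 0.2},
    {"A": 0.7, "B": 0.2, "C": 0.1},
]
intervals = envelope(ensemble)
print(intervals)
print(f"width for A: {width(intervals['A']):.2f}")
print(f"set-level bound: {set_imprecision_bound(intervals):.2f}")
```

If every ensemble member reported the same distribution, both scores would be zero, which matches the intuition that a precise probability carries no second-order uncertainty.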

3. Key Contributions

First IP Instantiation for LLMs: This is the first work to concretely implement Imprecise Probabilities for verbalized uncertainty in LLMs, moving beyond single-point estimates.
Disentanglement of Uncertainty: The framework successfully separates ambiguity (first-order, irreducible) from ignorance (second-order, reducible via more data/examples).
Novel Prompting & Verification: Introduces "DeFinetti" and "ProbInt" prompts with algorithmic verifiers to ensure mathematical coherence (adherence to probability axioms).
Cost Efficiency: Unlike sampling-based methods (e.g., semantic entropy) that require hundreds of generations, the proposed verbalized methods require only 1–2 API calls per query, significantly reducing cost.

4. Experimental Results

The authors evaluated their methods on synthetic tasks and real-world QA benchmarks (MAQA, AmbigQA, MMLU-Pro).
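
Several of the results in this section are reported as AUROC. As a reminder of what that number measures, here is a minimal pure-Python sketch for correctness detection: it estimates the probability that a randomly chosen correct answer receives a lower uncertainty score than a randomly chosen incorrect one. The function name and the toy data are hypothetical and not taken from the paper.

```python
from __future__ import annotations


def auroc(scores: list[float], labels: list[int]) -> float:
    """AUROC of `scores` for separating label 1 (positive) from label 0.
    Computed as the fraction of positive/negative pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: one uncertainty score per question and whether the answer was correct.
uncertainty = [0.05, 0.10, 0.40, 0.60, 0.50]
correct = [1, 1, 0, 0, 1]

# Lower uncertainty should indicate correctness, so negate it to get a confidence score.
confidence = [-u for u in uncertainty]
result = auroc(confidence, correct)
print(f"AUROC: {result:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of correct from incorrect answers.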

Synthetic Experiments (Sequence Transformation):
- First-Order: Both vanilla and DeFinetti correctly increased uncertainty scores as "ambiguity noise" (multiple valid answers) increased.
- Second-Order: As the number of in-context examples increased (reducing epistemic uncertainty), ProbInt and Credal scores decreased, tracking the drop in prediction error. In contrast, vanilla uncertainty scores remained high and flat, failing to reflect improved performance.
- Group Uncertainty: In multi-agent settings, the Credal set approach significantly outperformed standard aggregation (majority voting) in AUROC for correctness detection.
Real-World QA Benchmarks:
- Ambiguity Detection: The DeFinetti method achieved the highest AUROC in distinguishing ambiguous vs. clear questions, outperforming Semantic Entropy and direct ambiguity prompts.
- Correctness Detection: ProbInt and Credal methods achieved state-of-the-art performance in detecting correct answers (AUROC), often outperforming sampling-based baselines like "Is-True" and "Label Probability."
- Alignment with Ground Truth: The proposed methods showed the strongest correlation (Concordance Index) with theoretical proxies for first-order (entropy) and second-order (KL divergence) uncertainty derived from corpus statistics.
- Rationality: The model's selected answers aligned best with the Maximin rule (choosing the answer with the highest lower probability), a standard decision rule under IP, validating the internal consistency of the elicitation.
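
The Maximin rule mentioned above is simple to state in code. The sketch below picks the answer with the highest lower probability; the function name and the intervals are hypothetical values chosen for illustration.

```python
from __future__ import annotations


def maximin_choice(intervals: dict[str, tuple[float, float]]) -> str:
    """Return the answer whose lower probability is highest."""
    return max(intervals, key=lambda answer: intervals[answer][0])


# Hypothetical elicited intervals (lower, upper) for each candidate answer.
elicited = {
    "England": (0.30, 0.70),
    "England and Wales": (0.45, 0.60),
    "Wales": (0.00, 0.10),
}
print(maximin_choice(elicited))
```

In this example "England" has the largest upper bound, but Maximin prefers "England and Wales" because its worst-case probability is higher. This is what makes the rule cautious under imprecision.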

5. Significance

Faithful Uncertainty Reporting: By allowing models to express "I don't know" via wide intervals rather than forcing a precise but potentially wrong confidence score, the method improves the credibility of LLMs in ambiguous or data-scarce scenarios.
Decision Support: The separation of first- and second-order uncertainty allows downstream systems to make better decisions:
- High First-Order: The question is ambiguous; the system should ask for clarification.
- High Second-Order: The model lacks knowledge; the system should retrieve more context or abstain.
Scalability: The method offers a principled alternative to expensive sampling-based uncertainty estimation, making high-quality UQ feasible for production LLM applications with low API costs.
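
The decision-support idea above can be sketched as a small routing function. Everything specific here is an assumption made for illustration: entropy (in nats) as the first-order score, one minus the sum of lower bounds as the second-order score, the two thresholds, and the choice to check second-order uncertainty first when both are high.

```python
from __future__ import annotations

import math


def entropy(probs: dict[str, float]) -> float:
    """Shannon entropy in nats; used here as the first-order score (assumption)."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


def route(
    probs: dict[str, float],
    intervals: dict[str, tuple[float, float]],
    entropy_threshold: float = 0.6,      # hypothetical threshold
    imprecision_threshold: float = 0.3,  # hypothetical threshold
) -> str:
    """Choose a downstream action from first- and second-order scores."""
    second_order = 1.0 - sum(lower for lower, _ in intervals.values())
    if second_order > imprecision_threshold:
        return "retrieve_or_abstain"  # the model lacks knowledge
    if entropy(probs) > entropy_threshold:
        return "ask_clarification"    # the question looks ambiguous
    return "answer"


# Hypothetical cases for illustration.
ambiguous = route({"England": 0.5, "Wales": 0.5},
                  {"England": (0.45, 0.55), "Wales": (0.45, 0.55)})
ignorant = route({"A": 0.9, "B": 0.1},
                 {"A": (0.20, 0.95), "B": (0.05, 0.60)})
clear = route({"A": 0.95, "B": 0.05},
              {"A": (0.90, 0.97), "B": (0.03, 0.08)})
print(ambiguous, ignorant, clear)
```

In practice the thresholds would need to be tuned on held-out data for the task at hand.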

In conclusion, the paper demonstrates that modeling LLM uncertainty through the lens of Imprecise Probabilities resolves critical failure modes of current methods, providing a more robust, coherent, and cost-effective framework for uncertainty quantification.