Imagine you are trying to solve a very difficult math problem. You sit down, and your brain starts working. But instead of just writing down the first idea that pops into your head, you pause. You think, "Wait, is this the right path? Or should I try a different angle?"
This paper is about teaching AI models (like the ones that power chatbots) to do exactly that: pause, evaluate their own confidence, and pick the best path forward.
Here is the breakdown of the paper's ideas using simple analogies:
1. The Problem: The "Hasty Thinker"
Current AI models are like hasty students taking a test. When they see a question, they immediately start writing the first answer that comes to mind.
- The Issue: Sometimes, that first idea is wrong. Because the AI is so fast, it doesn't stop to check if it's making sense. It just keeps writing, getting further and further down a wrong path until it runs out of time or space.
- The Old Fix: To fix this, researchers used to make the AI answer the same question 10 or 20 times and then pick the answer that appeared most often (like asking 20 friends for advice and taking a vote). This works, but it's slow and expensive, like hiring 20 people to solve one math problem.
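The "ask 20 friends and take a vote" baseline is usually called self-consistency, and it is simple enough to sketch. The `majority_vote` helper below is ours, not the paper's; in a real pipeline each answer in the list would come from sampling the model.

```python
from collections import Counter

def majority_vote(answers):
    """Self-consistency baseline: sample many full answers to the same
    question and keep the most frequent one. Hypothetical helper; a real
    pipeline would call a language model to produce each answer."""
    counts = Counter(answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# 20 sampled answers to one question; most of them agree on "42".
samples = ["42"] * 12 + ["41"] * 5 + ["40"] * 3
print(majority_vote(samples))  # prints "42"
```

The cost problem is visible right in the sketch: every element of `samples` is a full, expensive model run, and 19 of the 20 are thrown away.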
2. The New Idea: The "Confidence Compass"
The authors propose a smarter way. Instead of asking the AI to write the answer 20 times, they let it write just a few options (say, 2, 4, or 8) for each step of the thinking process.
But here is the magic trick: They don't pick the answer that looks "most popular." Instead, they ask the AI: "Which of these paths do you feel most certain about?"
- The Metaphor: Imagine you are hiking in a foggy forest.
- Old Method: You try 20 different paths at once, hoping one leads out.
- New Method: You try 4 paths for 10 minutes. Then, you check your internal "gut feeling" (confidence). One path feels solid and clear; the others feel shaky and foggy. You pick the solid one and continue. If you get to a fork in the road later, you check your gut feeling again.
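The foggy-forest loop above can be sketched in a few lines. Everything here is a stand-in: `generate_step` fakes the model with random confidences, and the function name and defaults are ours, not the authors'. The shape of the loop is the point: at each step, propose a handful of continuations and keep the one the model itself feels most certain about.

```python
import random

def generate_step(path, rng):
    """Stand-in for a language model proposing one more reasoning step.
    Returns (step_text, confidence); here the confidence is just random."""
    step = f"step-{len(path) + 1}"
    return step, rng.random()

def confidence_search(num_steps=10, beam=4, seed=0):
    """Sketch of confidence-guided step selection (our naming): at each
    step, sample `beam` candidate continuations and extend the path with
    the one the model is most confident about."""
    rng = random.Random(seed)
    path = []
    for _ in range(num_steps):
        candidates = [generate_step(path, rng) for _ in range(beam)]
        best_step, best_conf = max(candidates, key=lambda c: c[1])
        path.append(best_step)
    return path

print(confidence_search())  # one path of 10 chosen steps
```

Compare the budgets: 4 candidates per step versus 20 complete answers, and the shaky candidates are discarded immediately instead of being written out to the end.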
3. "Thoughts" vs. "Letters"
The paper makes a crucial distinction. Most AI methods judge confidence one token at a time (a token is a small chunk of text, roughly a word-piece), looking at each tiny unit in isolation.
- The Analogy: Imagine trying to navigate a city by looking at every single brick in the sidewalk. It's too noisy and confusing.
- The Solution: This paper suggests looking at whole "thoughts" (like whole sentences or logical steps). It's like looking at the whole street corner instead of individual bricks. This gives the AI a clearer picture of where it's going.
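One simple way to score a whole "thought" is to average the per-token log probabilities over the entire step before converting back to a probability. This is an illustrative aggregation under our own assumptions; the paper's exact scoring function may differ.

```python
import math

def step_confidence(token_logprobs):
    """Confidence of a whole 'thought': average the per-token log
    probabilities across the step, so one noisy token can't dominate.
    (Illustrative choice of aggregation, not necessarily the paper's.)"""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# A steady step vs. one with a couple of very uncertain tokens.
steady = [-0.2, -0.4, -0.3, -0.2]
shaky  = [-0.1, -2.5, -0.2, -2.0]
print(step_confidence(steady) > step_confidence(shaky))  # prints True
```

Averaging over the step is the "street corner" view: individual bricks (tokens) can be noisy, but the step-level score still separates a solid thought from a shaky one.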
4. The Big Discovery: "The Early Decision"
The researchers found something surprising while watching the AI think.
- The Discovery: If the AI is going to get the answer right, it usually figures out the right path very early. Its "confidence" spikes up quickly and stays high.
- The Wrong Path: If the AI is going to get it wrong, it keeps wandering, its confidence keeps dropping, and it takes a very long time to realize it's lost.
- The Lesson: You don't need to check the AI's confidence for the whole 40 steps of a problem. You only need to check it for the first few steps. Once the AI picks the right path early on, it just needs to follow it.
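The "early decision" shortcut changes the loop only slightly: branch and compare confidences for the first few steps, then commit to a single cheap continuation. The `propose` and `score` callbacks below are hypothetical stand-ins for the model; the function name and the `check_first` default are ours.

```python
import random

def early_commit_search(propose, score, num_steps, beam=4, check_first=3):
    """Sketch of the 'early decision' shortcut (our naming): spend the
    branching budget only on the first `check_first` steps, then follow
    a single continuation for the rest of the problem."""
    path = []
    for i in range(num_steps):
        if i < check_first:
            candidates = [propose(path) for _ in range(beam)]
            path.append(max(candidates, key=score))
        else:
            path.append(propose(path))  # one cheap continuation, no vote
    return path

# Toy stand-ins: each "step" is a random number, confidence = its value.
rng = random.Random(0)
path = early_commit_search(
    propose=lambda p: rng.random(),
    score=lambda s: s,
    num_steps=10,
)
print(len(path))  # prints 10; only the first 3 steps were branched
```

The savings follow directly: instead of 4 candidates at every one of the 40 steps, you pay the 4x cost only for the first handful and run the remaining steps at normal price.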
5. Does it work in other languages?
The team tested this not just in English, but also in Danish (a language with fewer data resources).
- The Result: It worked just as well! This suggests that the AI's "gut feeling" (self-certainty) isn't just about knowing English words; it's about tracking the logic of the problem, a skill that transfers across languages.
Summary: Why is this a big deal?
- It's Cheaper: You don't need to hire 20 "friends" (run 20 simulations) to get a good answer. You just need to ask the AI to check its own confidence a few times.
- It's Smarter: It stops the AI from confidently walking off a cliff. It forces the AI to pause and say, "I'm not sure about this step, let me try a different one."
- It's Efficient: By focusing only on the beginning of the thought process, you save a massive amount of computer power while getting better results.
In a nutshell: The authors taught AI to stop being a "fast talker" and start being a "careful thinker" by listening to its own internal confidence meter, especially at the very start of a problem.