Imagine you are trying to solve a complex math problem with a very smart, but sometimes overconfident, robot friend.
The Problem: The "Confident Fool"
Currently, when we ask AI models (like the ones in this paper) to solve problems, we often check their "confidence" to see if they are right. Think of this like asking the robot, "Are you sure?"
Most existing methods treat confidence like a final report card. They look at the whole answer and give it one single grade: "This answer is 80% confident."
- The Flaw: A robot can write a long, rambling, confused answer that somehow ends with a very confident "The answer is 42!" The old methods might say, "Great, high confidence at the end! It must be right." But the robot was actually lost for most of the journey.
The Discovery: Watching the Journey, Not Just the Destination
The authors of this paper realized that how the robot thinks is more important than what it thinks at the end. They decided to watch the robot's "thought process" in real-time, like watching a GPS map while driving.
They found that wrong answers have a very specific, chaotic "driving style" that right answers don't have:
The "Burst Spike" (The Panic Spiral):
Imagine the robot starts driving confidently. Suddenly, it hits a bump, gets confused, then gets more confused, then even more confused. The "uncertainty meter" keeps climbing steadily. It's like a driver who realizes they are lost, speeds up, swerves, and keeps swerving harder.
- In the paper: This is called a Burst Spike. The robot's confidence keeps dropping as it generates more words.
The "Peak-Valley Spike" (The False Hope):
Imagine the robot is driving, then suddenly thinks, "Aha! I found the answer!" (Confidence goes up, uncertainty drops). But then, two seconds later, it realizes, "Wait, that doesn't make sense!" (Confidence crashes, uncertainty spikes). It's like a driver spotting a sign, turning the wheel sharply, then realizing it was the wrong turn and slamming the brakes.
- In the paper: This is called a Peak-Valley Spike. It's a "V-shape" of false confidence followed by panic.
Correct answers, on the other hand, are like a smooth highway drive. The robot knows where it's going, the "uncertainty meter" stays low and steady, and there are no sudden swerves.
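The three driving styles can be sketched as simple checks on the model's per-token entropy trace. This is an illustrative Python sketch, not the paper's actual detection rules; the window size and thresholds are made-up assumptions.

```python
def has_burst_spike(entropies, window=5, min_rise=0.5):
    """Flag a sustained climb in entropy (the 'panic spiral').

    Toy criterion (not the paper's): entropy rises at every step
    across a window and climbs by at least `min_rise` overall.
    """
    for i in range(len(entropies) - window + 1):
        seg = entropies[i:i + window]
        if all(b > a for a, b in zip(seg, seg[1:])) and seg[-1] - seg[0] >= min_rise:
            return True
    return False


def has_peak_valley_spike(entropies, drop=0.5, rebound=0.5):
    """Flag a sharp entropy dip followed by a sharp rebound:
    the 'false hope' V-shape (confidence spikes, then crashes)."""
    for i in range(1, len(entropies) - 1):
        if (entropies[i - 1] - entropies[i] >= drop
                and entropies[i + 1] - entropies[i] >= rebound):
            return True
    return False


smooth = [0.30, 0.32, 0.31, 0.33, 0.30, 0.32, 0.31]  # steady highway drive
panic = [0.3, 0.5, 0.8, 1.2, 1.7, 2.3]               # uncertainty keeps climbing
false_hope = [1.0, 1.1, 0.2, 1.3, 1.2]               # V-shape dip, then rebound
```

Feeding the three toy traces through these checks: the smooth trace trips neither detector, the steadily climbing one trips the burst-spike check, and the V-shaped one trips the peak-valley check.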
The Solution: The EDIS Score
The team created a new tool called EDIS (Entropy Dynamics Instability Score).
Think of EDIS as a "Stability Detector" for the robot's brain. Instead of just looking at the final grade, it watches the whole drive.
- Low EDIS Score: The drive was smooth. The robot was consistently confident. Likely a correct answer.
- High EDIS Score: The drive was a rollercoaster. The robot panicked, got confident, panicked again, and swerved. Likely a wrong answer.
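A minimal stand-in for such a stability detector might combine the average entropy level with the average step-to-step swing. To be clear, this is a hypothetical sketch in the same spirit, not the published EDIS formula:

```python
import statistics

def instability_score(entropies):
    """Toy stand-in for an EDIS-style score (NOT the paper's formula):
    mean entropy (how uncertain overall) plus mean absolute step change
    (how hard the uncertainty swerves from token to token)."""
    swings = [abs(b - a) for a, b in zip(entropies, entropies[1:])]
    return statistics.mean(entropies) + statistics.mean(swings)

smooth_drive = [0.30, 0.32, 0.31, 0.30, 0.33]  # low and steady -> low score
rollercoaster = [0.3, 1.5, 0.2, 1.8, 0.4]      # panics and swerves -> high score
```

Any measure with this shape rewards traces that stay low and steady, and penalizes ones that climb or oscillate.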
Why This Matters (The Magic Results)
The researchers tested this on math problems. Here is what happened:
Better Filtering (The "Sieve"):
Imagine you ask the robot to generate 16 different answers to the same math problem.
- Old Way: You pick the one that sounds the most confident at the end.
- EDIS Way: You look at the "drive logs" of all 16 answers. You throw away the ones that had the panic spirals and false hopes. You keep the smooth ones.
- Result: They improved the accuracy of the AI by 82% just by using this filter! They didn't need to teach the robot anything new; they just learned how to pick the better answers.
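The sieve can be sketched in a few lines. The scoring rule here is a hypothetical stand-in for EDIS (mean entropy plus mean step swing), and the candidate data is made up:

```python
# Hypothetical best-of-N filter: keep the candidate whose entropy trace is
# most stable, instead of the one that merely ends confidently.

def instability(trace):
    # Stand-in stability measure (not the paper's exact EDIS formula).
    swings = [abs(b - a) for a, b in zip(trace, trace[1:])]
    return sum(trace) / len(trace) + sum(swings) / len(swings)

def pick_most_stable(candidates):
    """Each candidate pairs an answer string with its per-token entropy trace."""
    return min(candidates, key=lambda c: instability(c["entropies"]))["answer"]

candidates = [
    {"answer": "41", "entropies": [0.3, 1.4, 0.2, 1.6, 0.4]},   # rollercoaster
    {"answer": "42", "entropies": [0.3, 0.32, 0.31, 0.3, 0.33]},  # smooth drive
]
```

On this toy data the filter keeps the smooth drive: `pick_most_stable(candidates)` returns `"42"`.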
Better Training (The "Coach"):
They also tried using EDIS to teach the robot.
- When the robot was learning, they told it: "If you solve a problem smoothly (Low EDIS), that's a great example, keep doing that!"
- "If you solve a problem with a panic spiral (High EDIS), that's a bad example, don't do that again."
- Result: The robot learned faster and became much better at reasoning, even without a human teacher checking every step.
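The coaching signal can be sketched the same way: convert trace stability into a reward, with no human grader in the loop. The threshold and scoring rule below are illustrative assumptions, not the paper's training recipe:

```python
def stability_reward(entropies, threshold=1.0):
    """Hypothetical self-training reward (not the paper's recipe):
    smooth, low-entropy traces earn +1.0; chaotic traces earn -1.0."""
    swings = [abs(b - a) for a, b in zip(entropies, entropies[1:])]
    score = sum(entropies) / len(entropies) + sum(swings) / len(swings)
    return 1.0 if score < threshold else -1.0

# A smooth solve is reinforced; a panic-spiral solve is discouraged.
smooth_solve = [0.30, 0.32, 0.31, 0.30]
panic_solve = [0.3, 0.9, 1.6, 2.4]
```

A reward like this could then be plugged into any standard reinforcement-learning loop in place of a human-labeled correctness signal.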
The Big Picture
This paper teaches us that reasoning isn't just about the final answer; it's about the journey.
Just like you can tell a good driver from a bad one by how smoothly they drive, not just by whether they arrived at the destination, we can tell if an AI is thinking clearly by watching how its confidence changes from word to word. EDIS is the tool that finally lets us see that "driving style," helping us build smarter, more reliable AI.