Imagine you are watching a student take a math test. You want to know if they actually solved the problem or if they are just guessing and hoping the answer looks right.
Traditionally, we've tried to judge this by looking at the final answer or asking the student, "How confident are you?" (a scalar probability). But as this paper points out, a student can be very confident while being completely wrong. They might say, "I'm 100% sure the answer is 42!" while having no idea why.
The authors of this paper propose TRACED, a new way to judge reasoning. Instead of looking only at the final answer or a confidence score, they track how the model's internal state moves while it thinks. They treat the thinking process like a physical journey through a landscape.
Here is the simple breakdown using a creative analogy:
The Analogy: The Hiker in the Fog
Imagine the Large Language Model (LLM) is a hiker trying to find a specific campsite (the correct answer) in a dense, foggy forest (the complex reasoning problem).
The paper suggests we don't just check if they arrived at the campsite. Instead, we track their footprints on the map to see how they walked. They measure two things:
1. Progress (The Distance Covered)
- The Good Hiker (Correct Reasoning): This hiker moves forward steadily. Every step takes them closer to the campsite. They don't walk in circles. If you look at their path on a map, it's a long, straight line from the start to the finish.
- In the paper: This is called High Displacement. The "thought" is moving forward, accumulating certainty.
- The Lost Hiker (Hallucination): This hiker is stuck. They walk in tight circles, backtrack, or pace in the same spot. They might take 1,000 steps, but they haven't moved an inch from where they started.
- In the paper: This is Low Displacement. The model is generating words, but the "meaning" isn't actually going anywhere.
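To make the "distance covered" idea concrete, here is a minimal sketch of one way displacement could be measured on a sequence of hidden states: net start-to-finish distance divided by total path length. The function name `net_displacement`, the toy 2-D paths, and the normalization are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def net_displacement(hidden_states):
    """Net displacement: straight-line distance from the first to the
    last state, normalized by the total path length walked.
    Near 1.0: steady forward progress. Near 0.0: walking in circles."""
    h = np.asarray(hidden_states, dtype=float)
    # Total path length: sum of the lengths of every individual step.
    path_length = np.sum(np.linalg.norm(np.diff(h, axis=0), axis=1))
    # Straight-line distance between start and finish.
    straight = np.linalg.norm(h[-1] - h[0])
    return straight / path_length if path_length > 0 else 0.0

# The "good hiker": steady forward motion along one axis.
forward = [[t, 0.0] for t in range(10)]
# The "lost hiker": pacing back and forth near the start.
pacing = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]

print(net_displacement(forward))  # 1.0
print(net_displacement(pacing))   # 0.0
```

Note that both hikers take several steps; the metric separates them because it compares where they ended up against how far they walked.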
2. Stability (The Smoothness of the Path)
- The Good Hiker: Their path is smooth. They don't suddenly swerve left, then right, then left again. They have a clear direction.
- In the paper: This is Low Curvature. The thinking is stable and logical.
- The Lost Hiker: Their path is jagged and chaotic. They swerve wildly, do a U-turn, then swerve again. They are constantly changing their mind, confused about which way to go.
- In the paper: This is High Curvature. The paper calls this a "Hesitation Loop." It's the geometric signature of the model panicking, going back to re-evaluate, and getting stuck in a loop of doubt.
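The "smoothness of the path" idea can be sketched as the average turning angle between consecutive steps of the trajectory. Again, `mean_turning_angle` and the toy paths are illustrative assumptions; the paper's actual curvature measure may be defined differently.

```python
import numpy as np

def mean_turning_angle(hidden_states):
    """Mean turning angle (radians) between consecutive steps.
    Near 0: a smooth, stable path (low curvature).
    Near pi: repeated U-turns, the 'hesitation loop' signature."""
    h = np.asarray(hidden_states, dtype=float)
    steps = np.diff(h, axis=0)  # step vectors between successive states
    angles = []
    for a, b in zip(steps[:-1], steps[1:]):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        if denom == 0:
            continue  # skip zero-length steps
        cos = np.clip(np.dot(a, b) / denom, -1.0, 1.0)
        angles.append(np.arccos(cos))
    return float(np.mean(angles)) if angles else 0.0

highway = [[t, 0.0] for t in range(10)]                       # straight line
spaghetti = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # U-turns

print(mean_turning_angle(highway))    # 0.0
print(mean_turning_angle(spaghetti))  # ~3.14 (pi: every step reverses)
```

The straight path never changes direction, so its mean angle is zero; the pacing path reverses at every step, so every turning angle is a full half-circle.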
The Big Discovery
The researchers found a clear pattern:
- Correct Answers look like a smooth, straight highway. (High Progress, Low Curvature).
- Wrong Answers (Hallucinations) look like a spaghetti noodle. (Low Progress, High Curvature).
Even if the model generates a huge amount of text (a long "thought chain"), if the path looks like spaghetti (wiggly and stuck), the answer is likely wrong. If the path looks like a highway, the answer is likely right.
Why This Matters
- It's a "Lie Detector" for AI: Current methods often get fooled by confident-sounding nonsense. This method looks at the structure of the thinking. If the AI is "stalling" or "wiggling" too much, TRACED flags it as unreliable, even if the final sentence sounds perfect.
- No Extra Training Needed: Unlike other methods that require a teacher to grade every answer, this method just looks at the internal "footprints" the AI leaves behind as it thinks. It's like judging a runner by their stride, not by a stopwatch.
- It Works Everywhere: They tested this on math problems, science questions, and even social stories. The "spaghetti vs. highway" pattern held true for all of them.
The Takeaway
The paper gives us a new lens to understand AI. Instead of asking, "Did it get the right answer?" we can now ask, "Did it walk the right path to get there?"
If the AI's thinking process is a smooth, forward-moving journey, we can trust it. If it's a frantic, circling mess, we know it's hallucinating, even if it tries to sound confident. It turns the invisible process of "thinking" into a visible map we can actually read.