Imagine you are trying to figure out whether a student actually understands a math problem or is just guessing the answer based on the words they see.
If you only look at the final answer they write down, it's hard to tell the difference: they might get the right answer by luck, or the wrong answer despite understanding the concept.
This paper introduces a new way to "peek inside the brain" of Large Language Models (LLMs)—the AI chatbots we use today. Instead of just looking at the final answer, the authors propose watching the entire journey the AI takes to get there.
Here is the breakdown of their idea, "Truth as a Trajectory," using simple analogies.
1. The Old Way: Taking a Snapshot
The Problem:
Currently, researchers try to understand an AI by taking a "snapshot" of its brain at a single moment (usually a hidden state from the middle of its processing). They ask, "Is this specific thought pattern 'toxic' or 'correct'?"
The Flaw:
The authors say this is like trying to judge a movie by looking at just one frame.
- If the AI sees the word "poison," a snapshot might scream "DANGER!" even if the sentence is "The poison ivy is dangerous to touch" (which is a safe, educational sentence).
- The AI's brain is messy. It blends facts, grammar, and surface wording together in the same signals, so a single snapshot is too cluttered to tell whether the AI is actually reasoning or just repeating a pattern it memorized. (A minimal sketch of this kind of snapshot probe follows below.)
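To make the "snapshot" approach concrete, here is a minimal sketch of single-layer linear probing, the standard technique the authors are contrasting against. The hidden states below are synthetic stand-ins (real ones would come from an actual model), and the sizes and labels are placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
width, n_examples = 64, 200  # placeholder sizes

# Pretend these are mid-layer activations for 200 prompts, each hand-labeled
# 0 = safe or 1 = toxic. Real probing would extract them from an actual model.
hidden_states = rng.normal(size=(n_examples, width))
labels = rng.integers(0, 2, size=n_examples)

# The "snapshot": a single linear classifier reading one frozen moment of
# the model's internal state.
probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)
print("probe accuracy on its own training data:", probe.score(hidden_states, labels))
```

Because the probe only ever sees one frozen vector, anything tangled into that vector (a scary word, a memorized phrase) can fool it.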
2. The New Way: Watching the Movie (The Trajectory)
The Solution:
The authors suggest we shouldn't look at a single frame. Instead, we should watch the whole movie of how the AI's thoughts change from the first word to the last, layer by layer.
They call this "Truth as a Trajectory."
The Analogy: The Hiker vs. The Drunkard
Imagine two people trying to walk from the bottom of a hill to the top (the "correct answer").
- The Hiker (Correct Reasoning): They walk in a smooth, steady path. They might zigzag a little to avoid rocks, but their overall direction is consistent. They are making progress toward the goal.
- The Drunkard (Spurious Reasoning): They stumble, spin in circles, take giant steps backward, and then lurch forward. Their path is jagged, chaotic, and full of sharp, sudden turns.
The paper argues that correct reasoning leaves a smooth, geometric "footprint" in the AI's brain as it processes information. Incorrect reasoning (or hallucinations) leaves a jagged, chaotic footprint.
3. How They Did It: Measuring the "Steps"
Instead of asking, "What is the AI thinking right now?" they asked, "How did the AI's thinking change from the last step to this one?"
- Displacement: They measured the "step" the AI's internal state took from one layer to the next.
- Velocity & Curvature: They measured how big each step was and how sharply the path turned between steps (both are sketched in code below).
They found that when an AI is reasoning correctly, its internal "steps" are smooth and consistent. When it is guessing or lying, its internal steps are jerky and erratic.
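To show what these measurements look like in practice, here is a minimal sketch, assuming the "trajectory" is simply the stack of hidden states a token passes through, one vector per layer. The array sizes are placeholders, the random walk stands in for real activations, and the curvature proxy (angle between consecutive steps) is one common choice rather than necessarily the paper's exact definition.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 32, 64  # placeholder sizes
# Synthetic stand-in for the per-layer hidden states of one token.
states = np.cumsum(rng.normal(size=(n_layers, width)), axis=0)

displacements = np.diff(states, axis=0)             # the "step" between layers
velocities = np.linalg.norm(displacements, axis=1)  # how big each step is

# Curvature proxy: the angle between consecutive steps. 0 means the path
# went straight ahead; values near pi mean a sharp reversal.
unit = displacements / velocities[:, None]
cosines = np.clip(np.sum(unit[:-1] * unit[1:], axis=1), -1.0, 1.0)
turn_angles = np.arccos(cosines)

print("mean step size:     ", velocities.mean())
print("mean turn (radians):", turn_angles.mean())
```

The hiker from the analogy would show steady step sizes and small turn angles; the drunkard would show wild swings in both.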
4. The Results: Why This Matters
The researchers tested this across many different tasks, including:
- Logic Puzzles: Can the AI solve a riddle?
- Toxicity: Is the AI being mean?
The Big Win:
The old methods (the "snapshots") were easily tricked. If you changed the words slightly, they failed.
The new "Trajectory" method was like a super-detective.
- It could tell the difference between someone saying a bad word (like quoting a villain in a story) and someone intending to be bad.
- It worked even when the AI was talking about a completely new topic it hadn't seen before. It recognized the shape of good reasoning, not just the specific words (a toy version of this idea is sketched below).
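As an illustration of "recognizing the shape of good reasoning," the sketch below summarizes each trajectory with a few geometric features (step size and turning angle) and trains a classifier on those features alone. The smooth and erratic random walks are synthetic stand-ins for correct and spurious reasoning; this is a toy version of the idea, not the paper's actual detector.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def trajectory_features(states):
    # Reduce a (layers x width) trajectory to shape features: typical step
    # size, step-size variability, and average turning angle between steps.
    steps = np.diff(states, axis=0)
    speed = np.linalg.norm(steps, axis=1)
    unit = steps / speed[:, None]
    cosines = np.clip(np.sum(unit[:-1] * unit[1:], axis=1), -1.0, 1.0)
    return [speed.mean(), speed.std(), np.arccos(cosines).mean()]

def make_trajectory(jitter, n_layers=32, width=64):
    # A smooth walk (small jitter) stands in for correct reasoning; an
    # erratic walk (large jitter) stands in for spurious reasoning.
    heading = rng.normal(size=width)
    steps = heading + jitter * rng.normal(size=(n_layers - 1, width))
    return np.cumsum(steps, axis=0)

X = [trajectory_features(make_trajectory(0.5)) for _ in range(100)] + \
    [trajectory_features(make_trajectory(5.0)) for _ in range(100)]
y = [0] * 100 + [1] * 100  # 0 = "hiker", 1 = "drunkard"

clf = LogisticRegression().fit(X, y)
print("training accuracy on the toy data:", clf.score(X, y))
```

Notice that the classifier never sees the raw states, only the geometry of the path, which is why this style of detector can transfer to topics it was never trained on.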
5. The Bottom Line
Think of the AI's brain as a factory assembly line.
- Old Method: You check the product at the end of the line. If it looks good, you assume the factory is working well.
- New Method (Truth as a Trajectory, or TaT): You watch the conveyor belt. You see whether the parts are being assembled smoothly or the machine is jamming and spitting out parts randomly.
Why is this a big deal?
It means we can build better safety systems for AI. Instead of just blocking bad words, we can detect if the AI is thinking in a dangerous or illogical way, even if it's using polite language. It helps us trust AI not just because it says the right thing, but because we can see it doing the right thing.