Directional Reasoning Trajectory Change (DRTC): Identifying Critical Trace Segments in Reasoning Models

This paper introduces Directional Reasoning Trajectory Change (DRTC), a process-causal method that identifies critical pivot points in language model reasoning. DRTC detects distribution shifts in the reasoning trace and applies targeted interventions to measure how specific context segments steer the model's trajectory, revealing that learned reasoning pivots have a significantly stronger causal impact on outcomes than random text spans.

Waldemar Chang

Published 2026-03-03

Imagine you are watching a detective solve a complex mystery. The detective writes down a long, winding story of their thoughts: "Maybe the butler did it... no, wait, the candlestick is too heavy... oh! What if the maid left the window open?"

Sometimes, the detective circles back, crosses things out, changes their mind, and finally lands on the correct answer.

The Problem:
If you want to understand how the detective solved the case, you can't just look at the final answer. You need to know:

  1. When did they change their mind? (The "Aha!" moment).
  2. What specific clue triggered that change?
  3. Did that clue actually steer them toward the truth, or was it just noise?

Current tools for analyzing AI are like highlighting every word in the detective's story that appears in the final solution. But that doesn't tell you why the detective thought that way or when they made the critical switch.

The Solution: DRTC (Directional Reasoning Trajectory Change)
The authors of this paper invented a new tool called DRTC. Think of it as a "Thought-Steering Compass."

Here is how it works, using a simple analogy:

1. Finding the "Crossroads" (Pivot Discovery)

Imagine the detective's thought process is a long hiking trail. Most of the time, they just walk straight. But occasionally, they reach a crossroads where they hesitate, look around, and decide to turn left instead of right.

  • DRTC scans the entire trail and finds these specific crossroads (called "pivots"). These are the moments where the AI is unsure, confused, or about to change its strategy.
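The article doesn't spell out the paper's detection rule, but a minimal sketch of the idea is to flag steps where the model's next-token distribution shifts sharply from the previous step, e.g. by KL divergence. The KL measure, the threshold, and the toy trace below are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def find_pivots(step_distributions, threshold=0.5):
    """Flag reasoning steps where the next-token distribution jumps sharply."""
    return [
        t for t in range(1, len(step_distributions))
        if kl_divergence(step_distributions[t], step_distributions[t - 1]) > threshold
    ]

# Toy trace over a 3-token vocabulary: stable until an abrupt shift at step 3.
trace = [
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],  # sudden strategy change: a "crossroads"
    [0.06, 0.14, 0.80],
]
pivots = find_pivots(trace)
print(pivots)  # -> [3]
```

On a real model, `step_distributions` would come from the softmax over the vocabulary at each generated token; here it is hand-written so the shift at step 3 is visible.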

2. The "Time-Travel Test" (Causal Intervention)

Once DRTC finds a crossroads, it asks a magical question: "What if we erased a specific clue from the detective's memory right before they reached this crossroads?"

  • The Trick: It doesn't make the detective walk a new path (which would be confusing to compare). Instead, it keeps the detective walking the exact same path they already took, but it "mutes" the information from that specific clue only at the moment they are making the decision.
  • The Result: It checks whether the detective's decision at this crossroads wobbled or changed direction because that clue was missing.
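There is no runnable model here, so the sketch below fakes the forward pass: each context segment contributes additively to the next-token logits, and "muting" a segment simply drops its contribution while the token the model actually produced is held fixed (the "same path" trick). The additive-logit toy and all names are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def logprob_of_taken_token(segment_effects, taken_token, mute=None):
    """Toy stand-in for a forward pass: logits are the sum of per-segment
    contributions; muting a segment removes only its contribution."""
    logits = sum(v for i, v in enumerate(segment_effects) if i != mute)
    return float(np.log(softmax(logits)[taken_token]))

# Three context segments, each nudging a 3-token vocabulary.
segments = [
    np.array([0.1, 0.0, 0.0]),  # filler
    np.array([0.0, 0.0, 2.0]),  # the decisive "clue"
    np.array([0.2, 0.1, 0.0]),  # filler
]
taken = 2  # the token the model actually emitted at the crossroads

baseline = logprob_of_taken_token(segments, taken)
drops = [baseline - logprob_of_taken_token(segments, taken, mute=i)
         for i in range(len(segments))]
print(max(range(len(drops)), key=drops.__getitem__))  # -> 1 (the clue)
```

The segment whose removal most deflates the probability of the step the model actually took is the one that was steering the decision.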

3. The "Compass Reading" (Directional Attribution)

This is the clever part. DRTC doesn't just ask, "Did the answer change?" (Yes/No). It asks, "Did the thought process get pushed away from the correct path, or toward it?"

  • Positive Score: If removing a clue makes the AI's thought process wobble away from the correct answer, that clue was helpful. It was steering the ship in the right direction.
  • Negative Score: If removing a clue actually makes the AI's thought process straighten out and align better with the answer, that clue was distracting. It was steering the ship off-course.
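A hedged sketch of that signed score: compare the log-probability of the known-correct answer with and without a segment. Positive means the segment was pushing toward the answer; negative means it was pushing away. Using answer log-probability as the "direction" is an illustrative choice for this toy, not necessarily the paper's exact metric:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def answer_logprob(segment_effects, correct_token, mute=None):
    """Toy forward pass with additive per-segment logit contributions."""
    logits = sum(v for i, v in enumerate(segment_effects) if i != mute)
    return float(np.log(softmax(logits)[correct_token]))

def directional_score(segment_effects, correct_token, i):
    """> 0: muting segment i hurts the correct answer (the clue was helpful).
       < 0: muting it helps the correct answer (the clue was a distraction)."""
    return (answer_logprob(segment_effects, correct_token)
            - answer_logprob(segment_effects, correct_token, mute=i))

correct = 2  # index of the correct answer token
segments = [
    np.array([0.0, 0.0, 1.5]),  # genuine clue: points at the answer
    np.array([1.0, 0.0, 0.0]),  # red herring: points elsewhere
]
helpful = directional_score(segments, correct, 0)      # positive
distracting = directional_score(segments, correct, 1)  # negative
```

The sign alone distinguishes the ship-steering clue from the one pulling it off-course.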

4. The "Road Curve" (Curvature Diagnostics)

Sometimes, the detective makes a sharp U-turn. DRTC has a special sensor that measures how sharp that turn was.

  • If the AI suddenly stops thinking about "murder" and starts thinking about "math," that's a sharp curve. DRTC notes this as a "reorientation," helping us see where the strategy completely flipped.
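The "sharpness of the turn" can be sketched as the angle between consecutive steps of the model's hidden-state trajectory: angles near 180° mark a U-turn. The 2-D trajectory and the 90° threshold below are illustrative assumptions:

```python
import numpy as np

def turn_angles(hidden_states):
    """Angle (radians) between consecutive steps of a trajectory."""
    deltas = np.diff(np.asarray(hidden_states, dtype=float), axis=0)
    angles = []
    for a, b in zip(deltas[:-1], deltas[1:]):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(float(np.arccos(np.clip(cos, -1.0, 1.0))))
    return angles

# Toy 2-D "hidden states": the model walks right, then makes a sharp U-turn.
trajectory = [[0, 0], [1, 0], [2, 0], [1, 0.1], [0, 0.2]]
angles = turn_angles(trajectory)
reorientations = [t + 1 for t, a in enumerate(angles) if a > np.pi / 2]
print(reorientations)  # -> [2]: the strategy flips at step 2
```

On a real model the points would be high-dimensional hidden states rather than 2-D coordinates, but the curvature reading is the same: a large angle signals a reorientation.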

Why is this a big deal?

Before this, we were like people trying to understand a car crash by looking at the final wreckage. We knew what happened, but not how the driver lost control.

DRTC lets us watch the driver's hands on the wheel in real-time. It tells us:

  • "The driver was driving straight, then saw a deer (a pivot), and the 'deer' sign (a specific text chunk) made them swerve left."
  • "Actually, that 'deer' sign was a fake billboard (negative score); it almost made them crash!"

In Summary:
DRTC is a tool that maps the journey of an AI's thinking. It identifies the exact moments of decision, tests which pieces of information actually pushed the AI toward the right answer, and which ones were just noise or distractions. It turns a black box of "magic thinking" into a transparent map of cause and effect.
