TRAJEVAL: Decomposing Code Agent Trajectories for… — Plain-Language Explanation

Imagine you hire a very smart, but sometimes clumsy, robot assistant to fix a broken machine in a giant warehouse (a software codebase). Your goal is for the robot to find the broken part and fix it.

Currently, when we test these robots, we only ask one question at the end: "Did it work?"

If the machine runs, we say "Good job!"
If it doesn't, we say "Fail."

This is like grading a student only on whether they got the final answer right on a math test, without looking at their work. If they got it wrong, you have no idea if they:

Couldn't find the right page in the textbook.
Found the page but didn't understand the formula.
Understood the formula but wrote down the wrong numbers.

TRAJEVAL is a new "magnifying glass" that lets us watch the robot's entire journey step-by-step to see exactly where it got stuck.

The Three-Stage Journey

The authors break the robot's work into three simple stages, like a detective solving a mystery:

The Search (Finding the Crime Scene):
- The Metaphor: The robot has to find the specific room in the warehouse where the broken machine is.
- The Problem: Sometimes the robot opens 100 doors just to find the one that matters. It's like searching the whole house for a lost key when it was just in the kitchen.
- TRAJEVAL's View: It measures Recall (Did you find the right room?) and Precision (Did you waste time opening the wrong doors?).
The Read (Reading the Blueprint):
- The Metaphor: Once in the room, the robot has to read the instruction manual to understand how the machine works.
- The Problem: The robot might open the manual but only read the cover, or read the wrong chapter.
- TRAJEVAL's View: Did the robot actually read the specific paragraph that explains the fix?
The Edit (Turning the Wrench):
- The Metaphor: The robot tries to tighten the loose screw.
- The Problem: The robot might understand the problem perfectly but try to tighten the wrong screw, or tighten the right screw in the wrong spot.
- TRAJEVAL's View: Did the robot touch the exact part that needed fixing?

What They Discovered

By watching more than 16,000 of these robot journeys, the researchers found some surprising things:

The "Over-Explorers": Almost every robot is incredibly inefficient. They look at 22 times more code than they actually need to. They are like someone who reads the entire dictionary to find the definition of "cat."
Different Robots, Different Flaws:
- Robot A (GPT-5): Is great at finding the right room and reading the manual, but it's clumsy with the wrench. It knows what to fix but not where to fix it.
- Robot B (Qwen-32B): Is terrible at finding the room. It wanders around the warehouse forever and never finds the broken machine.
The Secret to Success: The most important thing for a robot to succeed isn't being efficient (Precision); it's being thorough enough to find the right stuff (Recall). If a robot reads every single file in the warehouse but eventually fixes the right screw, it wins. If it wastes no time but fixes the wrong screw, it loses.

The "Magic Nudge"

The coolest part of the paper is that they didn't just watch the robots; they helped them.

Imagine the robot is searching the warehouse. Every time it walks into a room that actually contains the broken machine, a little voice whispers: "Hey, you're in the right place! Keep looking here!"

The Result: This simple nudge made the robots better (they fixed more bugs) and cheaper (they used less computer power) because they stopped wasting time in the wrong rooms.

Why This Matters

Before this, if a robot failed, we just knew it failed. Now, we have a diagnostic dashboard.

If a robot is failing, we can say, "Oh, it's bad at finding files," and give it a better map.
Or, "It's good at finding files but bad at editing," and give it better wrenches.

In short: TRAJEVAL turns the black box of AI coding into a transparent process. It stops us from guessing why an AI failed and starts giving us a roadmap to fix the AI itself.

1. Problem Statement

Current evaluation metrics for Large Language Model (LLM) based code agents (e.g., SWE-Agent, OpenHands) rely heavily on outcome-based metrics like Pass@k. These metrics provide a binary success/failure signal but lack visibility into where or why an agent failed.

Limitation: A single Pass@1 score collapses the entire execution history into one number, making it impossible to distinguish between an agent that failed to find the right file versus one that found the file but edited the wrong function.
Goal: To move beyond outcome-based benchmarking toward mechanism-driven diagnosis that decomposes agent behavior into interpretable stages to identify specific failure modes and inefficiencies.

2. Methodology: TRAJEVAL Framework

The authors introduce TRAJEVAL, a diagnostic framework that decomposes an agent's execution trajectory into three sequential, interpretable stages. It compares the agent's actions against a Golden Context derived from the ground-truth reference patch ( $P^*$ ).
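As a rough illustration of how the file-level part of a golden context could be derived from a reference patch, the sketch below collects the paths named in the `+++` headers of a unified diff. The function name `golden_files_from_patch` and the toy patch are hypothetical, and the paper's actual extraction procedure may differ; function-level targets would additionally require parsing the source files, which is not attempted here.

```python
from typing import Set


def golden_files_from_patch(patch_text: str) -> Set[str]:
    """Collect file paths from the '+++' headers of a git-style unified diff.

    Simplifying assumptions (not from the paper):
    - paths use the 'b/' prefix that `git diff` emits;
    - deleted files ('+++ /dev/null') are skipped;
    - no added source line happens to begin with '++ ', which would
      look like a header to this naive check.
    """
    files: Set[str] = set()
    for line in patch_text.splitlines():
        if not line.startswith("+++ "):
            continue
        path = line[4:].strip()
        if path == "/dev/null":
            continue
        if path.startswith("b/"):
            path = path[2:]
        files.add(path)
    return files


# Toy patch, invented for illustration only.
toy_patch = "\n".join([
    "diff --git a/pkg/core.py b/pkg/core.py",
    "--- a/pkg/core.py",
    "+++ b/pkg/core.py",
    "@@ -1,3 +1,3 @@",
    "-    return x",
    "+    return x + 1",
])

golden_files = golden_files_from_patch(toy_patch)
print(golden_files)
```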

A. Three-Stage Decomposition

The framework defines the "Golden Context" ( $G$ ) as the minimal set of files ( $F^*$ ) and functions ( $H^*$ ) modified in the reference patch. It then analyzes the agent's trajectory ( $T$ ) across three stages:

Search (File-Level Localization):
- Metric: Measures the agent's ability to locate relevant files.
- Precision ( $P_s$ ): Efficiency of exploration (how many viewed files were actually needed).
- Recall ( $R_s$ ): Effectiveness of discovery (did the agent find all necessary files?).
Read (Function-Level Comprehension):
- Metric: Measures the agent's ability to identify relevant functions within the viewed files.
- Precision ( $P_r$ ): Did the agent read only relevant functions?
- Recall ( $R_r$ ): Did the agent read all necessary functions?
Edit (Modification Targeting):
- Metric: Measures the accuracy of the final code modifications.
- Precision ( $P_e$ ): Were the edited functions the correct ones?
- Recall ( $R_e$ ): Did the agent edit all necessary functions?
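
The precision and recall pairs above all follow the same set-based pattern, which can be sketched as follows. This is a minimal sketch: the helper name `precision_recall`, the zero-denominator convention, and the toy file names are assumptions for illustration, not details taken from the paper.

```python
from typing import Set, Tuple


def precision_recall(touched: Set[str], golden: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of the items an agent touched
    (viewed, read, or edited) against the golden items."""
    hits = touched & golden
    # Assumed convention: an empty denominator yields 0.0.
    precision = len(hits) / len(touched) if touched else 0.0
    recall = len(hits) / len(golden) if golden else 0.0
    return precision, recall


# Toy search-stage example (made-up values).
golden_files = {"pkg/core.py"}
viewed_files = {"pkg/core.py", "pkg/utils.py", "pkg/cli.py", "README.md"}

p_s, r_s = precision_recall(viewed_files, golden_files)
print(f"P_s={p_s:.2f} R_s={r_s:.2f}")  # prints "P_s=0.25 R_s=1.00"
```

The same helper would apply to the Read and Edit stages by passing sets of function identifiers instead of file paths.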

B. Golden Context Hierarchy

To ensure robustness, the framework uses a two-tier golden context:

Tier 0 (Core): Files and functions explicitly modified in the patch. Used for strict Recall calculation.
Tier 1 (Extended): Tier 0 plus one-hop structural dependencies (imports, parent classes, test files). Used to refine Precision calculations, distinguishing between wasteful navigation and reasonable structural exploration.
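
One plausible reading of the two tiers is that recall is scored against Tier 0 only, while precision gives credit for anything in Tier 1. The sketch below encodes that reading. The function name `tiered_scores` and the example paths are hypothetical, and the exact way the paper folds Tier 1 into its precision calculation is not specified here.

```python
from typing import Set, Tuple


def tiered_scores(viewed: Set[str], tier0: Set[str], one_hop: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) under an assumed two-tier scheme:
    recall against Tier 0, precision against Tier 1 = Tier 0 + one-hop deps."""
    tier1 = tier0 | one_hop
    recall = len(viewed & tier0) / len(tier0) if tier0 else 0.0
    precision = len(viewed & tier1) / len(viewed) if viewed else 0.0
    return precision, recall


# Toy example (made-up paths).
tier0 = {"a.py"}
one_hop = {"base.py", "test_a.py"}  # e.g. a parent class and a test file
viewed = {"a.py", "base.py", "x.py", "y.py"}

precision, recall = tiered_scores(viewed, tier0, one_hop)
print(precision, recall)  # prints "0.5 1.0"
```

Under a strict Tier 0 precision the same trajectory would score 0.25, so viewing `base.py` is no longer counted as wasted navigation.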

C. Feature Extraction & Prediction

Extraction: The system parses agent logs (tool calls like view, edit, bash) to extract sets of viewed files/functions and edited targets. Because it relies only on logged tool calls, this extraction is architecture-agnostic.
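
A minimal sketch of the extraction step, assuming a hypothetical log schema in which each tool call is a dict with `tool` and `path` keys. Real agent logs differ across frameworks, and this schema is not taken from the paper.

```python
from typing import Dict, Iterable, Set


def extract_file_sets(tool_calls: Iterable[Dict[str, str]]) -> Dict[str, Set[str]]:
    """Collect viewed and edited file paths from a sequence of tool calls.

    The 'tool'/'path' keys are an assumed schema for illustration.
    """
    viewed: Set[str] = set()
    edited: Set[str] = set()
    for call in tool_calls:
        path = call.get("path")
        if not path:
            # e.g. a bash call; recovering files from shell commands
            # (cat, grep, ...) would need extra parsing not shown here.
            continue
        if call.get("tool") == "view":
            viewed.add(path)
        elif call.get("tool") == "edit":
            edited.add(path)
    return {"viewed": viewed, "edited": edited}


# Toy log (made-up entries).
toy_log = [
    {"tool": "view", "path": "pkg/core.py"},
    {"tool": "bash", "command": "ls"},
    {"tool": "view", "path": "pkg/utils.py"},
    {"tool": "edit", "path": "pkg/core.py"},
]

sets = extract_file_sets(toy_log)
print(sets)
```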
Prediction Model: A lightweight Logistic Regression classifier uses the six trajectory features ( $P_s, R_s, P_r, R_r, P_e, R_e$ ) to predict the probability of task success (Pass@1).
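
To make the prediction step concrete, here is a small pure-Python logistic regression fitted by gradient descent on the six features. This is a sketch only: the training rows are synthetic toy values, the hyperparameters are arbitrary, and the paper's actual training setup (library, regularization, data splits) is not reproduced.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Predicted probability of task success for one trajectory."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_logistic(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Full-batch gradient descent on the logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic toy rows, feature order: P_s, R_s, P_r, R_r, P_e, R_e.
X = [
    [0.10, 1.0, 0.05, 1.0, 0.5, 1.0],
    [0.20, 1.0, 0.04, 1.0, 1.0, 1.0],
    [0.05, 1.0, 0.05, 0.5, 0.5, 0.0],
    [0.30, 0.0, 0.00, 0.0, 0.0, 0.0],
    [0.15, 1.0, 0.06, 1.0, 0.0, 0.0],
    [0.08, 1.0, 0.03, 1.0, 1.0, 1.0],
]
y = [1, 1, 0, 0, 0, 1]  # 1 = task resolved, 0 = not resolved

w, b = fit_logistic(X, y)
print(f"p(success | row 1) = {predict_proba(w, b, X[1]):.2f}")
print(f"p(success | row 3) = {predict_proba(w, b, X[3]):.2f}")
```

Averaging such per-trajectory probabilities over a benchmark would give one possible estimate of a model's Pass@1, though whether the authors aggregate this way is an assumption.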

3. Key Contributions

Novel Diagnostic Framework: Introduced TRAJEVAL, the first framework to decompose agent trajectories into Search, Read, and Edit stages with precision/recall metrics.
Large-Scale Empirical Analysis: Analyzed 16,758 trajectories across 3 agent architectures (SWE-Agent, OpenHands, LiveSWE-Agent) and 7 LLMs (ranging from 8B to 480B parameters) on SWE-bench and PolyBench.
Actionable Diagnostics: Demonstrated that trajectory metrics are not just descriptive but actionable. Real-time feedback based on these signals significantly improves performance.
Universal Inefficiency vs. Distinct Failure Modes: Revealed that while all agents suffer from massive over-exploration (low precision), different models fail at distinct stages (e.g., search vs. edit).

4. Key Results & Findings

A. Predictive Power

High Accuracy: The trajectory features predict model-level Pass@1 with a Mean Absolute Error (MAE) of 0.87% – 2.1% across various distribution shifts (instance, repository, and language held-out).
Ranking Preservation: The framework maintains a Spearman correlation ( $\rho \ge 0.886$ ) in model ranking even under significant distribution shifts (e.g., training on Python, testing on Java).
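
For readers unfamiliar with the two evaluation measures, the sketch below computes the Mean Absolute Error and Spearman's rank correlation on invented Pass@1 values. The numbers are toy data, not results from the paper, and the closed-form Spearman formula used here assumes there are no ties.

```python
from typing import List, Sequence


def mean_absolute_error(pred: Sequence[float], actual: Sequence[float]) -> float:
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


def ranks(values: Sequence[float]) -> List[int]:
    """1-based ranks in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy Pass@1 percentages for five hypothetical models.
actual = [10.0, 20.0, 30.0, 40.0, 50.0]
predicted = [11.0, 19.0, 42.0, 38.0, 51.0]

mae = mean_absolute_error(predicted, actual)
rho = spearman_rho(predicted, actual)
print(f"MAE={mae:.2f} rho={rho:.3f}")  # prints "MAE=3.40 rho=0.900"
```

In this toy case one adjacent pair of models swaps places in the predicted ranking, which is enough to lower the correlation from 1.0 to 0.9.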

B. Behavioral Insights

Universal Over-Exploration: All agents exhibit extremely low precision across all stages. Agents examine about 22x more functions than necessary, which is consistent with read-stage precision of typically only 4–5% (1/22 ≈ 4.5%).
Recall Drives Success: Unlike precision, Recall (specifically Edit Recall) strongly correlates with task success. An agent that explores excessively but hits the right target succeeds; an efficient agent that misses the target fails.
Distinct Failure Modes:
- GPT-5: High search/read recall but low edit recall. It finds the code but targets the wrong modification locations.
- Qwen-32B: Low search recall. It fails to discover the relevant files entirely.
- Qwen-8B: Low read recall. It finds files but fails to comprehend the specific functions within them.
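
The failure-mode profiles above suggest a simple diagnostic rule: walk the stages in order and report the first one whose recall is low. The sketch below shows one way to do that. The 0.5 threshold, the function name `first_bottleneck`, and the recall values are illustrative assumptions, not measurements from the paper.

```python
from typing import Dict, Optional

STAGES = ("search", "read", "edit")


def first_bottleneck(recalls: Dict[str, float], threshold: float = 0.5) -> Optional[str]:
    """Return the earliest stage whose recall is below `threshold`,
    or None if every stage clears it. The threshold is an arbitrary choice."""
    for stage in STAGES:
        if recalls[stage] < threshold:
            return stage
    return None


# Illustrative profiles only; the numbers are invented.
profiles = {
    "finds code, mis-targets edits": {"search": 0.90, "read": 0.85, "edit": 0.30},
    "never finds the files": {"search": 0.20, "read": 0.10, "edit": 0.05},
    "finds files, misses functions": {"search": 0.80, "read": 0.30, "edit": 0.20},
}

for name, recalls in profiles.items():
    print(f"{name}: bottleneck = {first_bottleneck(recalls)}")
```

Reporting the earliest failing stage matters because later stages depend on earlier ones: an agent that never opens the right file cannot read or edit the right function.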

C. Actionable Intervention

The authors implemented a real-time feedback mechanism where the system signals the agent when it is viewing "golden context" (e.g., "✓ You are examining relevant code").

Performance Gain: Improved Pass@1 by 2.2 – 4.6 percentage points for state-of-the-art models (GPT-5 and Qwen3-Coder-480B).
Efficiency Gain: Reduced token usage by 20–29% and total cost by 20–31% by guiding agents to stop unnecessary exploration earlier.
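
The feedback mechanism described in this subsection can be pictured as a thin wrapper around the tool output the agent sees. The sketch below appends the confirmation message whenever a viewed path belongs to the golden context. The function name `annotate_observation` and the way the message is attached are assumptions; only the message text comes from the description above.

```python
from typing import Set

FEEDBACK = "✓ You are examining relevant code"


def annotate_observation(viewed_path: str, observation: str, golden_files: Set[str]) -> str:
    """Append the feedback line when the viewed file is in the golden context.

    How the real system injects its signal into the agent loop is not
    specified here; appending to the tool output is one simple option.
    """
    if viewed_path in golden_files:
        return f"{observation}\n{FEEDBACK}"
    return observation


# Toy usage (made-up paths and file contents).
golden = {"pkg/core.py"}
hit = annotate_observation("pkg/core.py", "def f(x): return x", golden)
miss = annotate_observation("pkg/cli.py", "def main(): ...", golden)
print(hit)
print(miss)
```

Because the signal depends on the golden context, this kind of intervention presupposes access to the reference patch (or a proxy for it) at run time.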

5. Significance

Paradigm Shift: TRAJEVAL shifts the evaluation paradigm from "Did it work?" (Outcome) to "How did it work?" (Mechanism). This allows developers to pinpoint specific bottlenecks (e.g., file discovery vs. semantic understanding) rather than treating the agent as a black box.
Optimization Strategy: The findings suggest that current agents should prioritize improving recall (ensuring all relevant code is found) rather than improving precision (reducing exploration), as over-exploration is a common but non-fatal trait, whereas missing the target is fatal.
Deployment Readiness: The ability to predict success and provide real-time, low-cost feedback demonstrates that trajectory analysis can be integrated into production workflows to enhance both the accuracy and cost-efficiency of autonomous coding agents.

TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis