TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis

The paper introduces TRAJEVAL, a diagnostic framework that decomposes code agent trajectories into search, read, and edit stages for fine-grained failure analysis. It reveals universal inefficiencies and distinct model-specific weaknesses, and it enables actionable feedback that significantly improves agent performance and reduces costs.

Myeongsoo Kim, Dingmin Wang, Siwei Cui, Farima Farmahinifarahani, Shweta Garg, Baishakhi Ray, Terry Yue Zhuo, Rajdeep Mukherjee, Varun Kumar

Published 2026-03-27

Imagine you hire a very smart, but sometimes clumsy, robot assistant to fix a broken machine in a giant warehouse (a software codebase). Your goal is for the robot to find the broken part and fix it.

Currently, when we test these robots, we only ask one question at the end: "Did it work?"

  • If the machine runs, we say "Good job!"
  • If it doesn't, we say "Fail."

This is like grading a student only on whether they got the final answer right on a math test, without looking at their work. If they got it wrong, you have no idea if they:

  1. Couldn't find the right page in the textbook.
  2. Found the page but didn't understand the formula.
  3. Understood the formula but wrote down the wrong numbers.

TRAJEVAL is a new "magnifying glass" that lets us watch the robot's entire journey step-by-step to see exactly where it got stuck.

The Three-Stage Journey

The authors break the robot's work into three simple stages, like a detective solving a mystery:

  1. The Search (Finding the Crime Scene):

    • The Metaphor: The robot has to find the specific room in the warehouse where the broken machine is.
    • The Problem: Sometimes the robot opens 100 doors just to find the one that matters. It's like searching the whole house for a lost key when it was just in the kitchen.
    • TRAJEVAL's View: It measures Recall (Did you find the right room?) and Precision (Did you waste time opening the wrong doors?).
  2. The Read (Reading the Blueprint):

    • The Metaphor: Once in the room, the robot has to read the instruction manual to understand how the machine works.
    • The Problem: The robot might open the manual but only read the cover, or read the wrong chapter.
    • TRAJEVAL's View: Did the robot actually read the specific paragraph that explains the fix?
  3. The Edit (Turning the Wrench):

    • The Metaphor: The robot tries to tighten the loose screw.
    • The Problem: The robot might understand the problem perfectly but try to tighten the wrong screw, or tighten the right screw in the wrong spot.
    • TRAJEVAL's View: Did the robot touch the exact part that needed fixing?
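
To make Recall and Precision concrete, here is a minimal Python sketch of how one stage could be scored. It assumes we already know which files the real fix needed (the "right rooms") and which files the robot actually opened. The names `StageScore` and `score_stage` and the file paths are made up for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageScore:
    precision: float  # share of what the agent touched that was actually needed
    recall: float     # share of what was needed that the agent actually touched


def score_stage(touched: set[str], needed: set[str]) -> StageScore:
    """Compare what the agent touched in one stage with what the fix needed."""
    hits = touched & needed
    precision = len(hits) / len(touched) if touched else 0.0
    recall = len(hits) / len(needed) if needed else 0.0
    return StageScore(precision, recall)


# Hypothetical search stage: the fix needed one file, the agent opened four.
needed_files = {"billing/invoice.py"}
searched_files = {
    "billing/invoice.py",
    "billing/tax.py",
    "utils/dates.py",
    "README.md",
}
search = score_stage(searched_files, needed_files)
print(search)
```

In this toy case the robot found the right room (recall is 1.0) but three of the four doors it opened were wasted (precision is 0.25). The same comparison could be repeated for the read and edit stages, using the needed functions or lines in place of files.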

What They Discovered

By watching 16,000 of these robot journeys, the researchers found some surprising things:

  • The "Over-Explorers": Almost every robot is incredibly inefficient. They look at 22 times more code than they actually need to. They are like someone who reads the entire dictionary to find the definition of "cat."
  • Different Robots, Different Flaws:
    • Robot A (GPT-5): Is great at finding the right room and reading the manual, but it's clumsy with the wrench. It knows what to fix but not exactly where to fix it.
    • Robot B (Qwen-32B): Is terrible at finding the room. It wanders around the warehouse forever and never finds the broken machine.
  • The Secret to Success: The most important thing for a robot to succeed isn't being efficient (Precision); it's being thorough enough to find the right stuff (Recall). If a robot reads every single file in the warehouse but eventually fixes the right screw, it wins. If it opens very few doors but fixes the wrong screw, it loses.

The "Magic Nudge"

The coolest part of the paper is that they didn't just watch the robots; they helped them.

Imagine the robot is searching the warehouse. Every time it walks into a room that actually contains the broken machine, a little voice whispers: "Hey, you're in the right place! Keep looking here!"

  • The Result: This simple nudge made the robots better (they fixed more bugs) and cheaper (they used less computer power) because they stopped wasting time in the wrong rooms.
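
As a rough illustration of the "little voice", here is a small Python sketch. It assumes the set of relevant files is known ahead of time (for example, from a reference fix) and that the agent's loop can receive a text hint after each file it opens. The function name `nudge` and the hint wording are hypothetical; the paper's actual feedback mechanism may differ.

```python
from typing import Optional


def nudge(opened_file: str, relevant_files: set[str]) -> Optional[str]:
    """Return a short hint when the agent opens a relevant file, else None."""
    if opened_file in relevant_files:
        return f"Hint: {opened_file} is relevant to the fix. Keep looking here."
    return None


# Hypothetical walk through the warehouse: only one room holds the broken machine.
relevant = {"billing/invoice.py"}
for path in ["README.md", "billing/invoice.py"]:
    hint = nudge(path, relevant)
    if hint is not None:
        print(hint)
```

The design choice here is that the voice stays silent in the wrong rooms and only speaks up in the right one, so the robot is never told where to go, only when it has arrived.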

Why This Matters

Before this, if a robot failed, we just knew it failed. Now, we have a diagnostic dashboard.

  • If a robot is failing, we can say, "Oh, it's bad at finding files," and give it a better map.
  • Or, "It's good at finding files but bad at editing," and give it better wrenches.
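
One way to picture the dashboard is a function that walks the three stages in order and reports the first one where the robot fell short. This is only a sketch under stated assumptions: the 0.5 cutoff, the `diagnose` name, and the sample scores are invented for illustration and do not come from the paper.

```python
STAGES = ("search", "read", "edit")


def diagnose(recall_by_stage: dict[str, float], threshold: float = 0.5) -> str:
    """Return the first stage whose recall is below the threshold, or "none"."""
    for stage in STAGES:
        # A stage with no recorded score is treated as a failure at that stage.
        if recall_by_stage.get(stage, 0.0) < threshold:
            return stage
    return "none"


# Hypothetical robot that finds and reads the right code but edits the wrong spot.
print(diagnose({"search": 1.0, "read": 0.9, "edit": 0.2}))
```

Checking the stages in order matters: a robot that never found the right room cannot be blamed for a bad edit, so the earliest weak stage is the one to fix first.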

In short: TRAJEVAL turns the black box of AI coding into a transparent process. It stops us from guessing why an AI failed and starts giving us a roadmap to fix the AI itself.
