The Big Problem: Teaching a Robot with Human Books
Imagine you have a brilliant robot assistant (an AI Agent) whose job is to solve complex mysteries, like a detective. To solve a case, the robot needs to search through a massive library of books (the internet) to find clues.
For decades, the "librarians" (the search engines) have been trained by watching humans. They know that if a human clicks a link and stays on the page for a long time, that link is probably good. If they skip it, it's probably bad.
But here's the glitch: The robot detective doesn't think or browse like a human.
- Humans search for fun facts or quick answers.
- Robots search to build a logical chain of reasoning. They might click a link, read a tiny snippet, realize it's useless, and immediately move on. Or they might click a link that looks boring but contains the one crucial fact needed to solve the puzzle.
The paper argues that we are trying to teach a robot how to fish using a manual written for humans. It's a mismatch. The robot is getting bad results because the search engine is optimized for human habits, not robot logic.
The Solution: Let the Robot Teach Itself
The authors propose a new method called LRAT (Learning to Retrieve from Agent Trajectories). Instead of looking at human clicks, they look at the robot's own journey (its "trajectory").
Think of a robot's journey like a cooking show:
- The Search: The robot asks for ingredients (Search).
- The Browse: It picks a specific ingredient to inspect closely (Browse).
- The Reasoning: It talks to itself: "Hmm, this tomato is too old. I'll toss it. But this onion? Perfect. I'll chop it and add it to my recipe."
The paper found three golden rules by watching these cooking shows:
1. If the Robot "Browses," It's a Good Clue
If the robot decides to open a book and read the full text, it's a strong signal that the book is useful.
- Analogy: If a detective walks up to a specific file cabinet and opens a drawer, that file is likely important. Even if they don't read the whole thing, the act of opening it means they saw something promising.
2. If the Robot Ignores a Book, It's a Bad Clue
In human search, if you don't click a link, it might just be because it was at the bottom of the page (position bias). But robots are different. They scan the top 10 results quickly. If they don't pick a book to read, it's usually because they explicitly decided, "This isn't what I need."
- Analogy: If a detective looks at 10 suspects and only handcuffs one, the other 9 aren't just "unseen"; they were actively cleared of suspicion. We can trust the robot's "no" just as much as its "yes."
3. The Length of the Robot's "Thinking" Matters
This is the most clever part. After the robot reads a document, it often writes a "thought trace" explaining what it found.
- Short thought: "This doesn't help." (The document was a dud).
- Long thought: "This document explains the merger date, which connects to the 2017 event mentioned earlier. I can now deduce the answer." (The document was a goldmine).
- Analogy: Imagine a student studying. If they glance at a page and immediately turn the page, they didn't learn much. If they spend 10 minutes highlighting, taking notes, and connecting ideas, that page was incredibly valuable. The paper uses the length of the robot's thinking as a score: the longer the thinking, the more valuable the document was.
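The three rules above can be sketched as a small labeling function. This is an illustrative reconstruction, not the paper's actual code: the `Step` class and all field names are assumptions, and "thought length" is approximated here simply as a word count.

```python
# A minimal sketch of the three golden rules. All names (Step, shown_docs,
# browsed_doc, thought) are illustrative stand-ins, not the paper's API.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Step:
    query: str                  # what the agent searched for
    shown_docs: List[str]       # top results returned by the search tool
    browsed_doc: Optional[str]  # the one doc the agent opened, if any
    thought: str                # reasoning trace written after reading

def label_examples(steps):
    """Turn agent steps into (query, doc, label, weight) training tuples."""
    examples = []
    for step in steps:
        for doc in step.shown_docs:
            if doc == step.browsed_doc:
                # Rule 1: a browsed doc is a positive signal.
                # Rule 3: a longer thought trace means a more valuable doc,
                # so use thought length (word count) as the example's weight.
                weight = len(step.thought.split())
                examples.append((step.query, doc, 1, weight))
            else:
                # Rule 2: a shown-but-skipped doc is a trusted negative.
                examples.append((step.query, doc, 0, 1))
    return examples
```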
How LRAT Works (The Recipe)
The authors built a system called LRAT that acts like a smart filter:
- Collect: It gathers thousands of robot journeys (trajectories).
- Filter: It uses a "Judge" (another AI) to check the robot's thoughts. Did the robot actually use the info? If yes, keep it. If the robot opened the book but immediately said "useless," throw it out.
- Weight: It gives extra points to documents that made the robot think hard and long.
- Train: It teaches the search engine (the retriever) using these new, robot-specific rules.
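The collect-filter-weight steps of that recipe might look something like the sketch below. The judge and the trajectory format are placeholders (the paper's actual judge is another AI model, represented here as a caller-supplied function), and the final "Train" step is just noted in a comment, since it would use a standard retriever trainer.

```python
# A hedged sketch of the Collect -> Filter -> Weight portion of the pipeline.
# Trajectory/step field names and `judge_says_useful` are assumptions.

def build_training_set(trajectories, judge_says_useful):
    dataset = []
    for traj in trajectories:                       # Collect: one agent journey
        for step in traj:
            doc, thought = step["browsed_doc"], step["thought"]
            if doc is None:
                continue                            # nothing was opened this step
            if not judge_says_useful(thought):      # Filter: the judge reads the
                continue                            # thought; "useless" -> discard
            weight = len(thought.split())           # Weight: longer thinking,
            dataset.append({                        # higher score
                "query": step["query"],
                "positive": doc,
                "weight": weight,
            })
    return dataset

# Train: feed the weighted dataset to any standard retriever trainer, e.g.
#   retriever = train_retriever(dataset)
```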
The Results: A Supercharged Detective
When they tested this new training method:
- Better Success: Robots solved more puzzles correctly.
- Faster Work: Robots found the answers in fewer steps (they didn't waste time clicking bad links).
- Works Everywhere: It worked whether the robot was small and simple or a giant, super-smart model.
The "Data Flywheel" (The Self-Improving Loop)
The most exciting part is the future potential.
- Old Way: Humans click links → Search engine learns → Humans click more.
- New Way: Robots search → Robots generate "thoughts" → Search engine learns from robots → Robots get smarter → They generate even better thoughts.
It's like a self-improving loop. As the search engine gets better at understanding robots, the robots get better at solving problems, which creates even better data to train the search engine again.
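That loop is simple enough to write down directly. In this sketch the three helper functions are hypothetical parameters standing in for the components described above (agents running tasks, the labeling/filtering pipeline, and a retriever trainer).

```python
# A minimal sketch of the data flywheel. The three helpers are passed in
# as parameters because they are stand-ins, not real components.

def flywheel(retriever, tasks, run_agents, mine_training_data, retrain, rounds=3):
    for _ in range(rounds):
        trajectories = run_agents(retriever, tasks)      # robots search and think
        data = mine_training_data(trajectories)          # extract robot-specific labels
        retriever = retrain(retriever, data)             # search engine learns from robots
    return retriever                                     # smarter retriever -> better data next round
```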
Summary
This paper says: Stop teaching robots to search like humans. Instead, watch how robots actually search, listen to their internal monologue, and use that to train a search engine that speaks "Robot." The result is a search system that helps AI agents solve complex problems faster and more accurately.