Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Imagine you are trying to teach a robot butler how to do complex chores, like booking a flight, fixing a broken computer, or writing a novel. To teach it, you need to show it examples of how to do these things.

The problem is, right now, everyone who is building these robots is speaking a different language.

Researcher A writes their examples in a notebook using only English sentences.
Researcher B writes theirs in a spreadsheet with code snippets.
Researcher C records screen videos of their robot clicking buttons.

If you want to train your robot using all these examples at once, you have to hire a translator for every single book, spreadsheet, and video. It's a nightmare. You end up spending 90% of your time translating and only 10% actually teaching the robot. Because it's so hard, most people just pick one source and ignore the rest, leaving a huge amount of knowledge on the table.

The Solution: The "Agent Data Protocol" (ADP)

This paper introduces a solution called the Agent Data Protocol (ADP). Think of ADP as a universal translator or a universal power adapter for robot training data.

Here is how it works, using simple analogies:

1. The "Universal Adapter" (The Protocol)

Imagine you have a pile of chargers from different countries: some have two flat pins, some have three round pins, some have square pins. They all plug into different outlets, but they all do the same thing: charge a phone.

ADP is like a universal travel adapter.

Step 1: You take any charger (dataset) from any country (researcher) and plug it into the ADP adapter.
Step 2: The adapter instantly converts it into a standard, universal plug shape.
Step 3: Now, any phone (robot framework) can plug into that standard shape and get charged.

In the paper's terms, they took 13 different datasets (some for coding, some for web browsing, some for tool use) and converted them all into this one standard "ADP format."

2. The "Recipe Book" (The Structure)

The paper explains that no matter what a robot is doing, each step of its work can be broken down into one of two simple things:

Actions: What the robot does (e.g., "Click this button," "Write this code," "Call this API").
Observations: What the robot sees or hears back (e.g., "The page loaded," "The code ran successfully," "The error message appeared").

ADP forces everyone to write their data in this simple "Action + Observation" recipe format. It's like telling everyone to write their recipes using only "Ingredients" and "Steps," regardless of whether they are making soup or a cake.

3. The Result: A Super-Student

The researchers took this massive, unified pile of data (1.3 million examples!) and used it to train a robot.

Before ADP: If you trained a robot on just one type of data (like only coding), it was good at coding but terrible at browsing the web.
With ADP: Because they mixed everything together (coding, browsing, tools) into one big pot, the robot learned to be a generalist.

The Magic Numbers:

The robots trained with this mixed data got about 20% better on average than the same base models before fine-tuning.
They became so good that they matched or beat the most advanced robots in the world, even though they weren't specifically tuned for just one task.

Why This Matters

Before this paper, if you wanted to build a better robot, you had to be a "data janitor," spending months cleaning and translating different datasets.

With ADP:

One-time work: You convert a dataset to ADP once.
Plug-and-play: Any new robot framework can immediately use that data without you doing any extra work.
Community Power: Instead of 100 researchers each stocking their own tiny bookshelf, they are all contributing to one giant, shared library of knowledge.

The Bottom Line

The Agent Data Protocol is the "Rosetta Stone" for AI agents. It stops researchers from reinventing the wheel and wasting time on translation. By speaking a common language, they can combine all their knowledge to build smarter, more capable robots much faster.

In short: They built a universal translator for robot training data, allowing everyone to share their best lessons, resulting in robots that are significantly smarter and more versatile.

The rest of this document is a detailed technical summary of the paper "Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-Tuning of LLM Agents" (ICLR 2026).

1. Problem Statement

Despite the abundance of data for Large Language Model (LLM) pre-training, supervised fine-tuning (SFT) for AI agents remains a bottleneck in academic research. The primary challenge is not a lack of data sources, but rather the fragmentation and heterogeneity of existing agent datasets.

Heterogeneity: Existing datasets (e.g., for coding, web browsing, tool use) use inconsistent formats, action spaces, and observation structures. For example, one dataset might represent web interactions via raw HTML, while another uses accessibility trees.
Engineering Overhead: To train an agent on multiple datasets, researchers must write a custom conversion pipeline for every Dataset $\times$ Agent Framework pair. The number of converters therefore grows as $O(D \times A)$, where $D$ is the number of datasets and $A$ is the number of agent frameworks (quadratic when both grow together), leading to duplicated effort and brittle integration.
Underutilization: Due to these integration costs, high-quality agent data remains siloed, preventing large-scale, cross-domain training that could improve generalization.

2. Methodology: The Agent Data Protocol (ADP)

The authors propose the Agent Data Protocol (ADP), a lightweight, expressive representation language designed to act as an "interlingua" between diverse agent datasets and downstream training pipelines.

Core Design Principles

Simplicity: A unified schema that eliminates the need for per-dataset engineering.
Standardization: Converts heterogeneous formats into a single, consistent representation.
Expressiveness: Capable of capturing complex agentic trajectories including API calls, code execution, browsing, and general tool use without information loss.

Technical Architecture

ADP is implemented using Pydantic schemas. An agent trajectory is represented as a Trajectory object containing:

ID: Unique trajectory identifier.
Content: An alternating sequence of Actions and Observations.
- Actions:
  - API Action: Structured function calls (e.g., goto(url=...)) with arguments and descriptions.
  - Code Action: Code blocks with language specification (e.g., Python) and content.
  - Message Action: Natural language communication between agent and user.
- Observations:
  - Text Observation: Text feedback from the environment or user.
  - Web Observation: Structured webpage state including HTML, accessibility trees (axtree), URLs, and optional screenshots.
Details: Flexible metadata for dataset-specific provenance.
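
The structure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' schema: it uses standard-library dataclasses instead of Pydantic so that it runs without extra dependencies, and every field name (function, kwargs, source, axtree, and so on) is an assumption chosen to mirror the descriptions in the list.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Union


@dataclass
class ApiAction:
    function: str                                   # e.g. "goto"
    kwargs: dict[str, Any] = field(default_factory=dict)
    description: Optional[str] = None               # hypothetical field name


@dataclass
class CodeAction:
    language: str                                   # e.g. "python"
    content: str


@dataclass
class MessageAction:
    content: str                                    # natural-language message


@dataclass
class TextObservation:
    content: str
    source: str = "environment"                     # assumed values: "environment" or "user"


@dataclass
class WebObservation:
    url: str
    html: Optional[str] = None
    axtree: Optional[str] = None                    # accessibility tree, serialized as text
    screenshot: Optional[str] = None                # optional; assumed here to be a file path


Action = Union[ApiAction, CodeAction, MessageAction]
Observation = Union[TextObservation, WebObservation]


@dataclass
class Trajectory:
    id: str
    content: list[Union[Action, Observation]]
    details: dict[str, Any] = field(default_factory=dict)


# A made-up trajectory: user request, one API call, the resulting page, a reply.
example = Trajectory(
    id="demo-0001",
    content=[
        TextObservation(content="Find the docs page.", source="user"),
        ApiAction(function="goto", kwargs={"url": "https://example.com/docs"}),
        WebObservation(url="https://example.com/docs", axtree="RootWebArea 'Docs'"),
        MessageAction(content="The docs page is open."),
    ],
    details={"source_dataset": "hypothetical-demo"},
)
```

A Pydantic version would add runtime validation of field types on construction, which is presumably why the authors chose it; the shape of the data would stay the same.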

The Conversion Pipeline

The authors implemented a Hub-and-Spoke pipeline to reduce engineering complexity from quadratic to linear ( $O(D + A)$ ):

Raw $\to$ ADP (Dataset Conversion): Each raw dataset is converted to the ADP schema once. This involves mapping specific actions/observations to the ADP standard.
ADP $\to$ SFT (Agent Conversion): Each agent framework (e.g., OpenHands, SWE-Agent) maintains a single script to convert ADP trajectories into its specific training format (e.g., specific prompt templates, action spaces).
Quality Assurance: Automated validation ensures tool call formats, reasoning traces, and conversation structures are correct.
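
The three stages above might be wired together roughly as follows. This is a sketch under stated assumptions: the "toy-web" dataset, the "toy-chat" framework, the dictionary layout of a trajectory, and all function names are hypothetical, and the validation step is far simpler than the checks the paper describes.

```python
import json
from typing import Callable

ACTION_KINDS = {"api_action", "code_action", "message_action"}
OBSERVATION_KINDS = {"text_observation", "web_observation"}


def toy_web_to_adp(raw: dict) -> dict:
    """Raw -> ADP for a hypothetical web dataset (written once per dataset)."""
    steps = [{"kind": "text_observation", "source": "user", "content": raw["task"]}]
    for url in raw["visited"]:
        steps.append({"kind": "api_action", "function": "goto", "kwargs": {"url": url}})
        steps.append({"kind": "web_observation", "url": url})
    return {"id": raw["uid"], "content": steps, "details": {"dataset": "toy-web"}}


def validate(traj: dict) -> None:
    """Minimal quality check: the trajectory is non-empty and uses known step kinds."""
    if not traj["content"]:
        raise ValueError(f"{traj['id']}: empty trajectory")
    for step in traj["content"]:
        if step["kind"] not in ACTION_KINDS | OBSERVATION_KINDS:
            raise ValueError(f"{traj['id']}: unknown step kind {step['kind']!r}")


def adp_to_toy_chat(traj: dict) -> list[dict]:
    """ADP -> SFT for a hypothetical chat framework (written once per framework).

    Actions become assistant turns; observations become user turns.
    """
    messages = []
    for step in traj["content"]:
        role = "assistant" if step["kind"] in ACTION_KINDS else "user"
        payload = {k: v for k, v in step.items() if k != "kind"}
        messages.append({"role": role, "content": json.dumps(payload, sort_keys=True)})
    return messages


# The hub: each dataset and each framework registers exactly one converter.
RAW_TO_ADP: dict[str, Callable[[dict], dict]] = {"toy-web": toy_web_to_adp}
ADP_TO_SFT: dict[str, Callable[[dict], list[dict]]] = {"toy-chat": adp_to_toy_chat}


def build_sft_example(dataset: str, framework: str, raw: dict) -> list[dict]:
    traj = RAW_TO_ADP[dataset](raw)
    validate(traj)
    return ADP_TO_SFT[framework](traj)


raw_record = {
    "uid": "web-7",
    "task": "Open the pricing page.",
    "visited": ["https://example.com/pricing"],
}
sft = build_sft_example("toy-web", "toy-chat", raw_record)
print([m["role"] for m in sft])  # ['user', 'assistant', 'user']
```

Adding a new dataset means adding one entry to RAW_TO_ADP; adding a new framework means adding one entry to ADP_TO_SFT. Neither change touches the other side, which is the point of routing everything through a shared format.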

3. Key Contributions

The ADP Schema: A standardized, Pydantic-based schema that unifies actions (API, Code, Message) and observations (Text, Web) across diverse domains.
The ADP Dataset V1: The largest publicly available agent training dataset, comprising 1.3 million trajectories unified from 13 existing datasets (including SWE-Gym, Mind2Web, AgentInstruct, etc.).
Efficiency Demonstration: A demonstration that converting 13 datasets to ADP and then to 3 different agent frameworks required significantly less code (linear effort) than the traditional pairwise conversion approach (quadratic effort).
Cross-Task Analysis: The unified format enabled systematic analysis revealing that high-quality datasets consistently include "function thoughts" (reasoning traces) and that different domains (coding vs. browsing) have distinct action distributions.
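
The scale of the efficiency claim can be illustrated with simple arithmetic on the figures given above (13 datasets, 3 frameworks). Note that this sketch counts converters, whereas the paper's comparison is about the amount of code; the function names are illustrative only.

```python
def pairwise_converters(num_datasets: int, num_frameworks: int) -> int:
    """One custom converter for every (dataset, framework) pair."""
    return num_datasets * num_frameworks


def hub_converters(num_datasets: int, num_frameworks: int) -> int:
    """One converter per dataset into ADP, plus one per framework out of ADP."""
    return num_datasets + num_frameworks


D, A = 13, 3
print(pairwise_converters(D, A))  # 39
print(hub_converters(D, A))       # 16
```

The gap widens as either side grows: a fourteenth dataset costs three more converters in the pairwise setup but only one with a shared format.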

4. Experimental Results

The authors evaluated ADP by fine-tuning Qwen2.5-Coder-Instruct models (7B, 14B, and 32B parameters) on the unified ADP dataset and testing them across four major benchmarks: SWE-Bench (Software Engineering), WebArena (Web Browsing), AgentBench (Tool Use), and GAIA (General Reasoning).

Key Findings:

Significant Performance Gains: ADP-trained models showed an average performance improvement of ~20% over their base counterparts across all benchmarks.
- Example: On SWE-Bench (Verified), the 7B model improved from 0.4% (base) to 20.2% (ADP-tuned).
- Example: On WebArena, the 7B model improved from 4.5% to 21.0%.
State-of-the-Art (SOTA) Performance: The ADP-tuned models achieved SOTA or near-SOTA results without domain-specific tuning. For instance, the 32B ADP model on SWE-Bench reached 40.3%, matching or exceeding proprietary models like Claude 3.5 Sonnet.
Cross-Task Transfer: Training on the diverse ADP corpus outperformed training on single-domain datasets.
- Evidence: On SWE-Bench, a model trained only on SWE-specific data (SWE-smith) achieved 11.0%, while the same model trained on the mixed ADP corpus achieved 16.6%. This suggests that mixing diverse data improves generalization rather than causing negative transfer.
Scalability: Performance gains were monotonic with model size (7B $\to$ 14B $\to$ 32B), indicating the protocol scales effectively.
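
To make the reported numbers easier to compare, the short calculation below takes the figures quoted in the findings above and computes the gain for each in percentage points. Reading the "~20%" average as an absolute (percentage-point) improvement is an assumption based on these examples, not something this summary states outright.

```python
# (before, after) scores in percent, copied from the findings above.
reported = {
    "SWE-Bench Verified, 7B: base vs ADP-tuned": (0.4, 20.2),
    "WebArena, 7B: base vs ADP-tuned": (4.5, 21.0),
    "SWE-Bench: SWE-smith only vs mixed ADP": (11.0, 16.6),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {
    name: round(after - before, 1)
    for name, (before, after) in reported.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The two base-versus-tuned gains come out at 19.8 and 16.5 points, in line with an average near 20 points, while the single-domain versus mixed-corpus comparison is a smaller gain of 5.6 points.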

5. Significance and Impact

Lowering Barriers: ADP drastically reduces the engineering cost of integrating new datasets into agent training pipelines, making large-scale SFT accessible to the broader research community.
Reproducibility and Standardization: By providing a unified schema and open-source converters, ADP enables fair comparison of different agent architectures and datasets.
Community Growth: The release of the 1.3M trajectory dataset and the protocol encourages the community to contribute new datasets, fostering a scalable ecosystem for agent research.
Future Directions: The authors suggest extending ADP to multimodal data (images, screen recordings) and applying similar standardization to evaluation protocols.

In conclusion, the paper argues that the bottleneck in agent research is not data scarcity but data fragmentation. The Agent Data Protocol successfully unifies these resources, demonstrating that standardized, diverse data is the key to unlocking robust, general-purpose AI agents.