AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation

Imagine you are walking into a massive, futuristic library. But instead of books, the shelves are filled with AI Agents—digital assistants designed to do specific jobs. Some are great at writing code, others at planning travel, and some are experts at analyzing medical data.

The problem? There are more than 107,000 of these agents, and they are all different. You have a specific request, like "Plan a surprise birthday party for my dog," but you don't know which agent to pick. If you pick the wrong one, it might just say "I can't do that" or give you a terrible plan.

This paper introduces AgentSelect, a new "smart librarian" system designed to solve this exact problem. Here is the breakdown in simple terms:

1. The Problem: The "Jungle" of Choices

Currently, if you want an AI to do a complex task, you have to be a tech expert. You have to manually pick a "brain" (a Large Language Model), choose a set of "tools" (like a calculator, a search engine, or a calendar), and tell them how to talk to each other. It's like trying to build a custom car by buying the engine, the tires, and the steering wheel separately and hoping they fit together.

Existing lists (leaderboards) tell you which "engines" are fast or which "tires" are durable, but they don't tell you which combination works best for your specific trip.

2. The Solution: AgentSelect (The Smart Matchmaker)

The researchers built a massive dataset called AgentSelect. Think of it as a giant training manual for a recommendation engine.

The Input: You type a natural sentence (a "narrative query"), like "I need to find a cheap flight to Tokyo and book a hotel."
The Output: The system ranks the available "Agent Configurations" (a brain plus a set of tools) and recommends the ones most likely to get the job done.

3. How They Built the Data (The "Recipe Book")

To teach the computer how to make these matches, they couldn't just ask humans to test every possible combination (there are too many!). Instead, they used a clever three-part strategy:

Part 1: The "Brain" Testers. They looked at existing tests where AI models answered questions. If a model was great at math, they noted it as a "positive match" for math-related queries.
Part 2: The "Tool" Testers. They looked at tests where AI had to use specific tools (like a calculator). They noted which tools were needed for which tasks.
Part 3: The "Simulated" Matches (The Magic Sauce). This is the most creative part. Since they didn't have real-world data for every combination, they synthesized it. They took representative queries, used small retrieval models trained on Parts 1 and 2 to pick likely brains and tools for each one, combined those picks into complete agents, and treated each combination as a "positive match" for training. It's like a cook who has learned separately which ingredients suit which dishes and then writes complete recipes from that knowledge, without taste-testing every one.

4. The Big Discovery: It's Not About Popularity

The researchers found something surprising. Classic recommendation systems (like those behind Netflix or Spotify) work well largely because many people interact with the same items. "Everyone watched this movie, so you probably will too."

But with AI Agents, popularity doesn't work.

The Old Way: "This agent is popular, so it must be good."
The New Reality: Most agents are "one-offs." You might need a very specific agent to "translate a legal document into Spanish and then summarize it." No one else has asked for that exact combo before.

The paper shows that the new system works by understanding the content, not just counting votes. It reads your request, understands the skills you need, and matches them to the agent's capabilities, even if that agent has never been used before.

5. Why This Matters

For Regular People: Soon, you won't need to be a tech wizard. You'll just talk to your computer, and it will automatically build the perfect mini-AI to solve your problem.
For Developers: They now have a standard "test track" to see if their new recommendation algorithms actually work, rather than guessing.
The Future: The researchers tested their approach on a real-world marketplace (MuleRun), where a model fine-tuned on AgentSelect outperformed the same model without fine-tuning. This suggests we are moving toward a future where AI agents are as easy to find and use as apps on your phone.

In a Nutshell

AgentSelect is the bridge between "I have a problem" and "Here is the right AI tool to fix it." It turns a chaotic jungle of 100,000+ AI options into a simple, smart recommendation, helping put the right tool in the right hands (or rather, the right chat window).

Here is a detailed technical summary of the paper "AgentSelect: Benchmark for Narrative Query-to-Agent Recommendation".

1. Problem Definition

The paper addresses a critical gap in the rapidly expanding ecosystem of Large Language Model (LLM) agents. While frameworks exist to build agents and leaderboards exist to evaluate individual components (LLMs or tools), there is no principled method to recommend a complete, deployable agent configuration based on a user's natural language (narrative) query.

The Challenge: The design space for agents is exploding. An agent is defined by a backbone LLM ($M$) and a set of external tools ($T$). Practitioners face a "jungle of configurations" where selecting the right combination for a specific task is difficult.
The Limitation of Existing Work: Current benchmarks evaluate components in isolation (e.g., "How well does Model X do on Math?" or "How well does Tool Y work?"). They lack query-conditioned supervision for learning to rank end-to-end compositional configurations. Furthermore, existing evaluation artifacts are heterogeneous (different metrics, tasks, and formats), making them difficult to unify for training a recommender system.
The Goal: Formulate the task of Narrative Query-to-Agent Recommendation: Given a free-form natural language query $Q$ and a catalog of candidate agents $A = \{(M, T)\}$, rank the agents by their expected utility to solve the query.
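
To make the task statement concrete, here is a minimal Python sketch of the ranking interface. The `Agent` class, the `rank_agents` function, and the `toy_score` scorer are hypothetical names introduced for illustration; the scorer is a trivial stand-in for a learned utility estimate, not the paper's model.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Agent:
    """A candidate agent (M, T): a backbone model plus a set of tools."""
    backbone: Optional[str]   # M; None when the backbone is only a placeholder
    tools: FrozenSet[str]     # T; empty for an LLM-only agent


# A scorer estimates the utility of an agent for a query.
Scorer = Callable[[str, Agent], float]


def rank_agents(query: str, catalog: List[Agent], score: Scorer) -> List[Agent]:
    """Return the catalog sorted by descending estimated utility."""
    return sorted(catalog, key=lambda agent: score(query, agent), reverse=True)


def toy_score(query: str, agent: Agent) -> float:
    """Hypothetical stand-in: count tools whose name appears in the query."""
    text = query.lower()
    return float(sum(1 for tool in agent.tools if tool in text))


catalog = [
    Agent("model-a", frozenset()),
    Agent("model-b", frozenset({"flight", "hotel"})),
    Agent(None, frozenset({"calculator"})),
]
ranked = rank_agents(
    "find a cheap flight to Tokyo and book a hotel", catalog, toy_score
)
```

In this toy catalog the agent carrying the `flight` and `hotel` tools is ranked first; a real recommender would replace `toy_score` with a trained model.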

2. Methodology: The AGENTSELECT Benchmark

The authors introduce AGENTSELECT, a unified benchmark and dataset that reframes agent selection as a recommendation problem.

A. Capability Profile Design

Instead of abstract labels, agents are represented as Capability Profiles $(M, T)$:

$M$ (Backbone): The specific LLM model.
$T$ (Toolkit): The set of external tools (APIs, functions) the agent can invoke.
Representation: Each agent is stored as an executable YAML configuration, making the recommendation actionable (i.e., the system can directly instantiate the agent).
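
As an illustration of what such a configuration could look like, the sketch below renders a capability profile as YAML text. The schema (`id`, `backbone`, `tools`) and all values are assumptions made for this example; the benchmark's actual YAML fields are not given in this summary. The renderer also assumes plain names that need no YAML quoting.

```python
from typing import List, Optional


def profile_to_yaml(agent_id: str, backbone: Optional[str], tools: List[str]) -> str:
    """Render a capability profile as a small YAML document.

    Hypothetical schema; assumes names contain no characters that
    would require YAML quoting.
    """
    lines = [f"id: {agent_id}"]
    # YAML's null marks a missing (placeholder) backbone.
    lines.append(f"backbone: {backbone if backbone is not None else 'null'}")
    if tools:
        lines.append("tools:")
        lines.extend(f"  - {name}" for name in tools)
    else:
        lines.append("tools: []")
    return "\n".join(lines) + "\n"


print(profile_to_yaml("agent-0001", "example-llm", ["flight_search", "hotel_booking"]))
```

A toolkit-only agent would be written with `backbone: null`, and an LLM-only agent with `tools: []`.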

B. Data Construction (Three Parts)

The benchmark aggregates 111,179 queries and 107,721 agents from 40+ sources, structured into three parts to cover different supervision regimes:

Part I: LLM-only Agents (Dense Reuse)
- Source: Open LLM Leaderboards (e.g., MMLU, BBH).
- Signal: Uses per-query evaluation scores from leaderboards. Top-ranked models for a query are treated as positive interactions.
- Characteristics: High agent reuse (dense head), providing strong signals for popular models but limited diversity.
Part II: Toolkit-only Agents (Tool Adequacy)
- Source: Tool-use benchmarks (e.g., ToolBench, APIBank).
- Signal: Extracts the "gold" toolkit required for a task, treating the backbone as a null placeholder.
- Characteristics: Isolates the contribution of tool selection, independent of the LLM.
Part III: Compositional Agents (Long-tail Synthesis)
- Source: Synthesized interactions.
- Method: A three-stage pipeline:
  1. Select prototypical queries from Parts I & II.
  2. Retrieve relevant backbone models and tools using lightweight retrievers trained on Parts I/II.
  3. Compose $(M, T)$ configurations.
- Signal: These synthesized configurations are treated as pseudo-positives (implicit feedback). Other compositions for the same query are treated as negatives.
- Characteristics: Extremely sparse, long-tail distribution, simulating a realistic marketplace where most agents are rarely selected.
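
The Part III pipeline can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the token-overlap `retrieve` function stands in for the lightweight retrievers trained on Parts I/II, stage 1 (query selection) is assumed to have already produced `prototype_queries`, and the choice of one backbone plus two tools per query is an assumption. All model and tool names are invented.

```python
from typing import Dict, FrozenSet, List, Tuple

Agent = Tuple[str, FrozenSet[str]]  # (backbone M, toolkit T)


def tokens(text: str) -> set:
    return set(text.lower().split())


def retrieve(query: str, descriptions: Dict[str, str], k: int) -> List[str]:
    """Toy retriever: top-k names by token overlap with the query."""
    q = tokens(query)
    # Sort by descending overlap; break ties alphabetically for determinism.
    ranked = sorted(
        descriptions, key=lambda name: (-len(q & tokens(descriptions[name])), name)
    )
    return ranked[:k]


def compose(query: str, backbones: Dict[str, str], tools: Dict[str, str],
            k_tools: int = 2) -> Agent:
    """Stages 2 and 3: retrieve one backbone and k tools, then pair them."""
    backbone = retrieve(query, backbones, 1)[0]
    toolkit = frozenset(retrieve(query, tools, k_tools))
    return backbone, toolkit


def synthesize(prototype_queries: List[str], backbones: Dict[str, str],
               tools: Dict[str, str]) -> List[Tuple[str, Agent]]:
    """Emit one pseudo-positive (query, agent) pair per prototypical query."""
    return [(q, compose(q, backbones, tools)) for q in prototype_queries]


backbones = {
    "model-math": "solves math word problems",
    "model-travel": "plans trips and can book a flight or hotel",
}
tools = {
    "calculator": "evaluate arithmetic expressions",
    "flight_search": "search cheap flight options",
    "hotel_booking": "book a hotel room",
}
pairs = synthesize(["find a cheap flight and book a hotel"], backbones, tools)
```

Each emitted pair plays the role of a pseudo-positive interaction; other compositions for the same query would serve as negatives during training.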

C. Learning Framework

The benchmark supports training Two-Tower architectures and other recommendation models (MF, LightFM, GNNs, Generative Recs) to map narrative queries to capability profiles. The supervision is positive-only, mimicking implicit feedback scenarios.
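
A minimal sketch of two-tower scoring with positive-only supervision is shown below. Two points are assumptions rather than details from the paper: the towers here are fixed bag-of-words encoders standing in for learned text encoders, and the loss uses the other agents in the batch as negatives (a common choice for implicit feedback, not confirmed as the benchmark's objective).

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Vector = Dict[str, float]  # sparse, L2-normalised bag-of-words vector


def _bow(text: str) -> Vector:
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def encode_query(query: str) -> Vector:
    """Query tower: embeds the narrative query."""
    return _bow(query)


def encode_agent(backbone_desc: str, tool_descs: Sequence[str]) -> Vector:
    """Agent tower: embeds the text of the capability profile (M, T)."""
    return _bow(" ".join([backbone_desc, *tool_descs]))


def dot(u: Vector, v: Vector) -> float:
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def in_batch_softmax_loss(queries: List[Vector], agents: List[Vector],
                          temperature: float = 0.1) -> float:
    """Mean cross-entropy where agents[i] is the positive for queries[i]
    and every other agent in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, a) / temperature for a in agents]
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


queries = [encode_query("book a flight and a hotel"),
           encode_query("solve this integral")]
agents = [encode_agent("general chat model", ["flight search", "hotel booking"]),
          encode_agent("math tuned model", ["integral calculator"])]

aligned_loss = in_batch_softmax_loss(queries, agents)
shuffled_loss = in_batch_softmax_loss(queries, agents[::-1])
```

The loss is lower when each query is paired with its matching agent than when the pairs are swapped, which is the signal a trainable pair of towers would be optimised on.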

3. Key Contributions

First Unified Benchmark: AGENTSELECT is the first framework to standardize heterogeneous evaluation artifacts (leaderboards, tool benchmarks) into a single, query-conditioned, positive-only interaction dataset for agent recommendation.
Regime Shift Discovery: The analysis reveals a shift from dense head reuse (Part I) to long-tail, near one-off supervision (Parts II & III).
- Implication: Traditional Collaborative Filtering (CF) and Graph Neural Networks (GNNs) that rely on ID-based reuse fail in the long-tail regime.
- Solution: Content-aware capability matching (using text descriptions of models and tools) becomes essential.
Synthesized Supervision Validation: The paper validates that synthesized pseudo-positives in Part III are learnable. Models trained on Part III show sensitivity to counterfactual edits (e.g., removing a key tool lowers the score) and transfer effectively to real-world marketplaces.
Actionable Infrastructure: The benchmark provides not just data, but a reproducible infrastructure (YAML specs) that allows recommended agents to be instantly deployed in frameworks like Agno or LangGraph.
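
The counterfactual-edit check mentioned above can be illustrated with a short sketch: score an agent, remove one tool, and score it again. The `coverage_score` function is a hypothetical stand-in for a trained recommender (it only measures how many query words the agent's text covers), and the tool names and descriptions are invented.

```python
from typing import Dict


def coverage_score(query: str, backbone_desc: str, tools: Dict[str, str]) -> float:
    """Stand-in scorer: fraction of query tokens found in the agent's text."""
    q = set(query.lower().split())
    agent_text = " ".join([backbone_desc, *tools.values()])
    covered = q & set(agent_text.lower().split())
    return len(covered) / len(q) if q else 0.0


def score_drop(query: str, backbone_desc: str, tools: Dict[str, str],
               removed_tool: str) -> float:
    """Score of the full agent minus the score after removing one tool."""
    edited = {name: desc for name, desc in tools.items() if name != removed_tool}
    return (coverage_score(query, backbone_desc, tools)
            - coverage_score(query, backbone_desc, edited))


query = "translate a legal document into spanish and summarize it"
tools = {
    "translator": "translate text into spanish",
    "summarizer": "summarize a long document",
    "calculator": "evaluate arithmetic expressions",
}
drop_key_tool = score_drop(query, "general purpose model", tools, "translator")
drop_irrelevant = score_drop(query, "general purpose model", tools, "calculator")
```

A scorer that is sensitive to capabilities should show a positive drop when a tool the query depends on is removed and little or no drop when an irrelevant tool is removed.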

4. Experimental Results

The authors evaluated various baselines (MF, LightFM, NCF, Two-Tower, GNNs, Generative Recs) across the three parts.

Performance on Long-Tail (Parts II/III):
- ID-based methods (CF/GNN): Perform poorly on Parts II and III due to the lack of repeated agent IDs (sparse data).
- Content-aware methods (Two-Tower): Significantly outperform ID-based methods. They directly align narrative intent with textual capability descriptions.
- Embedding Quality: Zero-shot embeddings perform poorly. In-domain fine-tuning of embedding models (e.g., BGE-M3, KaLM) is crucial, yielding large gains in ranking quality (nDCG) on sparse, long-tail data.
Modality Attribution:
- Removing discrete IDs (Model/Tool IDs) and relying solely on text content results in only a marginal performance drop, suggesting that the model learns capability matching rather than memorizing IDs.
- Tool content is found to be more informative than backbone LLM content for ranking.
Real-World Transfer:
- Models fine-tuned on AGENTSELECT (specifically EasyRec*) were tested on the MuleRun public agent marketplace.
- Result: The fine-tuned model consistently outperformed the base model in hit rates (Top@1/5/10) and ranking metrics (nDCG/MRR) on an unseen catalog, demonstrating practical utility.
End-to-End Validation:
- Recommended agents were deployed and executed in a simulated runtime (Agno + MIRRORAPI). The recommender's ranking correlated significantly with actual end-to-end task success, supporting the claim that the recommendations are not just textually plausible but functionally effective.
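
For reference, the ranking metrics named above (hit rate at k, MRR, nDCG) can be computed as in the following sketch. It assumes binary relevance, where each query has a set of positive agents; the summary does not state the exact metric variants used in the paper, and the agent IDs are invented.

```python
import math
from typing import List, Set


def hit_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant agent appears in the top k, else 0.0."""
    return 1.0 if any(agent in relevant for agent in ranked[:k]) else 0.0


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant agent (0.0 if none is ranked).
    Averaging this over queries gives MRR."""
    for position, agent in enumerate(ranked, start=1):
        if agent in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, agent in enumerate(ranked[:k], start=1)
        if agent in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["agent-3", "agent-1", "agent-2"]   # recommender output, best first
relevant = {"agent-1"}                       # positives for this query

hit1 = hit_at_k(ranked, relevant, 1)
hit5 = hit_at_k(ranked, relevant, 5)
rr = reciprocal_rank(ranked, relevant)
ndcg10 = ndcg_at_k(ranked, relevant, 10)
```

Here the only positive sits at rank 2, so it misses the top-1 cutoff but counts as a hit at larger cutoffs, and both the reciprocal rank and nDCG are discounted accordingly.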

5. Significance and Impact

Bridging the "Last Mile": The work moves the field from "building agents" to "selecting the right agent," enabling zero-code, on-demand agent creation for non-experts.
New Paradigm for Evaluation: It shifts the focus from isolated component benchmarking to system-level capability matching, providing a prescriptive guide for which agent to use for a specific task.
Foundation for Ecosystems: By providing a unified data infrastructure, AGENTSELECT accelerates the development of agent marketplaces, routers, and tool retrievers, supporting the transition toward a democratized, automated agent ecosystem.
Open Resources: The dataset, code, and a live demo for generating runtime configurations are publicly available, fostering reproducibility and further research.

In summary, AgentSelect establishes that while traditional recommendation methods fail in the sparse, long-tail world of agent configurations, content-aware, capability-matching models trained on unified, synthesized interaction data can effectively recommend deployable agents, bridging the gap between user intent and automated execution.