MASEval: Extending Multi-Agent Evaluation from Models to Systems

MASEval is a framework-agnostic library that shifts multi-agent evaluation from a model-centric to a system-centric approach. Extensive experiments show that implementation decisions about topology and orchestration affect performance as much as model selection does.

Cornelius Emde, Alexander Rubinstein, Anmol Goel, Ahmed Heakl, Sangdoo Yun, Seong Joon Oh, Martin Gubri

Published Wed, 11 Ma

Imagine you're trying to build the ultimate delivery service for a city. You have a fleet of incredibly smart drivers (the AI models) who know how to navigate, read maps, and talk to customers.

For a long time, researchers and companies have been obsessed with a single question: "Which driver is the fastest?" They would test Driver A, Driver B, and Driver C on the same route and declare a winner.

The Problem:
This paper argues that this approach is missing the biggest part of the puzzle. It's like saying, "Driver A is the best," when in reality, Driver A was driving a sleek sports car with a GPS, while Driver B was stuck in a broken-down wagon with no map.

The "car" and the "GPS" are the frameworks (like LangGraph, smolagents, etc.). The paper says that the type of vehicle you put the driver in matters just as much as the driver's own talent. If you pick a bad vehicle, even the best driver will fail.

Enter MASEval: The "Universal Test Track"

The authors built a new tool called MASEval. Think of it as a universal test track that doesn't care what kind of car or driver you have.

Here is how it works, using simple analogies:

1. The "Bring Your Own" Garage

Most testing tools are like a specific car dealership that only lets you test their brand of cars. If you want to test a Ford, you can't; you have to buy a Toyota first.

  • MASEval is like a massive, neutral garage. You can bring your own driver (AI model), your own car (framework), and your own route (benchmark).
  • It puts them all on the same track and measures the entire system, not just the driver.

2. The "Black Box" vs. The "Transparent Cockpit"

Old testing tools were like black boxes. They would give you a result: "Driver A finished the race in 10 minutes." But they didn't tell you why. Did the driver get lost? Did the car break down? Did the GPS give bad directions?

  • MASEval is like a transparent cockpit with a video camera on every part of the car.
  • It records every single conversation between the driver and the passenger, every time the engine sputters, and every wrong turn. This lets researchers see exactly where the system failed: Was it the driver's fault, or was the car's engine (the framework) the problem?
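The "transparent cockpit" idea boils down to recording every event with its source, so a failure can be attributed afterwards. Here is a minimal sketch of that idea in Python. This is purely illustrative: `Trace`, `record`, and `failures_by` are made-up names, not MASEval's actual API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of trace recording -- NOT MASEval's real API.
# The idea: log every event (message, tool call, error) tagged with its
# source, so a failure can be traced to the model OR the framework.

@dataclass
class Trace:
    events: list = field(default_factory=list)

    def record(self, source: str, kind: str, detail: str) -> None:
        """Log one event from either the 'model' or the 'framework'."""
        self.events.append({"source": source, "kind": kind, "detail": detail})

    def failures_by(self, source: str) -> list:
        """Return only the error events attributed to a given source."""
        return [e for e in self.events
                if e["source"] == source and e["kind"] == "error"]

trace = Trace()
trace.record("model", "message", "Plan: search the map, then route.")
trace.record("framework", "error", "Re-sent the same tool request twice.")

# Was it the driver (model) or the car (framework)?
print(len(trace.failures_by("framework")))  # one framework-level error logged
```

With a log like this, "Driver A finished in 10 minutes" becomes "Driver A finished in 10 minutes, and here is every wrong turn along the way."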

3. The Big Discovery: The Car Matters More Than You Think

The authors ran a massive experiment. They took three different "drivers" (AI models) and three different "cars" (frameworks) and mixed them up in every possible combination.

The Shocking Result:
They found that choosing the right car (framework) was just as important as choosing the right driver (model).

  • In some cases, a "good" driver driving a "bad" car performed worse than a "mediocre" driver in a "great" car.
  • One specific example: A top-tier driver got stuck in a loop because the car's GPS kept asking for the same thing over and over. The driver wasn't stupid; the car's instructions were confusing!
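The experiment above is a full cross product: every driver in every car. A minimal harness loop for that design might look like the sketch below. Everything here is hypothetical (`run_task` returns toy deterministic scores); a real harness would launch each multi-agent system on a benchmark task and grade its output.

```python
from itertools import product

# Hypothetical sketch of the cross-product experiment -- NOT MASEval's
# real API. Three "drivers" (models) x three "cars" (frameworks).
MODELS = ["model_a", "model_b", "model_c"]
FRAMEWORKS = ["framework_x", "framework_y", "framework_z"]

def run_task(model: str, framework: str) -> float:
    """Toy deterministic score for illustration only."""
    return ((MODELS.index(model) + 1) * (FRAMEWORKS.index(framework) + 1)) / 10

def evaluate_grid() -> dict:
    """Score every (model, framework) pairing on the same 'track'."""
    return {(m, f): run_task(m, f) for m, f in product(MODELS, FRAMEWORKS)}

results = evaluate_grid()
# The winner is a *pairing*, not a model: the same model can rank
# differently depending on which framework it runs inside.
best_pair = max(results, key=results.get)
print(best_pair, results[best_pair])
```

The point of the grid is that you cannot read off "the best model" from a single column: you have to compare whole (model, framework) systems.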

Why Should You Care?

For the "Practitioners" (The Business Owners):
If you are building an AI system for your company, this paper tells you: Don't just shop for the most expensive AI model. You need to shop for the best system to run it in. Picking the wrong framework could waste your money and ruin your results, even if you have the smartest AI available.

For the "Researchers" (The Mechanics):
It gives them a way to stop guessing. Instead of saying "Model X is better," they can finally say, "Model X works great in Framework A, but fails in Framework B because of how they handle errors." This helps them build better, more reliable AI systems.

The Bottom Line

MASEval is a new tool that stops us from judging a fish by its ability to climb a tree. It realizes that in the world of AI agents, the team (the system) is more important than the star player (the model).

It's an open-source toolkit (free to use) that helps everyone build better, more reliable AI teams by testing the whole team, not just the captain.