NetArena: Dynamic Benchmarks for AI Agents in Network Automation

Imagine you are hiring a new, super-smart robot assistant to manage the traffic system of a massive, complex city. You wouldn't just ask it to solve a single math problem on a piece of paper; you'd want to see how it handles real traffic jams and unexpected road closures, and whether it accidentally causes a pile-up while trying to fix a detour.

This is the problem NETARENA tackles, except that instead of city traffic it deals with computer networks (the internet, data centers, cloud servers).

Here is the story of the paper, broken down into simple concepts:

1. The Problem: The "Practice Exam" Trap

Right now, when we test AI agents (robots that can think and act) on network tasks, we use static benchmarks. Think of this like giving a student the exact same 50 math problems every time they take a test.

The Issue: If the student memorizes the answers to those 50 problems, they get an "A," but they can't actually do math. They haven't learned how to think.
The Risk: In the real world, networks are messy. A robot that memorized answers might try to "fix" a network by unplugging the wrong server, causing a massive outage. Existing tests are too small, too predictable, and don't catch these dangerous mistakes.

2. The Solution: NETARENA (The "Infinite Simulator")

The authors built NETARENA, which is like a video game simulator for network engineers. Instead of a fixed list of questions, it's a dynamic playground that can generate an effectively unlimited supply of unique scenarios on the fly.

The Analogy: Imagine a driving school.
- Old Way: You practice on a track with the same cones in the same spots every day.
- NETARENA Way: You are dropped into a virtual city where the traffic lights change, pedestrians appear randomly, and roads close unexpectedly. The simulator creates a new, unique driving test every single time you start the engine.

3. How It Works: The "State and Action" Game

NETARENA treats network tasks like a game of chess or Sudoku, but with real consequences.

The Board (State): The current health of the network (is the internet working? is a server down?).
The Moves (Actions): The commands the AI can type (e.g., "Add a new cable," "Change a setting," "Restart a router").
The Goal: The AI has to move from a "Broken Board" to a "Working Board."

NETARENA has two main modes:

Constructive (The Architect): "Here is a blueprint for a new city. Build it perfectly." The AI has to design a solution.
Reactive (The Detective): "Oh no, the power went out in Sector 7! Find out why and fix it." The AI has to diagnose a mystery and repair it.

4. The Three Rules of the Game

NETARENA doesn't just check if the AI got the answer right; it checks three things:

Correctness: Did the network start working again? (The "Did you win?" check).
Safety: Did the AI break anything else while trying to fix it? (The "Did you crash the car?" check). This is crucial. An AI that fixes a problem but takes down the whole internet is a failure.
Latency: How fast did it happen? (The "How long did the traffic jam last?" check).

5. What They Discovered (The Plot Twist)

When they ran their "Infinite Simulator" tests on top AI models (like GPT-4o), the results were shocking:

The "Memorization" Myth: On small, old tests, AI looked smart. On NETARENA's massive, random tests, the AI's performance dropped drastically (often below 40% success). The authors concluded that the AI was mostly guessing or memorizing, not truly understanding.
The "Safety" Trap: Some AIs were great at fixing the problem but did it in a way that was dangerous (like using a sledgehammer to fix a watch). Others were so scared of breaking things that they did nothing at all.
The "Overfitting" Lesson: When they taught the AI to solve specific types of problems (training it), it got really good at those specific problems but failed miserably when the scenario changed slightly. It was like a student who only studied for the history test but failed the geography test.

6. Why This Matters

NETARENA is a stress test for the future of AI.

For Developers: It's a tool to train AI to be safer and more reliable before letting them touch real networks.
For the World: As we let AI manage our power grids, internet, and hospitals, we need to know they won't accidentally turn off the lights. NETARENA provides the "flight simulator" to ensure they are ready for the real world.

In a nutshell: NETARENA stops us from tricking AI with easy, memorized tests and forces it to prove it can handle the messy, unpredictable, and high-stakes reality of managing our digital infrastructure.

1. Problem Statement

As Large Language Models (LLMs) evolve into autonomous agents for high-stakes domains like network system operations, evaluating their real-world reliability has become critical. However, existing benchmarks suffer from three major limitations:

Static Design & Data Contamination: Current benchmarks rely on manually curated, static datasets (often <300 queries). This makes them prone to data contamination (models memorizing answers) and statistical bias.
Lack of Complexity & Generalizability: Static datasets fail to capture the heterogeneity of production environments. An agent succeeding on one topology may fail when conditions shift. They also struggle to surface rare but critical edge cases.
Insufficient Evaluation Metrics: Existing evaluations often focus solely on "correctness" (output matching ground truth). In network automation, agents must also adhere to safety (avoiding service disruption) and latency (efficiency), which static benchmarks miss.

2. Methodology: The NETARENA Framework

NETARENA is a dynamic benchmark generation framework designed to evaluate AI agents in interactive, executable network environments.

A. Unified State-Action Abstraction

NETARENA abstracts diverse network tasks into a Finite State Transition System $(S, A, E)$:

State ($S$): The current network/system topology (e.g., routing tables, device configurations).
Action ($A$): Atomic operations (e.g., add_switch, ping, update_route).
Execution ($E$): The function that transitions the system state based on actions.
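
The abstraction can be illustrated with a minimal Python sketch. It is not NETARENA's implementation: the state is assumed to be a plain set of links, and the action names add_link and remove_link, as well as the helpers execute and run, are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

# Assumed toy state S: the set of undirected links currently present.
Link = Tuple[str, str]
State = FrozenSet[Link]


@dataclass(frozen=True)
class Action:
    """One atomic operation from A (names are illustrative only)."""
    name: str
    link: Link


def execute(state: State, action: Action) -> State:
    """E: apply a single action and return the resulting state."""
    link = tuple(sorted(action.link))  # store undirected links in one canonical order
    if action.name == "add_link":
        return state | {link}
    if action.name == "remove_link":
        return state - {link}
    raise ValueError(f"unknown action: {action.name}")


def run(state: State, actions: Iterable[Action]) -> State:
    """Apply a sequence of actions, threading the state through E."""
    for action in actions:
        state = execute(state, action)
    return state


s0: State = frozenset({("r1", "r2")})
plan = [Action("add_link", ("r3", "r2")), Action("remove_link", ("r2", "r1"))]
final_state = run(s0, plan)
print(sorted(final_state))  # [('r2', 'r3')]
```

Keeping the state immutable means every intermediate state can be retained and inspected later, which is convenient when checks are applied to more than just the final state.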

This abstraction supports two task types:

Constructive Tasks: The agent generates a sequence of actions to transition a system from an initial state $s_0$ to a target state $s_T$ (e.g., capacity planning). Ground truth is a deterministic action sequence.
Reactive Tasks: The system starts in a faulty state $s_{faulty}$ (induced by hidden error injection). The agent must diagnose and restore the system to the healthy state $s_0$. Ground truth is the restoration of the state, not a specific action path.
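
The difference between the two ground truths can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's grader: states are assumed to be sets of route strings, actions are hypothetical strings of the form "+route" or "-route", and the constructive check is assumed to compare the end state of the agent's actions with the end state of the reference sequence.

```python
from typing import FrozenSet, List

State = FrozenSet[str]  # assumed toy state: the set of installed routes


def execute(state: State, action: str) -> State:
    """Apply a hypothetical '+route' (install) or '-route' (withdraw) action."""
    op, route = action[0], action[1:]
    if op == "+":
        return state | {route}
    if op == "-":
        return state - {route}
    raise ValueError(f"unknown action: {action}")


def replay(state: State, actions: List[str]) -> State:
    for action in actions:
        state = execute(state, action)
    return state


def grade_constructive(s0: State, agent: List[str], reference: List[str]) -> bool:
    """Constructive: the reference action sequence defines the target state s_T."""
    return replay(s0, agent) == replay(s0, reference)


def grade_reactive(s_faulty: State, agent: List[str], s_healthy: State) -> bool:
    """Reactive: any action path is accepted if it restores the healthy state."""
    return replay(s_faulty, agent) == s_healthy


# Constructive example: the agent reorders the reference steps but reaches s_T.
s0 = frozenset({"a->b"})
reference_plan = ["+b->c", "+c->d"]
agent_plan = ["+c->d", "+b->c"]
print(grade_constructive(s0, agent_plan, reference_plan))  # True

# Reactive example: a hidden fault removed one route from the healthy state.
s_healthy = frozenset({"a->b", "b->c"})
s_faulty = s_healthy - {"b->c"}
print(grade_reactive(s_faulty, ["+b->c"], s_healthy))  # True
print(grade_reactive(s_faulty, ["-a->b"], s_healthy))  # False
```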

B. Dynamic Generation & Emulator Integration

Dynamic Query Generation: Instead of static datasets, NETARENA uses stochastic sampling to generate unlimited, diverse queries on demand. Users specify high-level parameters (e.g., complexity, topology size), and the system synthesizes unique instances.
High-Fidelity Emulators: NETARENA integrates with real-world emulators (Mininet for routing, Kubernetes for microservices, and custom simulators for datacenter topologies).
Automated Verification: Agents execute actions directly in the emulator. The system automatically verifies:
- Correctness: Does the final state match the target?
- Safety: Did any intermediate action violate constraints (e.g., breaking existing links, unauthorized changes)?
- Latency: How many iterations/commands were required to solve the task?
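
The three checks above can be sketched as one evaluation loop. The code below is a hedged illustration rather than NETARENA's verifier: the state model, the Report type, the action strings, and the safety rule (a protected link named "core" must stay up after every step) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

State = FrozenSet[str]  # assumed toy state: names of links that are currently up


@dataclass
class Report:
    correct: bool  # final state equals the target
    safe: bool     # no intermediate state violated the constraint
    latency: int   # number of actions the agent executed


def execute(state: State, action: str) -> State:
    """Hypothetical actions: '+name' brings a link up, '-name' takes it down."""
    op, name = action[0], action[1:]
    if op == "+":
        return state | {name}
    if op == "-":
        return state - {name}
    raise ValueError(f"unknown action: {action}")


def evaluate(start: State, target: State, actions: List[str],
             is_safe: Callable[[State], bool]) -> Report:
    state, safe, steps = start, True, 0
    for action in actions:
        state = execute(state, action)
        steps += 1
        if not is_safe(state):
            safe = False  # record the violation but keep executing the plan
    return Report(correct=(state == target), safe=safe, latency=steps)


def core_stays_up(state: State) -> bool:
    """Assumed safety constraint for this example."""
    return "core" in state


start = frozenset({"core"})
target = frozenset({"core", "edge"})

careful = evaluate(start, target, ["+edge"], core_stays_up)
reckless = evaluate(start, target, ["-core", "+core", "+edge"], core_stays_up)
print(careful)   # Report(correct=True, safe=True, latency=1)
print(reckless)  # Report(correct=True, safe=False, latency=3)
```

Both plans reach the target, so a correctness-only check would score them identically; the safety and latency fields are what separate them.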

3. Key Contributions

Unified Interface: A novel abstraction allowing dynamic benchmarking across heterogeneous network workloads (planning, routing, policy troubleshooting).
Execution-Based Evaluation: Evaluation moves from text matching to execution in emulators, which makes it possible to measure safety and latency alongside correctness.
Scalability & Anti-Contamination: The framework enables the generation of massive datasets (e.g., >9,000 queries), reducing confidence interval overlap between agents from 85% to 0% and virtually eliminating data contamination risks.
Fine-Grained Analysis: The ability to control task complexity (Level 1–3) allows for detailed analysis of model generalization, overfitting, and failure modes.
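
Seeded stochastic generation with a complexity knob can be sketched as follows. This is only an illustration of the idea: the Query type, the mapping from level to topology size, and the choice of one injected fault per level are assumptions, not details from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

Link = Tuple[int, int]


@dataclass(frozen=True)
class Query:
    level: int                         # complexity level, 1 to 3
    nodes: int                         # number of devices in the topology
    links: Tuple[Link, ...]            # a random tree, so the topology is connected
    injected_faults: Tuple[Link, ...]  # links that the hidden fault takes down


def generate_query(level: int, rng: random.Random) -> Query:
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    # Assumed scaling: higher levels give larger topologies and more faults.
    nodes = rng.randint(4 * level, 8 * level)
    # Attach each new node to a random earlier node to form a spanning tree.
    links = [(rng.randrange(i), i) for i in range(1, nodes)]
    faults = rng.sample(links, k=level)
    return Query(level, nodes, tuple(links), tuple(faults))


def generate_benchmark(size: int, seed: int) -> List[Query]:
    """Sample `size` queries; the same seed reproduces the same benchmark."""
    rng = random.Random(seed)
    return [generate_query(rng.choice((1, 2, 3)), rng) for _ in range(size)]


benchmark = generate_benchmark(size=1000, seed=0)
print(len(benchmark), benchmark[0].level, benchmark[0].nodes)
```

Using a fresh seed for each evaluation run yields instances that were never published, while a fixed seed keeps a given run reproducible.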

4. Experimental Results

The authors evaluated five agents (based on GPT-4o and Qwen-72B) across three tasks: Datacenter Capacity Planning, Routing Misconfiguration, and Microservice Policy Troubleshooting.

Low Agent Performance: Current agents perform poorly on realistic, large-scale queries. Average correctness across tasks is only 24%, with the best agents staying below 60%.
Statistical Reliability: Small benchmarks (<200 queries) show high variance and overlapping confidence intervals. NETARENA's large-scale generation (e.g., 5,000 queries) eliminates overlap, providing statistically significant distinctions between models.
Safety vs. Correctness Trade-offs:
- Some models produce correct answers but violate safety constraints (e.g., deleting running pods).
- Others are overly conservative, failing to resolve issues within acceptable latency.
- NETARENA exposes these trade-offs, which static benchmarks miss.
Supervised Fine-Tuning (SFT) Insights:
- Overfitting: Models trained on specific difficulty levels fail to generalize to others. Only models trained on mixed-difficulty data generalize well.
- Safety Transfer: Models trained on simpler tasks often generalize safety constraints better than models trained on complex tasks, suggesting that safety behavior transfers more robustly than complex reasoning.
RL Fine-Tuning: Preliminary experiments using Reinforcement Learning (GRPO) in the Mininet environment showed that agents could learn from environment feedback, improving from random command generation to valid diagnostic sequences.
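
The statistical-reliability point above, that small benchmarks leave confidence intervals overlapping, can be checked with a short calculation. The sketch assumes a normal-approximation 95% interval for a success rate and two hypothetical agents with 24% and 30% correctness; these accuracies are illustrative numbers, not results from the paper, and the paper may compute its intervals differently.

```python
import math
from typing import Tuple

Interval = Tuple[float, float]


def ci95(successes: int, n: int) -> Interval:
    """Normal-approximation 95% confidence interval for a success rate."""
    p = successes / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return (p - half_width, p + half_width)


def overlap(a: Interval, b: Interval) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical agents A (24% correct) and B (30% correct) at two benchmark sizes.
small_overlap = overlap(ci95(48, 200), ci95(60, 200))
large_overlap = overlap(ci95(1200, 5000), ci95(1500, 5000))
print(small_overlap)  # True: 200 queries cannot separate the two agents
print(large_overlap)  # False: 5,000 queries can
```

The interval half-width shrinks in proportion to the inverse square root of the number of queries, so going from 200 to 5,000 queries narrows each interval by a factor of five.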

5. Significance and Future Impact

Rigorous Evaluation: NETARENA establishes a new standard for evaluating AI agents in safety-critical infrastructure, moving beyond "can it answer?" to "can it act safely and efficiently?"
Development Tool: It serves as a platform for SFT and RL fine-tuning, generating the large-scale, labeled data required to train agents for network operations.
Adversarial Testing: The framework can generate targeted adversarial examples to probe model weaknesses, helping developers identify failure modes before real-world deployment.
Sim-to-Real Bridge: Although a gap remains between emulators and production networks, NETARENA captures the core causal and structural properties of network operations, providing a necessary stress test before live deployment.

In summary, NETARENA addresses the critical gap in evaluating AI agents for network automation by providing a dynamic, scalable, and execution-based framework that reveals the true limitations of current models in high-stakes, complex environments.