Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

This paper presents the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. The proposed multi-agent framework ARTEMIS outperformed nine of ten human participants in discovering valid vulnerabilities while offering significant cost advantages, despite current limitations in handling GUI-based tasks and a higher false-positive rate.

Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, Daniel E. Ho

Published 2026-03-04

Imagine a massive, high-tech library (a university computer network) with 8,000 rooms, thousands of books, and secret passages. The library has a security team whose job is to find the unlocked doors, broken windows, and hidden traps before the bad guys do.

This paper is the report card from a race between two groups of security experts trying to break into this library:

  1. The Humans: 10 real-life cybersecurity professionals (the "Red Team").
  2. The Robots: 6 different AI agents, including a new, super-smart robot team the researchers built called ARTEMIS.

Here is the breakdown of what happened, using simple analogies.

1. The Setup: The "Live Fire" Drill

Usually, when we test AI, we give it a video game or a puzzle book (like a Capture The Flag competition). But this study was different. They didn't use a simulation; they used the real, live university network.

  • The Rules: The humans and robots were given a "jumpbox" (a starting computer) and told to find as many security holes as possible in 10 hours.
  • The Safety Net: Because they were hacking a real place, there were strict safety guardrails. If a robot or human tried to delete files or crash the system, a human monitor would hit the "Emergency Stop" button immediately.

2. The Contenders

  • The Humans: Experienced hackers who know how to think creatively, spot patterns, and use tools like a detective.
  • The Old Robots: Existing AI tools (like Codex or CyAgent). Think of these as "Junior Interns" who are smart but get confused easily, give up quickly, or refuse to do "bad" things because they are programmed to be polite.
  • ARTEMIS (The New Robot): This is the researchers' new creation. Imagine a Swarm of Ants led by a General.
    • It has a Supervisor (the General) who plans the big picture.
    • It can spawn Sub-agents (the worker ants) to do specific tasks simultaneously.
    • It has a Triage Team (the quality control) that checks if a found "hole" is real or just a false alarm before reporting it.
    • It can keep working for hours without getting tired or needing a coffee break.
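
To make the "General and worker ants" picture a little more concrete, here is a minimal, hypothetical Python sketch of that supervisor / sub-agent / triage loop. Every name in it (the Finding class, supervisor_plan, sub_agent, triage) is invented for illustration; it is not ARTEMIS's actual code, just the control flow the analogy describes.

```python
# Hypothetical sketch (not the paper's code) of the loop described above:
# a supervisor plans tasks, sub-agents work in parallel, and a triage
# step filters out likely false alarms before anything is reported.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Finding:
    target: str
    description: str
    confidence: float  # 0.0 (probably a false alarm) .. 1.0 (verified)


def supervisor_plan(scope: list[str]) -> list[str]:
    """The 'General': break the engagement into one task per target."""
    return [f"enumerate services on {host}" for host in scope]


def sub_agent(task: str) -> Finding:
    """A 'worker ant': investigate one task and report back.
    A real sub-agent would drive an LLM plus security tools; this
    placeholder just returns a low-confidence example finding."""
    target = task.rsplit(" ", 1)[-1]
    return Finding(target, f"possible exposed admin panel on {target}", 0.4)


def triage(finding: Finding) -> bool:
    """Quality control: keep only findings that survive a re-check.
    A real triage agent would try to reproduce the issue; this
    placeholder just applies a confidence threshold."""
    return finding.confidence >= 0.5


def run(scope: list[str]) -> list[Finding]:
    tasks = supervisor_plan(scope)                   # the General makes a plan
    with ThreadPoolExecutor(max_workers=8) as pool:
        findings = list(pool.map(sub_agent, tasks))  # ants work in parallel
    return [f for f in findings if triage(f)]        # keep only vetted issues


if __name__ == "__main__":
    print(run(["10.0.0.5", "10.0.0.6"]))
```

In the real system each placeholder would be an LLM-driven agent wired to actual security tooling, but the shape is what the analogy describes: plan the work, fan it out, and verify findings before reporting them.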

3. The Results: Who Won?

The scoreboard looked like this:

  • The Humans: They found a total of 49 valid security holes. They were thorough, but they worked one thing at a time. If they found a suspicious door, they checked it, then moved to the next room.
  • The Old Robots: They mostly failed. They found very few holes, often got stuck on the first step, or refused to try certain attacks because of their safety filters.
  • ARTEMIS: It came in 2nd place overall, beating 9 out of the 10 human experts!
    • It found 9 valid, serious vulnerabilities.
    • It was incredibly fast at checking many doors at once (parallel processing).
    • It was much cheaper. A human expert costs about $60/hour. ARTEMIS cost about $18/hour.

4. The Strengths and Weaknesses (The "Superpowers" and "Kryptonite")

Where the Robots (ARTEMIS) Crushed the Humans:

  • The "Super-Scanner": Humans get tired. If a human finds a list of 100 computers to check, they might check 10 and take a break. ARTEMIS checked all 100 instantly and simultaneously.
  • The "Old Tech" Expert: Humans use modern web browsers. If a computer had an ancient, broken security system that modern browsers refused to load, humans gave up. ARTEMIS used command-line tools to bypass the browser and hack the ancient system anyway.
  • Cost: You can run ARTEMIS 24/7 for the price of one human's lunch break.
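
As a concrete (and again hypothetical) illustration of that fan-out pattern, the sketch below checks a made-up list of 100 hosts at once instead of one at a time. The addresses and port are placeholders; this is only the parallel-checking idea, not the tool ARTEMIS actually ran.

```python
# Hypothetical illustration of the "check all 100 at once" point above.
import socket
from concurrent.futures import ThreadPoolExecutor


def port_open(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


hosts = [f"10.0.0.{i}" for i in range(1, 101)]  # made-up 100-host list

# A thread pool probes every host concurrently; a human checking these
# one by one would take far longer.
with ThreadPoolExecutor(max_workers=50) as pool:
    results = dict(zip(hosts, pool.map(port_open, hosts)))

reachable = [h for h, ok in results.items() if ok]
print(f"{len(reachable)} of {len(hosts)} hosts answered on port 443")
```

Swap the port_open placeholder for whatever check matters (an HTTP request, a version probe) and the same fan-out pattern applies.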

Where the Humans Crushed the Robots:

  • The "GUI" Problem: This is the robot's biggest weakness. If a security hole required clicking through a graphical interface (like a weird web page with buttons and menus), the robots struggled. They are great at reading code (text), but bad at "seeing" and clicking buttons like a human does.
  • False Alarms: The robots were a bit paranoid. They sometimes reported a "broken lock" when it was actually just a reflection in the glass. Humans are better at knowing the difference between a real threat and a glitch.
  • Creativity: Humans could "pivot." If they found a small clue, they could guess, "Hmm, maybe this leads to the main vault?" and try a creative, non-standard path. The robots were more rigid; they followed the map they were given.

5. The Big Takeaway

This study proves that AI is no longer just a toy; it is a dangerous competitor.

  • The Good News: We now have a tool (ARTEMIS) that can help companies find their own security holes for a fraction of the cost of hiring humans. It's like having a security guard who never sleeps and checks every door in the building in seconds.
  • The Bad News: If bad actors (hackers) get access to this same technology, they could scan and hack entire networks faster than any human security team could stop them.

In a nutshell: The robots are becoming so good at "hunting" that they are beating most human hunters in a real-world environment. However, they still need a human to hold their hand when the task requires clicking a mouse or thinking outside the box. The future of cybersecurity isn't just humans or robots; it's humans using robots to stay safe.
