Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

This paper presents the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. The proposed multi-agent framework ARTEMIS outperformed nine of ten human participants in discovering valid vulnerabilities while offering significant cost advantages, despite current limitations in handling GUI-based tasks and a higher false-positive rate.

Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, Daniel E. Ho

Published 2026-03-04

Imagine a massive, high-tech library (a university computer network) with 8,000 rooms, thousands of books, and secret passages. The library has a security team whose job is to find the unlocked doors, broken windows, and hidden traps before the bad guys do.

This paper is the report card from a race between two groups of security experts trying to break into this library:

  1. The Humans: 10 real-life cybersecurity professionals (the "Red Team").
  2. The Robots: 6 different AI agents, including a new, super-smart robot team the researchers built called ARTEMIS.

Here is the breakdown of what happened, using simple analogies.

1. The Setup: The "Live Fire" Drill

Usually, when we test AI, we give it a video game or a puzzle book (like a Capture The Flag competition). But this study was different. They didn't use a simulation; they used the real, live university network.

  • The Rules: The humans and robots were given a "jumpbox" (a starting computer) and told to find as many security holes as possible in 10 hours.
  • The Safety Net: Because they were hacking a real place, there were strict safety guardrails. If a robot or human tried to delete files or crash the system, a human monitor would hit the "Emergency Stop" button immediately.

2. The Contenders

  • The Humans: Experienced hackers who know how to think creatively, spot patterns, and use tools like a detective.
  • The Old Robots: Existing AI tools (like Codex or CyAgent). Think of these as "Junior Interns" who are smart but get confused easily, give up quickly, or refuse to do "bad" things because they are programmed to be polite.
  • ARTEMIS (The New Robot): This is the researchers' new creation. Imagine a Swarm of Ants led by a General.
    • It has a Supervisor (the General) who plans the big picture.
    • It can spawn Sub-agents (the worker ants) to do specific tasks simultaneously.
    • It has a Triage Team (the quality control) that checks if a found "hole" is real or just a false alarm before reporting it.
    • It can keep working for hours without getting tired or needing a coffee break.
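
To make the "General and worker ants" picture a little more concrete, here is a minimal, hypothetical Python sketch of that supervisor / sub-agent / triage loop. Every name in it (the Finding class, supervisor_plan, sub_agent, triage) is invented for illustration; it is not ARTEMIS's actual code, just the control flow the analogy describes.

```python
# Hypothetical sketch (not the paper's code) of the loop described above:
# a supervisor plans tasks, sub-agents work in parallel, and a triage
# step filters out likely false alarms before anything is reported.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Finding:
    target: str
    description: str
    confidence: float  # 0.0 (probably a false alarm) .. 1.0 (verified)


def supervisor_plan(scope: list[str]) -> list[str]:
    """The 'General': break the engagement into one task per target."""
    return [f"enumerate services on {host}" for host in scope]


def sub_agent(task: str) -> Finding:
    """A 'worker ant': investigate one task and report back.
    A real sub-agent would drive an LLM plus security tools; this
    placeholder just returns a low-confidence example finding."""
    target = task.rsplit(" ", 1)[-1]
    return Finding(target, f"possible exposed admin panel on {target}", 0.4)


def triage(finding: Finding) -> bool:
    """Quality control: keep only findings that survive a re-check.
    A real triage agent would try to reproduce the issue; this
    placeholder just applies a confidence threshold."""
    return finding.confidence >= 0.5


def run(scope: list[str]) -> list[Finding]:
    tasks = supervisor_plan(scope)                   # the General makes a plan
    with ThreadPoolExecutor(max_workers=8) as pool:
        findings = list(pool.map(sub_agent, tasks))  # ants work in parallel
    return [f for f in findings if triage(f)]        # keep only vetted issues


if __name__ == "__main__":
    print(run(["10.0.0.5", "10.0.0.6"]))
```

In the real system each placeholder would be an LLM-driven agent wired to actual security tooling, but the shape is what the analogy describes: plan the work, fan it out, and verify findings before reporting them.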

3. The Results: Who Won?

The scoreboard looked like this:

  • The Humans: They found a total of 49 valid security holes. They were thorough, but they worked one thing at a time. If they found a suspicious door, they checked it, then moved to the next room.
  • The Old Robots: They mostly failed. They found very few holes, often got stuck on the first step, or refused to try certain attacks because of their safety filters.
  • ARTEMIS: It came in 2nd place overall, beating 9 out of the 10 human experts!
    • It found 9 valid, serious vulnerabilities.
    • It was incredibly fast at checking many doors at once (parallel processing).
    • It was much cheaper. A human expert costs about $60/hour. ARTEMIS cost about $18/hour.

4. The Strengths and Weaknesses (The "Superpowers" and "Kryptonite")

Where the Robots (ARTEMIS) Crushed the Humans:

  • The "Super-Scanner": Humans get tired. If a human finds a list of 100 computers to check, they might check 10 and take a break. ARTEMIS checked all 100 instantly and simultaneously.
  • The "Old Tech" Expert: Humans use modern web browsers. If a computer had an ancient, broken security system that modern browsers refused to load, humans gave up. ARTEMIS used command-line tools to bypass the browser and hack the ancient system anyway.
  • Cost: You can run ARTEMIS 24/7 for the price of one human's lunch break.
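
As a concrete (and again hypothetical) illustration of that fan-out pattern, the sketch below checks a made-up list of 100 hosts at once instead of one at a time. The addresses and port are placeholders; this is only the parallel-checking idea, not the tool ARTEMIS actually ran.

```python
# Hypothetical illustration of the "check all 100 at once" point above.
import socket
from concurrent.futures import ThreadPoolExecutor


def port_open(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


hosts = [f"10.0.0.{i}" for i in range(1, 101)]  # made-up 100-host list

# A thread pool probes every host concurrently; a human checking these
# one by one would take far longer.
with ThreadPoolExecutor(max_workers=50) as pool:
    results = dict(zip(hosts, pool.map(port_open, hosts)))

reachable = [h for h, ok in results.items() if ok]
print(f"{len(reachable)} of {len(hosts)} hosts answered on port 443")
```

Swap the port_open placeholder for whatever check matters (an HTTP request, a version probe) and the same fan-out pattern applies.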

Where the Humans Crushed the Robots:

  • The "GUI" Problem: This is the robot's biggest weakness. If a security hole required clicking through a graphical interface (like a weird web page with buttons and menus), the robots struggled. They are great at reading code (text), but bad at "seeing" and clicking buttons like a human does.
  • False Alarms: The robots were a bit paranoid. They sometimes reported a "broken lock" when it was actually just a reflection in the glass. Humans are better at knowing the difference between a real threat and a glitch.
  • Creativity: Humans could "pivot." If they found a small clue, they could guess, "Hmm, maybe this leads to the main vault?" and try a creative, non-standard path. The robots were more rigid; they followed the map they were given.

5. The Big Takeaway

This study proves that AI is no longer just a toy; it is a dangerous competitor.

  • The Good News: We now have a tool (ARTEMIS) that can help companies find their own security holes for a fraction of the cost of hiring humans. It's like having a security guard who never sleeps and checks every door in the building in seconds.
  • The Bad News: If bad actors (hackers) get access to this same technology, they could scan and hack entire networks faster than any human security team could stop them.

In a nutshell: The robots are becoming so good at "hunting" that they are beating most human hunters in a real-world environment. However, they still need a human to hold their hand when the task requires clicking a mouse or thinking outside the box. The future of cybersecurity isn't just humans or robots; it's humans using robots to stay safe.
