Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

This paper evaluates the autonomous cyber-attack capabilities of seven frontier AI models across two multi-step cyber ranges. It finds that performance scales log-linearly with inference-time compute and improves significantly across model generations, with the latest models completing up to 22 of 32 corporate network steps in roughly one-sixth the time a human expert would require.

Linus Folkerts, Will Payne, Simon Inman, Philippos Giavridis, Joe Skinner, Sam Deverett, James Aung, Ekin Zorer, Michael Schmatz, Mahmoud Ghanem, John Wilkinson, Alan Steer, Vy Hong, Jessica Wang

Published 2026-03-13

Imagine you are testing how fast a new generation of robot apprentices can learn to break into a highly secure building.

In the past, we tested these robots by asking them to solve single, isolated puzzles (like "pick this specific lock" or "crack this one password"). But in the real world, a cyberattack isn't just one puzzle; it's a long, winding journey involving dozens of steps, dead ends, and complex decisions.

This paper is a report card on how well these AI "robots" are doing when asked to navigate two different digital obstacle courses designed to simulate real-world cyberattacks.

The Two Obstacle Courses

  1. The Corporate Heist (32 Steps): Imagine a robot trying to sneak into a giant office building, steal a master key, and walk out with a briefcase of secret documents. It has to walk through 32 different rooms (servers), pick various locks, and trick security guards.
  2. The Power Plant Sabotage (7 Steps): Imagine a robot trying to shut down a nuclear power plant's cooling system. This is harder because the steps are massive and complex. It's not just "open a door"; it's "understand the entire plumbing system, rewrite the blueprints, and then turn the valve."

The Big Discoveries

The researchers tested 7 different AI models released over an 18-month period (from late 2024 to early 2026). Here is what they found, using some simple analogies:

1. The "More Brain Power" Effect (Scaling Compute)

Think of the AI's "brain power" as a fuel tank. The more fuel (tokens) you give the robot, the further it gets.

  • The Finding: If you let the robot use 10x more fuel, it doesn't just go 10% further; it goes 59% further.
  • The Analogy: It's like giving a hiker a bigger backpack. With a small backpack, they get tired after a mile. With a massive backpack full of supplies, they can hike for days. The AI didn't hit a "wall" where it got stuck; it just kept going as long as you kept feeding it fuel.
  • The Catch: You don't need to be a genius to do this. Any attacker can just buy more "fuel" (pay for more computer time) to make the AI smarter.
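The "59% further per 10x fuel" trend can be sketched as a toy calculation. Note the constants below (baseline token budget and baseline progress) are made-up illustrative values, not numbers from the paper; only the 1.59 multiplier per decade of compute comes from the finding above.

```python
import math

# Toy model of the reported scaling trend: each 10x increase in
# inference-time compute (tokens) lets the agent get ~59% further
# through the scenario. BASELINE_TOKENS and BASELINE_STEPS are
# hypothetical values chosen only to make the arithmetic concrete.
GAIN_PER_DECADE = 1.59
BASELINE_TOKENS = 1_000_000   # hypothetical token budget
BASELINE_STEPS = 4.0          # hypothetical progress at that budget

def expected_steps(tokens: int) -> float:
    """Progress grows by ~59% for every 10x more tokens spent."""
    decades = math.log10(tokens / BASELINE_TOKENS)
    return BASELINE_STEPS * GAIN_PER_DECADE ** decades

print(expected_steps(1_000_000))    # baseline: 4.0 steps
print(expected_steps(10_000_000))   # 10x budget: ~6.36 steps
print(expected_steps(100_000_000))  # 100x budget: ~10.11 steps
```

The key property, matching the paper's framing, is that there is no plateau in this toy model: progress keeps climbing as long as you keep adding fuel, just at a rate that is linear in the *logarithm* of the budget rather than in the budget itself.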

2. The "New Model" Effect (Generational Improvement)

The researchers compared the "older" robots (from 2024) to the "newer" ones (from 2026).

  • The Finding: The new models are significantly better, even if you give them the same amount of fuel.
  • The Analogy: It's like comparing a 2024 sedan to a 2026 Formula 1 car. Even if both have the same amount of gas, the new car is faster and more efficient.
  • The Stats:
    • Old Model (GPT-4o): In the Corporate Heist, it got stuck after about 1.7 steps. It was like a robot that could open the front door but couldn't figure out the hallway.
    • New Model (Opus 4.6): Given the same amount of fuel, it got through 9.8 steps.
    • The Best Run: The smartest robot, with the most fuel, managed to complete 22 out of 32 steps.
    • Human Comparison: A human expert would take about 14 hours to do this job. The best AI run did the equivalent of 6 hours of human work.
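Put side by side, the figures above work out like this. This is just arithmetic on the numbers already reported, no new data:

```python
# Simple arithmetic on the reported figures from the stats above.
old_steps, new_steps = 1.7, 9.8      # GPT-4o vs. Opus 4.6 progress
best_run, total_steps = 22, 32       # best run on the corporate range
human_hours, ai_equiv_hours = 14, 6  # full job vs. AI-equivalent work

print(new_steps / old_steps)         # ~5.8x generational improvement
print(best_run / total_steps)        # ~69% of the scenario completed
print(ai_equiv_hours / human_hours)  # ~43% of the full job, in human-hours
```

So across roughly 18 months of model releases, progress on the same course improved by nearly 6x, and the single best run covered about two-thirds of the full attack chain.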

3. The "Hard Mode" Problem (Industrial Control Systems)

While the robots got good at the office building, they are still terrible at the Power Plant.

  • The Finding: On the 7-step Power Plant attack, even the best robot only managed to complete 1.4 steps on average.
  • The Analogy: The robots are great at following a map to a treasure chest, but if you ask them to perform brain surgery or fix a jet engine, they still get confused. They lack the deep, specialized "surgical" knowledge required for industrial systems.
  • The Twist: Interestingly, the robots sometimes took "cheat codes." Instead of following the planned path (hacking a website first), they just started poking the power plant's machinery directly and accidentally found a way in. They didn't understand why it worked; they just got lucky.

The Bottlenecks: Where They Get Stuck

Even the best robots hit a wall. The paper found three main "choke points" where the AI gets stuck:

  1. The Relay Race: Trying to pass a stolen password from one computer to another in real-time.
  2. The Code Breaker: Having to reverse-engineer a complex, encrypted file (like taking apart a safe to see how the lock works).
  3. The Pipeline: Trying to hack a software update system to plant a virus.

The new models are finally starting to break through the first wall, but the others are still very hard.

Why This Matters (The "So What?")

  • The Barrier is Lowering: In the past, you needed to be a highly skilled hacker to pull off a complex, multi-step attack. Now, an AI can do a huge chunk of the work. This means someone with very little skill could potentially cause a lot of damage just by telling an AI what to do.
  • It's Getting Cheaper: As the models get better and the cost of running them drops, these attacks become more accessible to bad actors.
  • The Future Threat: We aren't quite at the point where an AI can take over a whole network and destroy it completely on its own. But we are at the point where an AI can do 60% of the work, leaving a human hacker to just finish the job.

The Bottom Line

AI agents are getting scary good at navigating complex digital mazes, especially when given enough time and computing power. They are improving rapidly, step by step. However, they still struggle with the most complex, specialized industrial tasks.

The researchers warn that we need to keep watching this closely. Just as we test cars for safety, we need to keep testing these AI "robots" to see how close they are to being able to break into our most critical systems entirely on their own.