Measuring AI Agents' Progress on Multi-Step Cyber Attack Scenarios

This paper evaluates the autonomous cyber-attack capabilities of seven frontier AI models across two multi-step cyber ranges. It finds that performance scales log-linearly with inference-time compute and improves significantly across model generations, with the latest models completing up to 22 of 32 corporate network steps in roughly one-sixth the time a human expert would require.

Linus Folkerts, Will Payne, Simon Inman, Philippos Giavridis, Joe Skinner, Sam Deverett, James Aung, Ekin Zorer, Michael Schmatz, Mahmoud Ghanem, John Wilkinson, Alan Steer, Vy Hong, Jessica Wang

Published 2026-03-13

Imagine you are testing how fast a new generation of robot apprentices can learn to break into a highly secure building.

In the past, we tested these robots by asking them to solve single, isolated puzzles (like "pick this specific lock" or "crack this one password"). But in the real world, a cyberattack isn't just one puzzle; it's a long, winding journey involving dozens of steps, dead ends, and complex decisions.

This paper is a report card on how well these AI "robots" are doing when asked to navigate two different digital obstacle courses designed to simulate real-world cyberattacks.

The Two Obstacle Courses

  1. The Corporate Heist (32 Steps): Imagine a robot trying to sneak into a giant office building, steal a master key, and walk out with a briefcase of secret documents. It has to walk through 32 different rooms (servers), pick various locks, and trick security guards.
  2. The Power Plant Sabotage (7 Steps): Imagine a robot trying to shut down a nuclear power plant's cooling system. This is harder because the steps are massive and complex. It's not just "open a door"; it's "understand the entire plumbing system, rewrite the blueprints, and then turn the valve."

The Big Discoveries

The researchers tested 7 different AI models released over an 18-month period (from late 2024 to early 2026). Here is what they found, using some simple analogies:

1. The "More Brain Power" Effect (Scaling Compute)

Think of the AI's "brain power" as a fuel tank. The more fuel (tokens) you give the robot, the further it gets.

  • The Finding: If you let the robot use 10x more fuel, it doesn't just go 10% further; it goes 59% further.
  • The Analogy: It's like giving a hiker a bigger backpack. With a small backpack, they get tired after a mile. With a massive backpack full of supplies, they can hike for days. The AI didn't hit a "wall" where it got stuck; it just kept going as long as you kept feeding it fuel.
  • The Catch: You don't need to be a genius to do this. Any attacker can just buy more "fuel" (pay for more computer time) to make the AI smarter.
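The "59% further per 10x fuel" trend can be sketched as a toy calculation. Note the constants below (baseline token budget and baseline progress) are made-up illustrative values, not numbers from the paper; only the 1.59 multiplier per decade of compute comes from the finding above.

```python
import math

# Toy model of the reported scaling trend: each 10x increase in
# inference-time compute (tokens) lets the agent get ~59% further
# through the scenario. BASELINE_TOKENS and BASELINE_STEPS are
# hypothetical values chosen only to make the arithmetic concrete.
GAIN_PER_DECADE = 1.59
BASELINE_TOKENS = 1_000_000   # hypothetical token budget
BASELINE_STEPS = 4.0          # hypothetical progress at that budget

def expected_steps(tokens: int) -> float:
    """Progress grows by ~59% for every 10x more tokens spent."""
    decades = math.log10(tokens / BASELINE_TOKENS)
    return BASELINE_STEPS * GAIN_PER_DECADE ** decades

print(expected_steps(1_000_000))    # baseline: 4.0 steps
print(expected_steps(10_000_000))   # 10x budget: ~6.36 steps
print(expected_steps(100_000_000))  # 100x budget: ~10.11 steps
```

The key property, matching the paper's framing, is that there is no plateau in this toy model: progress keeps climbing as long as you keep adding fuel, just at a rate that is linear in the *logarithm* of the budget rather than in the budget itself.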

2. The "New Model" Effect (Generational Improvement)

The researchers compared the "older" robots (from 2024) to the "newer" ones (from 2026).

  • The Finding: The new models are significantly better, even if you give them the same amount of fuel.
  • The Analogy: It's like comparing a 2024 sedan to a 2026 Formula 1 car. Even if both have the same amount of gas, the new car is faster and more efficient.
  • The Stats:
    • Old Model (GPT-4o): In the Corporate Heist, it got stuck after about 1.7 steps. It was like a robot that could open the front door but couldn't figure out the hallway.
    • New Model (Opus 4.6): Given the same amount of fuel, it got through 9.8 steps.
    • The Best Run: The smartest robot, with the most fuel, managed to complete 22 out of 32 steps.
    • Human Comparison: A human expert would take about 14 hours to do this job. The best AI run did the equivalent of 6 hours of human work.
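Put side by side, the figures above work out like this. This is just arithmetic on the numbers already reported, no new data:

```python
# Simple arithmetic on the reported figures from the stats above.
old_steps, new_steps = 1.7, 9.8      # GPT-4o vs. Opus 4.6 progress
best_run, total_steps = 22, 32       # best run on the corporate range
human_hours, ai_equiv_hours = 14, 6  # full job vs. AI-equivalent work

print(new_steps / old_steps)         # ~5.8x generational improvement
print(best_run / total_steps)        # ~69% of the scenario completed
print(ai_equiv_hours / human_hours)  # ~43% of the full job, in human-hours
```

So across roughly 18 months of model releases, progress on the same course improved by nearly 6x, and the single best run covered about two-thirds of the full attack chain.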

3. The "Hard Mode" Problem (Industrial Control Systems)

While the robots got good at the office building, they are still terrible at the Power Plant.

  • The Finding: On the 7-step Power Plant attack, even the best robot only managed to complete 1.4 steps on average.
  • The Analogy: The robots are great at following a map to a treasure chest, but if you ask them to perform brain surgery or fix a jet engine, they still get confused. They lack the deep, specialized "surgical" knowledge required for industrial systems.
  • The Twist: Interestingly, the robots sometimes took "cheat codes." Instead of following the planned path (hacking a website first), they just started poking the power plant's machinery directly and accidentally found a way in. They didn't understand why it worked; they just got lucky.

The Bottlenecks: Where They Get Stuck

Even the best robots hit a wall. The paper found three main "choke points" where the AI gets stuck:

  1. The Relay Race: Trying to pass a stolen password from one computer to another in real-time.
  2. The Code Breaker: Having to reverse-engineer a complex, encrypted file (like taking apart a safe to see how the lock works).
  3. The Pipeline: Trying to hack a software update system to plant a virus.

The new models are finally starting to break through the first wall, but the others are still very hard.

Why This Matters (The "So What?")

  • The Barrier is Lowering: In the past, you needed to be a highly skilled hacker to pull off a complex, multi-step attack. Now, an AI can do a huge chunk of the work. This means someone with very little skill could potentially cause a lot of damage just by telling an AI what to do.
  • It's Getting Cheaper: As the models get better and the cost of running them drops, these attacks become more accessible to bad actors.
  • The Future Threat: We aren't quite at the point where an AI can take over a whole network and destroy it completely on its own. But we are at the point where an AI can do 60% of the work, leaving a human hacker to just finish the job.

The Bottom Line

AI agents are getting scary good at navigating complex digital mazes, especially when given enough time and computing power. They are improving rapidly, step by step. However, they still struggle with the most complex, specialized industrial tasks.

The researchers warn that we need to keep watching this closely. Just as we test cars for safety, we need to keep testing these AI "robots" to see how close they are to being able to break into our most critical systems entirely on their own.