AttackSeqBench: Benchmarking the Capabilities of LLMs for Attack Sequences Understanding

This paper introduces AttackSeqBench, a comprehensive benchmark designed to systematically evaluate and analyze the capabilities of large language models in understanding and reasoning about adversarial behavioral sequences within Cyber Threat Intelligence reports.

Haokai Ma, Javier Yong, Yunshan Ma, Kuei Chen, Anis Yusof, Zhenkai Liang, Ee-Chien Chang

Published 2026-03-04

Imagine you are a detective trying to solve a complex crime. You have a thick, messy file full of witness statements, police reports, and scattered notes (these are the Cyber Threat Intelligence reports). Your job is to piece together the story: Who did what, when, and why?

For years, human detectives have struggled to read these files quickly. They are too long, too messy, and full of jargon. Recently, we've given these detectives a super-smart AI assistant (a Large Language Model or LLM) to help. These AI assistants are great at summarizing stories or finding names in a text. But can they actually understand the sequence of events? Can they figure out that the thief picked the lock before they stole the jewels, and after they disabled the alarm?

This paper introduces AttackSeqBench, a new "exam" designed to test if these AI detectives are actually good at understanding the storyline of a cyberattack, not just the individual words.

The Problem: The "Word Search" vs. The "Movie Plot"

Think of most current AI tests as a Word Search. They ask the AI: "Find the word 'phishing' in this text." The AI is great at this.

But real cyberattacks are more like a Movie Plot.

  • Scene 1: The bad guy sends a fake email (Phishing).
  • Scene 2: The victim clicks, and a virus downloads.
  • Scene 3: The virus hides in the system (Persistence).
  • Scene 4: The bad guy takes control of the computer (Command and Control).

If you ask the AI, "What happened after the virus downloaded?" it needs to understand the flow of the movie, not just find the word "download." The authors found that while AI is getting smarter, it often gets confused about the order of these scenes.

The Solution: The "AttackSeqBench" Exam

The researchers built a special test called AttackSeqBench. Here's how they made it, using a simple analogy:

  1. The Source Material: They took 408 real-world "crime scene reports" (CTI reports) from security companies.
  2. The Storyboard: They used AI to automatically turn these messy reports into clean "storyboards" (Attack Sequences), listing exactly what the hackers did step-by-step.
  3. The Questions: They asked the AI to generate quiz questions based on these storyboards.
    • Example Question: "The hacker sent a fake email, then installed a backdoor. What likely happened before they installed the backdoor?"
    • The Trap: They also created "trick questions" where the order was swapped (e.g., "Did they install the backdoor before sending the email?"). The AI has to say "No" and explain why that doesn't make sense.
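The "trap" construction can be pictured as a tiny template over two adjacent steps: one question with the true order (answer "Yes") and one with the order swapped (answer "No"). This is a hypothetical sketch of the idea, not the paper's exact generation pipeline or phrasing:

```python
# Hypothetical sketch: turn two ordered storyboard steps into a yes/no
# question pair, one true-order and one swapped-order "trick" question.
def make_order_questions(step_a, step_b):
    true_q = f"Did the attacker {step_a} before they {step_b}?"   # answer: Yes
    trick_q = f"Did the attacker {step_b} before they {step_a}?"  # answer: No
    return [(true_q, "Yes"), (trick_q, "No")]

qa_pairs = make_order_questions("send a fake email", "install a backdoor")
for question, answer in qa_pairs:
    print(answer, "-", question)
```

A model that only spots keywords will see the same two events in both questions; only a model that tracks their order can tell the pair apart.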

The "Students" Taking the Exam

The researchers tested three types of "students" (AI models):

  1. Standard LLMs: The regular smart assistants (like the ones in your phone or chat).
  2. LRMs (Large Reasoning Models): The "super-thinkers." These are newer AI models that pause to "think" longer before answering, similar to a student who takes a deep breath and works through a math problem step-by-step.
  3. Trained Models: Standard models that were given extra homework (post-training) specifically on cybersecurity books.

They tested them in three different ways:

  • Zero-Shot: "Here's the question, use your brain." (No help allowed).
  • Context: "Here's the question and the storyboard." (They can read the report).
  • RAG (Retrieval): "Here's the question. Go look up the answer in our library of security facts."
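The three settings differ only in what gets packed into the prompt. The sketch below is illustrative: `retriever` stands in for whatever retrieval component a RAG setup would use, and none of the prompt strings come from the paper itself:

```python
# Illustrative prompt construction for the three evaluation settings.
# The retriever callable and all prompt wording are assumptions for
# demonstration, not the paper's actual harness.
def build_prompt(question, report=None, retriever=None, k=3):
    if report is not None:  # Context setting: attach the source report
        return f"Report:\n{report}\n\nQuestion: {question}"
    if retriever is not None:  # RAG setting: attach top-k retrieved passages
        passages = "\n".join(retriever(question)[:k])
        return f"Retrieved facts:\n{passages}\n\nQuestion: {question}"
    return f"Question: {question}"  # Zero-shot setting: question only

# A stand-in retriever that always returns the same canned passages.
fake_retriever = lambda q: ["fact one", "fact two", "fact three", "fact four"]

print(build_prompt("What happened first?"))
print(build_prompt("What happened first?", report="<full CTI report text>"))
print(build_prompt("What happened first?", retriever=fake_retriever))
```

Keeping the question fixed while varying only the attached evidence is what lets the benchmark isolate how much each kind of help actually helps.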

The Surprising Results

The results were a mix of "Great job!" and "Oh no, not yet."

  • The "Super-Thinkers" (LRMs) Didn't Win: You might think the models that "think longer" would be better at solving a complex story. Surprise! They often did worse than the regular models.
    • Why? Imagine a student who overthinks a simple riddle. They start doubting themselves, creating imaginary scenarios, and getting lost in their own thoughts. The "Reasoning Models" got confused by the specific details of the cyberattack and talked themselves into the wrong answer.
  • Context is King: When the AI was allowed to read the report (Context setting), it got much better. But when it tried to use the "Library" (RAG), it often got confused by too much information. It's like giving a detective a library of 1,000 books when they only needed to read one specific page; they got overwhelmed.
  • The "Thinking" Gap: The AI is still bad at understanding the logic of time. It struggles to say, "You can't steal the jewels before you pick the lock."

Why This Matters

This paper is a wake-up call. It tells us that while AI is amazing at writing emails or summarizing news, it isn't quite ready to be the lead detective in a cybercrime investigation yet. It needs to learn how to understand the sequence of events better.

The authors have made their "exam" and the "storyboards" public. This means other researchers can use this test to build better AI detectives that won't get confused by the order of a hacker's moves, helping us stay safer in the digital world.

In short: We built a test to see if AI can understand the story of a cyberattack. We found that the "over-thinkers" actually get confused, and that AI needs to learn to pay closer attention to the timeline of events, not just the words.
