AttackSeqBench: Benchmarking the Capabilities of LLMs for Attack Sequences Understanding

This paper introduces AttackSeqBench, a comprehensive benchmark designed to systematically evaluate and analyze the capabilities of large language models in understanding and reasoning about adversarial behavioral sequences within Cyber Threat Intelligence reports.

Haokai Ma, Javier Yong, Yunshan Ma, Kuei Chen, Anis Yusof, Zhenkai Liang, Ee-Chien Chang

Published 2026-03-04

Imagine you are a detective trying to solve a complex crime. You have a thick, messy file full of witness statements, police reports, and scattered notes (these are the Cyber Threat Intelligence reports). Your job is to piece together the story: Who did what, when, and why?

For years, human detectives have struggled to read these files quickly. They are too long, too messy, and full of jargon. Recently, we've given these detectives a super-smart AI assistant (a Large Language Model or LLM) to help. These AI assistants are great at summarizing stories or finding names in a text. But can they actually understand the sequence of events? Can they figure out that the thief picked the lock before they stole the jewels, and after they disabled the alarm?

This paper introduces AttackSeqBench, a new "exam" designed to test if these AI detectives are actually good at understanding the storyline of a cyberattack, not just the individual words.

The Problem: The "Word Search" vs. The "Movie Plot"

Think of most current AI tests as a Word Search. They ask the AI: "Find the word 'phishing' in this text." The AI is great at this.

But real cyberattacks are more like a Movie Plot.

  • Scene 1: The bad guy sends a fake email (Phishing).
  • Scene 2: The victim clicks, and a virus downloads.
  • Scene 3: The virus hides in the system (Persistence).
  • Scene 4: The bad guy takes control of the computer (Command and Control).

If you ask the AI, "What happened after the virus downloaded?" it needs to understand the flow of the movie, not just find the word "download." The authors found that while AI is getting smarter, it often gets confused about the order of these scenes.

The Solution: The "AttackSeqBench" Exam

The researchers built a special test called AttackSeqBench. Here's how they made it, using a simple analogy:

  1. The Source Material: They took 408 real-world "crime scene reports" (CTI reports) from security companies.
  2. The Storyboard: They used AI to automatically turn these messy reports into clean "storyboards" (Attack Sequences), listing exactly what the hackers did step-by-step.
  3. The Questions: They asked the AI to generate quiz questions based on these storyboards.
    • Example Question: "The hacker sent a fake email, then installed a backdoor. What likely happened before they installed the backdoor?"
    • The Trap: They also created "trick questions" where the order was swapped (e.g., "Did they install the backdoor before sending the email?"). The AI has to say "No" and explain why that doesn't make sense.
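The "trap" construction can be pictured as a tiny template over two adjacent steps: one question with the true order (answer "Yes") and one with the order swapped (answer "No"). This is a hypothetical sketch of the idea, not the paper's exact generation pipeline or phrasing:

```python
# Hypothetical sketch: turn two ordered storyboard steps into a yes/no
# question pair, one true-order and one swapped-order "trick" question.
def make_order_questions(step_a, step_b):
    true_q = f"Did the attacker {step_a} before they {step_b}?"   # answer: Yes
    trick_q = f"Did the attacker {step_b} before they {step_a}?"  # answer: No
    return [(true_q, "Yes"), (trick_q, "No")]

qa_pairs = make_order_questions("send a fake email", "install a backdoor")
for question, answer in qa_pairs:
    print(answer, "-", question)
```

A model that only spots keywords will see the same two events in both questions; only a model that tracks their order can tell the pair apart.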

The "Students" Taking the Exam

The researchers tested three types of "students" (AI models):

  1. Standard LLMs: The regular smart assistants (like the ones in your phone or chat).
  2. LRMs (Large Reasoning Models): The "super-thinkers." These are newer AI models that pause to "think" longer before answering, similar to a student who takes a deep breath and works through a math problem step-by-step.
  3. Trained Models: Standard models that were given extra homework (post-training) specifically on cybersecurity books.

They tested them in three different ways:

  • Zero-Shot: "Here's the question, use your brain." (No help allowed).
  • Context: "Here's the question and the storyboard." (They can read the report).
  • RAG (Retrieval): "Here's the question. Go look up the answer in our library of security facts."
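The three settings differ only in what gets packed into the prompt. The sketch below is illustrative: `retriever` stands in for whatever retrieval component a RAG setup would use, and none of the prompt strings come from the paper itself:

```python
# Illustrative prompt construction for the three evaluation settings.
# The retriever callable and all prompt wording are assumptions for
# demonstration, not the paper's actual harness.
def build_prompt(question, report=None, retriever=None, k=3):
    if report is not None:  # Context setting: attach the source report
        return f"Report:\n{report}\n\nQuestion: {question}"
    if retriever is not None:  # RAG setting: attach top-k retrieved passages
        passages = "\n".join(retriever(question)[:k])
        return f"Retrieved facts:\n{passages}\n\nQuestion: {question}"
    return f"Question: {question}"  # Zero-shot setting: question only

# A stand-in retriever that always returns the same canned passages.
fake_retriever = lambda q: ["fact one", "fact two", "fact three", "fact four"]

print(build_prompt("What happened first?"))
print(build_prompt("What happened first?", report="<full CTI report text>"))
print(build_prompt("What happened first?", retriever=fake_retriever))
```

Keeping the question fixed while varying only the attached evidence is what lets the benchmark isolate how much each kind of help actually helps.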

The Surprising Results

The results were a mix of "Great job!" and "Oh no, not yet."

  • The "Super-Thinkers" (LRMs) Didn't Win: You might think the models that "think longer" would be better at solving a complex story. Surprise! They often did worse than the regular models.
    • Why? Imagine a student who overthinks a simple riddle. They start doubting themselves, creating imaginary scenarios, and getting lost in their own thoughts. The "Reasoning Models" got confused by the specific details of the cyberattack and talked themselves into the wrong answer.
  • Context is King: When the AI was allowed to read the report (Context setting), it got much better. But when it tried to use the "Library" (RAG), it often got confused by too much information. It's like giving a detective a library of 1,000 books when they only needed to read one specific page; they got overwhelmed.
  • The "Thinking" Gap: The AI is still bad at understanding the logic of time. It struggles to say, "You can't steal the jewels before you pick the lock."

Why This Matters

This paper is a wake-up call. It tells us that while AI is amazing at writing emails or summarizing news, it isn't quite ready to be the lead detective in a cybercrime investigation yet. It needs to learn how to understand the sequence of events better.

The authors have made their "exam" and the "storyboards" public. This means other researchers can use this test to build better AI detectives that won't get confused by the order of a hacker's moves, helping us stay safer in the digital world.

In short: We built a test to see if AI can understand the story of a cyberattack. We found that the "over-thinkers" actually get confused, and that AI needs to learn to pay closer attention to the timeline of events, not just the words.
