Imagine you are the captain of a massive ship (a Security Operations Center, or SOC). Your job is to keep the ship safe from pirates, storms, and mechanical failures. But here's the problem: your ship is covered in thousands of blinking warning lights every single day. Some lights mean a real pirate is boarding; most are just a seagull landing on the radar.
You have a small crew of human analysts. They are tired, overwhelmed, and can't possibly check every single light. So, you decide to hire a super-smart, tireless robot assistant (an LLM, or Large Language Model) to help you sort through the noise and find the real threats.
But before you hand over the wheel to the robot, you need to know: Is it actually smart enough to do the job?
This is exactly what the paper "Before You Hand Over the Wheel" is about. The authors built a giant, realistic test drive called SIABENCH to see if these AI robots can actually handle the complex job of a security analyst.
Here is the breakdown of their adventure:
1. The Problem: The "Black Box" Dilemma
Right now, companies are rushing to buy these AI assistants. But nobody has a standardized "driver's license test" for them in the security world.
- The Risk: If you hire a robot that thinks a seagull is a pirate, you might panic and shut down your whole ship. If it thinks a real pirate is just a seagull, your ship gets robbed.
- The Gap: There was no standard dataset (a set of practice problems) that covered the messy, confusing work real security analysts actually do. Most tests were too simple, like asking the robot to solve a math problem, rather than asking it to investigate a crime scene.
2. The Solution: Building the "Driving Range" (SIABENCH)
The authors built a massive training ground to test these robots. They created two main types of tests:
- The "Deep Dive" Investigation (25 Scenarios): Imagine a detective story. The robot is given a messy crime scene (a hacked computer, a stolen file, a suspicious email). It has to use digital tools (like a magnifying glass or a fingerprint scanner) to answer questions like: Who did this? How did they get in? What tools did they use?
- The Twist: The robot has to do this step-by-step, just like a human. It can't just guess; it has to open files, run code, and look at logs.
- The "Alert Triage" Test (135 Scenarios): This is the "Seagull vs. Pirate" test. The robot is shown 135 warning lights. It has to quickly decide: Is this a real attack (True Positive) or a false alarm (False Positive)?
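The "Seagull vs. Pirate" scoring logic can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual grading code: the function name, labels, and alert IDs are invented. The key idea it shows is that the two failure modes are not equal, since waving a pirate through is far costlier than escalating a seagull.

```python
def triage_report(labels: dict, verdicts: dict) -> dict:
    """Score a model's verdicts against ground truth, keeping the
    two failure modes separate: missed attacks vs. needless escalations."""
    report = {"correct": 0, "missed_attack": 0, "needless_escalation": 0}
    for alert_id, truth in labels.items():
        verdict = verdicts.get(alert_id)
        if verdict == truth:
            report["correct"] += 1
        elif truth == "TP":                  # real pirate dismissed as a seagull
            report["missed_attack"] += 1
        else:                                # seagull escalated as a pirate
            report["needless_escalation"] += 1
    return report

# Toy run: ground truth for three alerts vs. a model's three verdicts.
labels = {"a1": "TP", "a2": "FP", "a3": "TP"}
verdicts = {"a1": "TP", "a2": "TP", "a3": "FP"}
print(triage_report(labels, verdicts))
```

Reporting the failure modes separately (rather than a single accuracy number) is what lets you see whether a model's errors are the dangerous kind.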
Crucial Step: The authors made sure the test questions were "de-biased." They didn't ask, "Find the hacker's IP address." Instead, they asked, "Is there any evidence of hacking? If so, what is the IP?" This forces the robot to actually think and look for evidence, rather than just guessing because the question told it what to find.
3. The Robot Assistant (The Agent)
The authors didn't just ask the AI to "write an answer." They built a Robot Agent that acts like a human analyst.
- The Loop: The robot gets a task, thinks about what tool to use, runs the tool, reads the messy output, summarizes the important parts, and then decides what to do next.
- The Memory: If the log file is 100 pages long, the robot has to summarize the first 50 pages to remember the key points before reading the next 50. This prevents the robot from getting "brain fog" (overflowing its context window and losing track of earlier evidence).
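The chunked-memory trick above can be sketched as a fold over the log: summarize a chunk, carry the summary forward, repeat. This is a minimal sketch, not the paper's implementation, and `call_llm` is a stand-in for a real model API (here it just keeps the first line of each chunk, so the mechanics can run without any external service).

```python
def call_llm(prompt: str, text: str) -> str:
    """Placeholder 'model': pretends to summarize by keeping the first line.
    A real agent would send the prompt and text to an actual LLM here."""
    return text.splitlines()[0] if text else ""

def summarize_long_log(log: str, chunk_lines: int = 50) -> str:
    """Fold a long log into a running summary, chunk by chunk, so the
    agent never holds the whole file in its context window at once."""
    lines = log.splitlines()
    summary = ""
    for start in range(0, len(lines), chunk_lines):
        chunk = "\n".join(lines[start:start + chunk_lines])
        # The previous summary rides along in the prompt, so nothing
        # important from earlier chunks is dropped on the floor.
        summary = call_llm(f"Fold this into the running summary:\n{summary}", chunk)
    return summary

log = "\n".join(f"line{i}" for i in range(120))
print(summarize_long_log(log))
```

The same loop structure works for tool output too: run a tool, compress the result, append it to the running state, then decide the next action.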
4. The Results: The Report Card
They tested 11 different AI models (some free and open, some expensive and closed) on this driving range. Here is what they found:
- The Stars: The newest, most powerful models (like Claude-4.5-Sonnet and GPT-5) are getting really good. They can solve about 80% of the "Easy" and "Medium" cases. They are great at spotting simple patterns, like a hacker scanning for open doors.
- The Struggles: Even the best robots fail at the hardest stuff. When the investigation requires deep, complex reasoning (like decoding a hidden message inside a PDF or analyzing a memory dump), they often get stuck or give up.
- The "Hallucination" Problem: Some robots, when they don't know the answer, just make things up. They might say, "The hacker used a red laser," when there was no red laser. This is dangerous in security.
- The "Give Up" Problem: Some older or smaller models get frustrated after a few wrong turns and just quit the investigation, leaving the job unfinished.
5. The "Live Fire" Test
To make sure the robots weren't just memorizing the answers from their training data, the authors tested them on brand new, real-world cases that were published after the robots were trained.
- Result: The top robots still performed well, proving they are actually learning to think, not just memorizing. However, they still struggled with the hardest, most complex new cases.
The Big Takeaway
The paper concludes that we are not quite ready to hand over the wheel yet.
- The Good News: AI is becoming a very powerful "junior analyst." It can handle the boring, repetitive work (like sorting through thousands of alerts) and help human experts focus on the hard stuff.
- The Bad News: If you let the AI run the whole show without a human watching, it might miss critical clues or get confused by complex attacks.
- The Future: We need to keep testing these models. As they get smarter, they will need fewer "training wheels" (less human guidance). But for now, the best setup is a Human + AI Team, where the AI does the heavy lifting and the human makes the final call.
In short: The paper built the ultimate "driver's ed" course for security AI. It showed us that while the students are passing the easy tests, they still need a lot of practice before they can drive the ship alone.