CyberSleuth: Autonomous Blue-Team LLM Agent for Web Attack Forensics

Imagine you are a detective trying to solve a crime that happened in a digital city. The crime scene is a messy pile of evidence: millions of tiny digital footprints (network traffic) left behind by a hacker.

In the past, a human detective had to sift through this mountain of evidence, read every single note, cross-reference it with old case files, and write a report. The work took days, and tired investigators often missed clues.

CyberSleuth is a new kind of "AI Detective" designed to do this job automatically. It's not just a chatbot that answers questions; it's an agent that can think, use tools, and investigate on its own.

Here is how the paper explains this new detective, broken down into simple concepts:

1. The Problem: The "Needle in a Haystack"

When a hacker attacks a website, the exchange can be recorded as a massive capture of network traffic (called a PCAP file). It's like a very long transcript of a conversation between the hacker and the server.

The Old Way: A human reads the whole transcript, highlights the suspicious parts, looks up what those words mean in a dictionary of known crimes (CVEs), and writes a report.
The New Way: CyberSleuth reads the transcript, figures out who the criminal is, what tool they used, and whether they succeeded, all in minutes.

2. The Detective's Toolkit: Three Different Architectures

The researchers tried three different ways to build this AI detective to see which one worked best. Think of these as different management styles for a police team:

The "Lone Ranger" (Single Agent): One smart detective tries to do everything alone. They read the whole file, search the internet, and write the report.
- Result: This detective gets overwhelmed. They read too much, forget the beginning of the file by the time they reach the end, and get confused. They often guess wrong.
The "Bureaucratic Boss" (Tshark Expert Agent): The main detective is a boss who gives orders to a specialist (a "Tshark" expert) to look at specific parts of the file.
- Result: This is better, but the boss and the specialist often misunderstand each other. The boss asks for "everything," and the specialist gets lost in the details. They talk past each other.
The "Specialized Task Force" (Flow Reporter Agent - FRA): This is the winner. The team is split up perfectly:
1. The Summarizer: A fast worker who scans the whole file and creates a short, easy-to-read summary of the "suspicious" parts.
2. The Investigator: The main detective reads only that summary. They don't get bogged down in the raw data.
3. The Librarian: A tool that instantly looks up the summary on the internet to match it with known criminal profiles (CVEs).
- Result: This team works like a well-oiled machine. The Investigator stays focused, the Summarizer handles the heavy lifting, and the Librarian provides the facts.

3. The "Memory" Problem: The Sticky Note vs. The Filing Cabinet

AI models have a limit on how much text they can remember at once (like a sticky note that only fits 5 words). If a case is long, the AI forgets what happened at the start.

The Solution: CyberSleuth uses a "Filing Cabinet" (a vector database). As it investigates, it writes down key clues on index cards and files them away. When it needs to remember something from 10 steps ago, it pulls the right card out of the cabinet. This allows it to solve long, complex cases without losing its train of thought.

4. The "Web Search" Skill: Not Just Guessing

A common mistake AI makes is "hallucinating" (making things up). If the AI sees a strange code, it might guess, "Oh, that's probably the 'Great Firewall' bug!" even if it's wrong.

CyberSleuth's Trick: It is programmed to say, "I don't know yet. Let me check the internet." It searches for the specific service and the type of attack to find the exact match. It treats the internet like a giant library of criminal records to verify its findings.
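One way to picture this rule is as a small guard around the search tool. The following is a minimal sketch, not the paper's implementation: `build_search_query` and `extract_cve_ids` are hypothetical helpers. The query is built only from what was observed (service, version, attack type), any attempt to put a guessed CVE identifier into the query is rejected, and identifiers are collected only from the text that the search returns.

```python
import re
from typing import List

# Matches CVE identifiers such as CVE-2021-41773 (year, then four or more digits).
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def build_search_query(service: str, version: str, attack_type: str) -> str:
    """Build a query from observed evidence only (hypothetical helper)."""
    parts = [service.strip(), version.strip(), attack_type.strip(), "vulnerability"]
    query = " ".join(p for p in parts if p)
    if CVE_PATTERN.search(query):
        # The identifier must come from search results, not from a guess.
        raise ValueError("describe the attack; do not guess a CVE identifier")
    return query


def extract_cve_ids(search_result_text: str) -> List[str]:
    """Return CVE identifiers found in retrieved text, de-duplicated in order."""
    found = (m.upper() for m in CVE_PATTERN.findall(search_result_text))
    return list(dict.fromkeys(found))
```

The design choice here is that the guard sits in ordinary code rather than in the prompt alone, so a guessed identifier fails loudly instead of silently steering the search.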

5. The Results: How Good is It?

The researchers tested CyberSleuth on 30 web attack scenarios built around known vulnerabilities (some old, some brand new from 2025).

Accuracy: On the 2025 incidents, it correctly identified the hacker's target (the service) in 90% of the cases and the specific "weapon" (vulnerability) in 80%.
Human Approval: They showed the reports to 25 real cybersecurity experts. The experts rated the reports as complete, useful, and logical. They said the AI sounded like a competent junior analyst.
Adaptability: The best part? The researchers didn't have to rebuild the detective. They just gave it a new instruction: "Now, look at this traffic from a malware-infected computer instead of a website." The same team of AI agents handled 9 of those 10 cases too!

The Big Takeaway

The paper shows that AI can be a great partner for cybersecurity, but only if you design it right.

Don't give one AI too many jobs (it gets confused).
Give it a team of specialists (it works better).
Give it a filing cabinet for its memory (it doesn't forget).

CyberSleuth is a step toward a future where AI handles the boring, tedious digging through data, leaving human experts free to focus on the big picture and on stopping the next big attack.

Here is a detailed technical summary of the paper "CyberSleuth: Autonomous Blue-Team LLM Agent for Web Attack Forensics."

1. Problem Statement

Post-mortem analysis of compromised systems is a critical component of cyber forensics but remains a highly manual, time-consuming, and error-prone task. While Large Language Models (LLMs) have shown promise in automating reasoning-heavy tasks, applying them to cybersecurity forensics is challenging due to the domain's specific requirements:

Long-term reasoning: The need to maintain context over long investigation pipelines.
Contextual memory: The ability to correlate unstructured evidence across multiple events.
Evidence correlation: Connecting raw network traces to specific vulnerabilities (CVEs) and attack outcomes.

Existing LLM agents often struggle with memory management, context switching, and coordinating sub-agents, leading to hallucinations or loss of focus during complex forensic investigations.

2. Methodology

The authors propose CyberSleuth, an autonomous LLM agent designed to automate the forensic investigation pipeline. The methodology involves three core components:

A. System Architecture

The authors systematically evaluated three distinct agent architectures using a benchmark of 30 controlled web attack scenarios (15 successful, 15 failed) and 10 benign traces:

Single Agent (SA): A monolithic agent that processes packet summaries and iteratively inspects payloads using a tshark wrapper. It relies on a "scratchpad" for memory.
Tshark Expert Agent (TEA): A multi-agent design where a main agent delegates specific packet inspection tasks to a specialized tshark sub-agent. This aims to reduce the cognitive load on the main agent but suffers from coordination gaps (e.g., the main agent issuing vague instructions).
Flow Reporter Agent (FRA) – The Winner: A sequential multi-agent design that decouples evidence extraction from reasoning.
- Flow Summariser Sub-agent: Systematically inspects TCP/UDP flows, reconstructs application-layer payloads, and generates a structured forensic report (identifying services, anomalies, and attack patterns).
- Main Agent: Analyzes the pre-processed summary, performs targeted web searches to map evidence to CVEs, and generates the final report.
- Key Innovation: This architecture prevents the main agent from getting lost in raw data and ensures high-quality, structured input for reasoning.
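The decoupling described for FRA can be sketched as a two-stage function pipeline. This is an illustration under stated assumptions, not the authors' code: `FlowSummary`, `ForensicReport`, `run_pipeline`, and the two stub stages are hypothetical names, and the stubs stand in for LLM calls so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class FlowSummary:
    """Structured hand-off between the two stages (fields are illustrative)."""
    services: List[str]
    anomalies: List[str] = field(default_factory=list)


@dataclass
class ForensicReport:
    service: Optional[str]
    evidence: List[str]
    cve: Optional[str] = None  # would be filled in after a web search


def run_pipeline(
    flows: List[str],
    summarise: Callable[[List[str]], FlowSummary],
    investigate: Callable[[FlowSummary], ForensicReport],
) -> ForensicReport:
    summary = summarise(flows)   # stage 1: the only stage that sees raw flows
    return investigate(summary)  # stage 2: reasons over the summary alone


def stub_summarise(flows: List[str]) -> FlowSummary:
    """Stand-in for the summariser sub-agent: flag flows containing '../'."""
    anomalies = [f for f in flows if "../" in f]
    return FlowSummary(services=["http"], anomalies=anomalies)


def stub_investigate(summary: FlowSummary) -> ForensicReport:
    """Stand-in for the main agent: turn the summary into a report."""
    service = summary.services[0] if summary.services else None
    return ForensicReport(service=service, evidence=list(summary.anomalies))
```

Because `investigate` receives only a `FlowSummary`, the second stage cannot reach the raw traffic, which is the property the sequential design relies on.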

B. Memory Management

To overcome the context window limitations of LLMs, the system implements a MemGPT-style memory architecture:

FIFO Queue: Short-term memory for recent actions and observations.
Vector Database: Long-term memory where relevant past experiences are embedded and stored. The agent retrieves semantically similar past insights to maintain coherence during long investigations.
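The two memory tiers can be sketched as follows. Several parts are assumptions made for illustration: the class name `AgentMemory`, the rule that entries evicted from the FIFO queue are moved into long-term storage, and the `embed` function, which is a toy bag-of-words stand-in for a real embedding model and vector database.

```python
import math
from collections import Counter, deque
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class AgentMemory:
    """FIFO short-term queue plus a similarity-searched long-term store."""

    def __init__(self, short_term_size: int = 3):
        self.short_term_size = short_term_size
        self.short_term = deque()  # most recent observations, oldest first
        self.long_term = []        # list of (vector, text) pairs

    def add(self, observation: str) -> None:
        self.short_term.append(observation)
        # Assumption: the oldest entries are embedded and archived on eviction.
        while len(self.short_term) > self.short_term_size:
            evicted = self.short_term.popleft()
            self.long_term.append((embed(evicted), evicted))

    def recall(self, query: str, k: int = 1) -> List[str]:
        """Return the k archived observations most similar to the query."""
        q = embed(query)
        ranked = sorted(self.long_term, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

In use, the agent would call `add` after every step and `recall` with a description of what it currently needs, so older findings can re-enter the prompt without keeping the whole history in context.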

C. Tool Integration

Web Search: A specialized tool that allows the agent to query external knowledge bases (CVEs, TTPs). The system enforces strict prompting guidelines to prevent hallucination (e.g., forcing the agent to describe the attack type rather than guessing the CVE identifier).
Tshark: Used for packet parsing and flow reconstruction.
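The summary does not say how the paper's tshark wrapper is built, so the following is only a sketch of what such a wrapper might look like. The function names are hypothetical; the tshark options used (`-r` to read a capture file, `-q` to suppress per-packet output, `-z conv,tcp` for a TCP conversation table, and `-z follow,tcp,ascii,N` to reconstruct the payload of stream N) are standard tshark options. Running the commands requires tshark to be installed.

```python
import subprocess
from typing import List


def list_tcp_conversations_cmd(pcap_path: str) -> List[str]:
    """Command that prints a table of TCP conversations in the capture."""
    return ["tshark", "-r", pcap_path, "-q", "-z", "conv,tcp"]


def follow_tcp_stream_cmd(pcap_path: str, stream_index: int) -> List[str]:
    """Command that prints the reassembled payload of one TCP stream as ASCII."""
    if stream_index < 0:
        raise ValueError("stream_index must be non-negative")
    return ["tshark", "-r", pcap_path, "-q", "-z", f"follow,tcp,ascii,{stream_index}"]


def run_tshark(cmd: List[str], timeout_s: int = 60) -> str:
    """Run a tshark command and return its standard output as text."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s, check=True
    )
    return result.stdout
```

Passing the command as an argument list rather than a shell string keeps a capture path with unusual characters from being interpreted by a shell.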

D. Evaluation Datasets

Web Attack Benchmark: 30 incidents involving known CVEs (ranging from legacy to 2025 vulnerabilities) against services like Apache, Jenkins, and GitLab.
Malware Traffic Portability: 10 traces of malware infections (e.g., RATs, Stealers) to test if the design generalizes beyond web attacks.
Benign Traffic: 10 traces of normal browsing to test for false positives.

3. Key Contributions

First Systematic Study: The paper presents the first comprehensive evaluation of LLM agents for automating post-mortem forensic investigations, moving beyond red-team (attack) simulations to blue-team (defense) analysis.
Architectural Insights:
- Multi-agent Specialization: Specialized sub-agents (like the Flow Summariser) outperform single all-in-one agents by maintaining focus.
- Simple Orchestration: A sequential pipeline (Summariser → Main Agent) outperforms deeply nested hierarchical designs, which often suffer from coordination failures.
- Generalizability: The design principles transfer effectively from web service attacks to malware traffic analysis with minimal prompt engineering changes.
Open Benchmark & Platform: The authors release CyberSleuth and an expanded benchmark (including 2025 vulnerabilities) as an open platform to support rigorous, reproducible evaluation.

4. Results

The Flow Reporter Agent (FRA) combined with advanced LLM backends (specifically OpenAI GPT-5 and o3) achieved the best performance:

Accuracy:
- Service Identification: 90% accuracy on 2025 incidents.
- CVE Detection: 80% accuracy on 2025 incidents (a significant improvement over the 14% baseline reported in previous work).
- Attack Success Evaluation: 80-90% accuracy.
Efficiency: FRA converged in an average of ~5 reasoning steps, significantly reducing token usage and cost compared to Single Agent approaches (which averaged ~18 steps).
Cost: Open-source models (DeepSeek R1, Kimi K2) achieved performance comparable to proprietary models (GPT-4o, GPT-5) at roughly half the cost.
Human Evaluation: A panel of 25 experts rated the reports generated by CyberSleuth as complete, useful, and logically coherent (average score >4.2/5).
Portability: When applied to malware traffic, the agent successfully extracted victim details and Indicators of Compromise (IOCs) in 9/10 cases, demonstrating strong transferability.
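The accuracy figures above are per-field match rates over a set of incidents. As a rough sketch of the arithmetic, assuming exact-match scoring (the summary does not state the paper's scoring rule), a hypothetical `field_accuracy` helper could look like this. The field names are illustrative, and no real benchmark data is used.

```python
from typing import Dict, List


def field_accuracy(
    predictions: List[Dict[str, object]],
    ground_truth: List[Dict[str, object]],
    field_name: str,
) -> float:
    """Fraction of incidents where the predicted field equals the true value."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need equally sized, non-empty prediction and truth lists")
    hits = sum(
        1
        for pred, truth in zip(predictions, ground_truth)
        if pred.get(field_name) == truth[field_name]
    )
    return hits / len(ground_truth)
```

With this definition, a field that matches in 4 of 5 incidents scores 0.8, and the same function can be applied separately to the service, the CVE, and the attack-success verdict.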

5. Significance

Operational Feasibility: The study shows that autonomous AI agents can effectively support human analysts in complex security workflows, reducing the time from incident detection to root cause analysis.
Design Guidelines: It establishes best practices for building forensic agents, emphasizing the need for specialized sub-agents and robust memory management over complex, nested orchestration.
Future of Blue-Teaming: By demonstrating that agents can handle raw PCAP data and correlate it with external threat intelligence, the paper paves the way for fully automated incident response and forensic reporting systems.
Resilience to New Threats: The system's ability to correctly identify 2025 vulnerabilities (disclosed after the models' training data was collected) suggests that agentic workflows relying on external search tools are more robust to knowledge cutoffs than static models.

In conclusion, CyberSleuth represents a significant step forward in automating cyber forensics, offering a scalable, accurate, and cost-effective solution for post-mortem analysis that generalizes across different types of network incidents.