What Are Adversaries Doing? Automating Tactics, Techniques, and Procedures Extraction: A Systematic Review

Imagine the world of cybersecurity as a massive, chaotic library where millions of books, diaries, and news clippings are being written every day by both the good guys (defenders) and the bad guys (adversaries). The bad guys are constantly inventing new ways to break into houses (networks), but they write about their crimes in messy, unorganized notes.

The goal of this research paper is to figure out how to build a super-smart librarian who can read all these messy notes, understand exactly what the bad guys are planning, how they are doing it, and why they are doing it, and then organize that information into a neat, searchable encyclopedia.

Here is a breakdown of the paper using simple analogies:

1. The Problem: The "Needle in a Haystack"

The authors explain that cyberattacks are exploding in number. Security experts are drowning in reports. Trying to manually read thousands of these reports to find specific details about how a hacker stole data is like trying to find a specific needle in a haystack while blindfolded and wearing thick gloves. It's slow, tiring, and prone to mistakes.

They need a way to automatically extract TTPs:

Tactics: The Goal (e.g., "I want to steal your bank account").
Techniques: The Method (e.g., "I will guess your password").
Procedures: The Specific Steps (e.g., "I will type '1234' then '5678'").

2. The Mission: The "Detective's Review"

The authors didn't just write a new tool; they acted like detectives reviewing other detectives. They looked at 80 different research papers (studies) that tried to build these automatic extractors. They wanted to answer: Who is doing what? What tools are they using? And are they actually working?

3. What They Found: The "State of the Art"

After analyzing all 80 studies, they found a few major trends, which can be pictured like this:

The "Recipe" vs. The "Menu": Most researchers are focused on identifying the specific Techniques (the recipes). They are very good at spotting "I used a password cracker." However, they are less good at spotting the high-level Tactics (the menu item: "I am trying to steal money") or searching for specific techniques across huge libraries of text.
The Evolution of Tools:
- Old School: Early researchers used simple Rule-Based systems (like a "Find and Replace" function). If the text said "password," flag it. This is like using a metal detector that only beeps for gold coins.
- Middle Age: They moved to Machine Learning, which is like teaching a dog to sniff out different smells. It's better, but it needs a lot of training.
- Modern Day: Now, most researchers use Transformers (like BERT), and a growing number are trying Large Language Models (LLMs like ChatGPT). These are like super-intelligent detectives who understand context. They know that "I will crack the safe" means something different than "I will crack a joke."
The "Black Box" Problem: A major issue the authors found is Reproducibility. Imagine a chef writes a recipe for a delicious cake but doesn't list the ingredients or the oven temperature. You can't bake it yourself.
- Many of these 80 studies are like that. They say, "We built a great system!" but they don't share their code or the data they used. This makes it impossible for other scientists to check if the cake actually tastes good or if the chef just made it up.

4. The Data Sources: Where the Clues Come From

The researchers looked at where these "detectives" get their information.

The Main Source: Most studies use CTI Reports (Cyber Threat Intelligence). These are like the official police reports written by security companies (like FireEye or Kaspersky).
Other Sources: Some use System Logs (like a security camera recording), Vulnerability Databases (lists of broken locks), or even Dark Web Forums (where hackers brag about their crimes).
The Gap: The authors noticed that while everyone loves reading police reports, very few people are analyzing the raw security camera footage (system logs) or the actual stolen goods (malware code).

5. The Future: What Needs to Happen?

The paper concludes with a "To-Do List" for the future, using these metaphors:

Stop Using Fake Scenarios: Many studies test their tools on clean, perfect data. It's like a driving test where the road is empty and the weather is perfect. We need to test these tools on messy, real-world data where the road is full of potholes and rain.
Share the Blueprints: We need more researchers to share their code and data. If we all share our blueprints, we can build a better house together instead of everyone reinventing the wheel.
Think in 3D: Currently, most tools look at one sentence at a time. But a crime story is a whole chapter. We need tools that understand the whole story, including the order of events and the connections between different parts of the report.
The "Multi-Task" Challenge: Real attacks are complex. A hacker might try to steal data and delete files at the same time. Current tools often try to pick just one thing. We need tools that can handle multiple goals at once.

The Bottom Line

This paper is a massive map of the territory. It tells us that we have made real progress in teaching computers to read hacker notes, moving from simple keyword matchers to super-smart AI detectives. However, the field is still a bit messy. We need better data sharing, more realistic testing, and tools that can understand the full story of an attack, not just isolated words.

If we fix these issues, we can build a security system that doesn't just react to attacks after they happen, but actually understands the enemy's playbook before they even strike.

1. Problem Statement

The cybersecurity landscape faces rapid growth in the volume and sophistication of cyberattacks, which has led to a surge in Cyber Threat Intelligence (CTI) reports. These reports contain unstructured text describing Tactics, Techniques, and Procedures (TTPs): the objectives, methods, and specific implementations of adversary behavior.

The Challenge: Manually extracting TTPs from unstructured CTI reports is labor-intensive, error-prone, and does not scale with the speed of modern threats.
The Gap: While numerous studies propose automated extraction methods (ranging from rule-based systems to Large Language Models), the field lacks a unified understanding. Existing research varies widely in objectives, datasets, ontologies (e.g., MITRE ATT&CK), and evaluation metrics, making it difficult to compare approaches or assess generalizability.
Goal: The authors aim to synthesize the state-of-the-art in automated TTP extraction to identify trends, limitations, and future research directions.
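
The three TTP levels described above can be pictured as a small record type. The sketch below is illustrative only: the `TTP` class, its field names, and the example sentence are assumptions made here, not structures from the reviewed paper. The identifiers are taken from MITRE ATT&CK (tactic TA0006 Credential Access, sub-technique T1110.001 Password Guessing).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTP:
    """One extracted adversary behaviour at three levels of detail (hypothetical schema)."""
    tactic_id: str        # ATT&CK tactic: the adversary's objective
    tactic_name: str
    technique_id: str     # ATT&CK technique or sub-technique: the method
    technique_name: str
    procedure: str        # free-text implementation detail from the report


def parent_technique(ttp: TTP) -> str:
    """Return the parent technique ID, dropping any sub-technique suffix."""
    return ttp.technique_id.split(".")[0]


# Invented example sentence, used only to show how the fields relate.
example = TTP(
    tactic_id="TA0006",
    tactic_name="Credential Access",
    technique_id="T1110.001",
    technique_name="Brute Force: Password Guessing",
    procedure="The actor tried common passwords against the VPN login page.",
)

print(parent_technique(example))
```

The procedure stays as free text because, unlike tactics and techniques, procedures have no fixed identifier to map to.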

2. Methodology

The authors conducted a Systematic Literature Review (SLR) following the guidelines by Kitchenham et al.

Search Strategy: They queried five major scholarly databases (IEEE Xplore, ACM Digital Library, ScienceDirect, SpringerLink, ACL Anthology) using keywords related to "MITRE ATT&CK" and "Tactics, Techniques, and Procedures."
Selection Criteria:
- Inclusion: Peer-reviewed, English, published between 2015 and June 2025, explicitly proposing novel TTP extraction methodologies.
- Exclusion: Non-peer-reviewed content (blogs, white papers), duplicates, and studies not focused on TTP extraction.
Process:
1. Initial search yielded 3,219 papers.
2. After deduplication and screening (title/abstract/full-text), 80 studies were selected for final analysis.
3. Inter-rater Reliability: Two authors independently screened papers, achieving a Cohen's kappa score of 0.86 (strong agreement).
4. Analysis: The selected studies were analyzed using open coding to categorize them across seven Research Questions (RQs) covering objectives, data sources, preprocessing, annotation, methodologies, evaluation metrics, and reproducibility.
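
For readers unfamiliar with the inter-rater statistic in step 3, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two screeners. The include/exclude decisions are invented for illustration and do not reproduce the reported 0.86.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed over labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical screening decisions for ten papers.
screener_1 = ["in", "in", "in", "out", "out", "out", "out", "out", "out", "out"]
screener_2 = ["in", "in", "out", "out", "out", "out", "out", "out", "out", "out"]

print(round(cohens_kappa(screener_1, screener_2), 3))
```

In this toy case the raters agree on 9 of 10 papers, yet kappa is noticeably lower than 0.9 because most of that agreement on "out" would also occur by chance.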

3. Key Contributions

The paper provides a structured taxonomy and critical analysis of the TTP extraction landscape:

Taxonomy of Extraction Objectives: Categorized studies into five primary goals:
- Tactic Classification (6 studies): Mapping text to high-level objectives (e.g., Persistence).
- Technique Classification (39 studies): The dominant task, mapping text to specific MITRE ATT&CK techniques.
- Technique Searching (5 studies): Retrieving techniques from text via semantic search.
- IOC Extraction & TTP (6 studies): Extracting Indicators of Compromise alongside TTPs.
- Knowledge Graph (KG) Construction (24 studies): Structuring extracted entities and relations into graphs.
Data Source Analysis: Identified that Benchmark Datasets/Public KBs (48 studies) and CTI Reports (28 studies) are the primary sources. Operational data like system logs and malware repositories are underutilized.
Methodological Evolution: Traced the shift from Rule-Based/Traditional ML (SVM, Naive Bayes), to Deep Learning/Transformers (BERT, RoBERTa, SecureBERT), to Large Language Models (LLMs) such as LLaMA and GPT-4, used with prompting and retrieval-augmented generation (RAG).
Reproducibility Assessment: Revealed a critical gap in open science. Only 12.5% of studies provided both code and data publicly. 50% provided no clear availability information, hindering independent validation.
Future Directions: Proposed a roadmap for the community, emphasizing grounded operational datasets, multi-label evaluation, and context-aware models.
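
As a quick consistency check on the figures above, the five objective counts can be tallied and the reproducibility percentages converted back to study counts. This is plain arithmetic on numbers quoted in this summary; the variable names are chosen here for illustration.

```python
# Study counts per extraction objective, as listed in the taxonomy above.
objective_counts = {
    "Tactic Classification": 6,
    "Technique Classification": 39,
    "Technique Searching": 5,
    "IOC Extraction & TTP": 6,
    "Knowledge Graph Construction": 24,
}

total = sum(objective_counts.values())  # 80, matching the number of selected studies
shares = {name: 100 * n / total for name, n in objective_counts.items()}

# Reproducibility percentages expressed as numbers of studies.
both_code_and_data = round(0.125 * total)  # 10 studies
no_clear_availability = round(0.50 * total)  # 40 studies

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
print(total, both_code_and_data, no_clear_availability)
```

Technique classification accounts for 48.75% of the studies and knowledge graph construction for 30%, so those two objectives together cover nearly four fifths of the corpus.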

4. Key Results and Findings

A. Dominant Trends

Task Formulation: Technique-level classification is the most common focus (39 of 80 studies, about 49%) and is often treated as single-label classification. Tactic classification (6 studies) and technique searching (5 studies) are significantly underexplored.
Modeling Approaches:
- The field has moved decisively toward Transformer-based architectures.
- Domain-adapted pretrained models (e.g., SecureBERT, CyBERT, SciBERT) are reported to outperform generic BERT models, which is attributed to better handling of cybersecurity jargon.
- LLMs are emerging (2023–2025) for few-shot learning, retrieval-augmented generation (RAG), and complex reasoning, though adoption is still nascent.
Data Sources: Most research relies on narrative text, drawn either from benchmark datasets and public knowledge bases or directly from CTI reports (from vendors such as FireEye and CrowdStrike). Few studies use system logs, network traffic, or malware binaries as primary inputs for TTP extraction, which limits real-time operational applicability.
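
To make the starting point of this evolution concrete, here is a minimal sketch of a rule-based extractor of the kind the Transformer-based approaches replaced. The technique IDs are real ATT&CK identifiers (T1110 Brute Force, T1566 Phishing, T1059 Command and Scripting Interpreter), but the keyword patterns and the `match_techniques` helper are invented here and are not taken from any reviewed study.

```python
import re

# Hypothetical keyword rules: one regular expression per ATT&CK technique ID.
RULES = {
    "T1110": re.compile(r"\b(brute[- ]force|password (guessing|spraying))\b", re.I),
    "T1566": re.compile(r"\b(spear[- ]?phishing|phishing)\b", re.I),
    "T1059": re.compile(r"\b(powershell|cmd\.exe|bash script)\b", re.I),
}


def match_techniques(sentence):
    """Return the sorted technique IDs whose keyword pattern occurs in the sentence."""
    return sorted(tid for tid, pattern in RULES.items() if pattern.search(sentence))


print(match_techniques("The actor sent a spear-phishing email that launched PowerShell."))
print(match_techniques("The intruders tricked staff into opening a booby-trapped invoice."))
```

The second sentence also describes phishing but matches no rule, which illustrates why keyword rules miss paraphrases that context-aware models are designed to catch.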

B. Evaluation and Reproducibility Issues

Metrics: Most studies rely on standard Precision, Recall, and F1-scores. However, many fail to address the multi-label nature of CTI (where one report contains multiple TTPs), often simplifying to single-label settings.
Generalization: Most studies evaluate on a single dataset, failing to test cross-domain generalization or robustness against distribution shifts (e.g., different report styles or time periods).
Reproducibility Crisis:
- 47.5% of papers share some resource (code OR data).
- Only 12.5% share both code and data.
- 50% of papers have unclear or no resource availability, making replication nearly impossible for many works.
- Many datasets are proprietary or suffer from "link rot," preventing long-term benchmarking.
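
The multi-label concern under Metrics can be made concrete with a small sketch. It computes micro- and macro-averaged F1 over sets of technique labels, using the standard definitions rather than any specific study's evaluation code. The gold and predicted labels are invented, and the predictor is deliberately limited to one label per report.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(gold, pred):
    """Micro- and macro-averaged F1 for multi-label predictions given as sets."""
    labels = sorted(set().union(*gold, *pred))
    per_label_f1 = []
    tp_all = fp_all = fn_all = 0
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        per_label_f1.append(f1_from_counts(tp, fp, fn))
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    micro = f1_from_counts(tp_all, fp_all, fn_all)  # pools counts across labels
    macro = sum(per_label_f1) / len(per_label_f1) if per_label_f1 else 0.0
    return micro, macro


# Hypothetical gold labels (several techniques per report) and single-label predictions.
gold = [{"T1566", "T1059"}, {"T1110"}, {"T1059", "T1105"}]
pred = [{"T1566"}, {"T1110"}, {"T1059"}]

micro, macro = micro_macro_f1(gold, pred)
print(round(micro, 3), round(macro, 3))
```

Every prediction here is correct, so precision is perfect, yet two of the five gold labels are never predicted. A single-label evaluation would hide that loss of recall, while the multi-label scores expose it.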

C. Limitations Identified in Literature

Annotation Quality: Many studies lack Inter-Annotator Agreement (IAA) scores, raising concerns about label reliability.
Context Loss: Most models operate at the sentence level, ignoring the broader narrative context, temporal ordering, and causal relationships between TTPs found in full reports.
Bias: Heavy reliance on English-language reports limits applicability to global threat landscapes.
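
One simple way to reduce the context loss described above is to give a classifier each sentence together with its neighbours. The sketch below shows only that input-preparation step. The `with_context` helper, its `window` parameter, and the example report are assumptions for illustration, and the reviewed studies may handle context quite differently.

```python
def with_context(sentences, window=1):
    """Pair each sentence with up to `window` neighbouring sentences on each side."""
    items = []
    for i, target in enumerate(sentences):
        left = sentences[max(0, i - window):i]
        right = sentences[i + 1:i + 1 + window]
        items.append({
            "target": target,                               # sentence to be labelled
            "context": " ".join(left + [target] + right),   # text the model would see
        })
    return items


# Invented three-sentence report.
report = [
    "A phishing email arrived.",
    "It dropped a loader.",
    "The loader ran PowerShell.",
]

for item in with_context(report):
    print(item["target"], "|", item["context"])
```

The middle sentence "It dropped a loader." is ambiguous on its own, but its window includes the phishing email that "It" refers to. A fixed window preserves local order only, so longer-range temporal and causal links would still need document-level modeling.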

5. Significance and Future Directions

This systematic review serves as a foundational reference for the cybersecurity and NLP communities. It highlights that while technical performance (accuracy/F1) has improved with deep learning, the operational relevance of these systems is compromised by poor reproducibility and narrow evaluation scopes.

Recommended Future Research Directions:

Grounded Datasets: Create and release high-quality, publicly available datasets derived from real-world, noisy operational CTI reports (not just curated benchmarks).
Multi-Label Evaluation: Shift from single-label classification to multi-label paradigms that reflect the complexity of real attack campaigns.
Context-Aware Extraction: Develop models that process document-level context to capture temporal dependencies and implicit relationships between tactics and techniques.
Robustness Testing: Implement cross-dataset and temporal validation to ensure models do not overfit to specific report styles or timeframes.
Adversary Emulation: Explore using CTI-extracted TTPs to generate realistic attack simulations for defensive training.
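
The temporal validation mentioned under Robustness Testing can be sketched as a split on publication date, so that a model is tested only on reports newer than anything it was trained on. The record layout, the `temporal_split` helper, and the dates below are hypothetical.

```python
from datetime import date


def temporal_split(reports, cutoff):
    """Train on reports published before `cutoff`, test on the rest."""
    train = [r for r in reports if r["published"] < cutoff]
    test = [r for r in reports if r["published"] >= cutoff]
    return train, test


# Invented report records; only the publication date matters for the split.
reports = [
    {"id": "r1", "published": date(2021, 3, 1)},
    {"id": "r2", "published": date(2022, 7, 15)},
    {"id": "r3", "published": date(2023, 1, 10)},
    {"id": "r4", "published": date(2024, 5, 2)},
]

train, test = temporal_split(reports, cutoff=date(2023, 1, 1))
print([r["id"] for r in train], [r["id"] for r in test])
```

Unlike a random split, this arrangement keeps later report styles and newly documented techniques out of the training set, which is the distribution shift the recommendation aims to measure.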

In conclusion, the paper argues that for automated TTP extraction to transition from academic research to operational utility, the community must prioritize reproducibility, diverse data sources, and rigorous, context-aware evaluation over simple accuracy metrics.