Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

This paper introduces a comprehensive 4×6 Target × Technique taxonomy and benchmark coverage audit framework, derived from 932 security studies, which reveals that current LLM attack benchmarks collectively cover at most 25% of the threat surface and suffer from significant evaluation gaps and naming fragmentation.

Original authors: Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

Published 2026-05-15

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

The Big Picture: The "Safety Map" Problem

Imagine the world of Large Language Models (LLMs) as a massive, bustling city. The people building these models are like city planners trying to make sure the city is safe from criminals (hackers).

For a few years, researchers have been finding new ways for criminals to break into the city. But here's the problem: everyone is using different maps, different names for the same crimes, and different ways of measuring how safe the city is. Some researchers say, "We checked the banks!" while others say, "We checked the power plants!" But nobody has checked if the whole city is safe.

This paper is like a team of cartographers who decided to stop arguing about maps and instead build one giant, master map of every possible way to attack an AI. Then, they checked the security guards' (the "benchmarks") current patrol routes to see if they are actually covering the whole city.

The shocking finding? The security guards are only patrolling a tiny, crowded neighborhood (about 25% of the city), while huge, dangerous districts are completely empty and unguarded.


1. The Master Map (The Taxonomy)

The researchers looked at 932 scientific papers published between 2023 and 2026. They found over 6,300 mentions of different attacks.

The Naming Chaos:
Imagine if one type of car theft was called "Hotwiring" in one city, "Keyless Entry Theft" in another, and "The Silent Heist" in a third. That's what was happening with AI attacks.

  • The Example: The famous "GCG" attack (a way to trick AI) appeared under 29 different names across 376 papers.
  • The Fix: The team created a standard dictionary (a taxonomy) with 507 specific "leaves" (categories). They cleaned up the mess, merging the 29 names into one, so everyone speaks the same language.
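The name clean-up described above can be pictured as a lookup from every spelling found in the literature to one canonical label. The following is a minimal sketch, not the authors' tooling: the alias strings, the `ALIASES` table, the `canonicalize` helper, and the `UNMAPPED` label are all assumptions made up for illustration. Only the expansion of GCG as "Greedy Coordinate Gradient" is a real name.

```python
from collections import Counter

# Hypothetical alias table: these surface forms are invented for illustration
# and are NOT the 29 names reported in the paper.
ALIASES = {
    "gcg": "GCG",
    "greedy coordinate gradient": "GCG",
    "gcg attack": "GCG",
    "gcg suffix": "GCG",
}


def canonicalize(mention: str) -> str:
    """Map a raw attack mention to its canonical name, or flag it as unmapped."""
    # Normalise case, hyphens, and repeated whitespace before the lookup.
    key = " ".join(mention.lower().replace("-", " ").split())
    return ALIASES.get(key, "UNMAPPED")


# Invented mentions, as they might be written in different papers.
mentions = [
    "GCG",
    "Greedy Coordinate Gradient",
    "gcg-attack",
    "GCG  suffix",
    "some new attack",
]

# Count mentions per canonical name instead of per spelling.
counts = Counter(canonicalize(m) for m in mentions)
print(counts)
```

Counting after normalisation is what makes attacks comparable across papers: four spellings collapse into one entry, and anything without a known alias is surfaced for manual review rather than silently counted as new.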

The Structure:
They organized these attacks into a 4×6 grid (a matrix):

  • The Rows (The Goal): What does the criminal want?
    1. Safety Bypass: Making the AI say something bad (like hate speech).
    2. System Hijacking: Making the AI do something bad (like delete files or send money).
    3. Information Theft: Stealing secrets from the AI's memory.
    4. Service Disruption: Making the AI so slow or expensive to use that it crashes (like a traffic jam).
  • The Columns (The Method): How do they do it?
    • Six technique families, such as using confusing words, lying to the AI, hiding code, or attacking the AI's internal brain.

2. The Security Check (The Benchmark Audit)

The researchers took the three most famous "security tests" used by the industry (HarmBench, InjecAgent, and AgentDojo) and plotted them onto their Master Map.

The Result:

  • Zero Overlap: The three tests don't even check the same things. They are like three different fire departments: one only checks kitchens, one only checks garages, and one only checks basements. None of them check the whole house.
  • The Coverage Gap: Together, these tests cover at most 25% of the Master Map.
  • The Blind Spots:
    • The "Service Disruption" District: This is where attacks try to make the AI crash or cost a fortune to run. Zero of the major tests check this.
    • The "Model Internals" District: This is where hackers tweak the AI's internal brain directly. Zero tests check this either.
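The audit above boils down to set arithmetic over the cells of the 4×6 grid. Here is a minimal sketch under stated assumptions: the four target names come from the article, but the technique labels (other than a `model_internals` column) are placeholders, the benchmarks are given generic names, and the cell assignments are invented so that the toy numbers resemble the reported pattern. This is not the paper's actual mapping of HarmBench, InjecAgent, or AgentDojo.

```python
from itertools import product

TARGETS = ["safety_bypass", "system_hijacking", "information_theft", "service_disruption"]
# Placeholder technique labels (assumption): the article does not list all six names.
TECHNIQUES = ["technique_a", "technique_b", "technique_c",
              "technique_d", "technique_e", "model_internals"]
GRID = set(product(TARGETS, TECHNIQUES))  # 4 x 6 = 24 cells

# Hypothetical benchmark-to-cell assignments, invented for illustration only.
COVERAGE = {
    "bench_a": {("safety_bypass", "technique_a"), ("safety_bypass", "technique_b")},
    "bench_b": {("system_hijacking", "technique_c"), ("information_theft", "technique_c")},
    "bench_c": {("system_hijacking", "technique_d"), ("system_hijacking", "technique_e")},
}


def audit(coverage, grid):
    """Summarise union coverage, pairwise overlap, and fully empty rows/columns."""
    covered = set().union(*coverage.values())
    if covered - grid:
        raise ValueError(f"cells outside the grid: {covered - grid}")
    names = sorted(coverage)
    overlaps = {
        (a, b): len(coverage[a] & coverage[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
    return {
        "fraction": len(covered) / len(grid),
        "overlaps": overlaps,
        "empty_targets": [t for t in TARGETS if all(c[0] != t for c in covered)],
        "empty_techniques": [q for q in TECHNIQUES if all(c[1] != q for c in covered)],
    }


report = audit(COVERAGE, GRID)
print(report)
```

With these toy assignments the union covers 6 of 24 cells, no two benchmarks share a cell, and both the Service Disruption row and the model-internals column come back empty. Swapping in a real benchmark-to-cell mapping is the only change needed to run the same audit on actual data.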

Why this matters:
The paper found that attacks in these "blind spot" areas are actually very effective. Some can make the AI run 46 times slower or cost 100 times more to use, yet no one is testing for them. It's like having a fire alarm that only beeps when you burn toast, but stays silent when the whole house is on fire.
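A slowdown or cost figure of this kind is a ratio between an attacked run and a normal run. The sketch below shows one plausible way to compute such a factor; the `amplification` helper, the use of medians, and all the latency values are assumptions for illustration, not measurements or methods from the paper.

```python
from statistics import median


def amplification(baseline, attacked):
    """Return how many times larger the attacked median is than the baseline median."""
    if not baseline or not attacked:
        raise ValueError("both samples must be non-empty")
    base = median(baseline)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return median(attacked) / base


# Invented latency samples in seconds (not results from the paper).
baseline_latency_s = [2.0, 2.5, 1.5]
attacked_latency_s = [20.0, 30.0, 25.0]

latency_factor = amplification(baseline_latency_s, attacked_latency_s)
print(f"latency amplification: {latency_factor:.1f}x")
```

The same function applies to token counts or billing cost per request. A benchmark for this district could then flag any prompt whose factor exceeds a chosen threshold.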


3. The "Catch-All" Problem

The researchers noticed that because the field is moving so fast, many scientists just throw new attacks into a "Miscellaneous" bucket because they don't have a specific name yet.

  • The Analogy: It's like a library where 60% of the new books are just shoved into a box labeled "Stuff."
  • The Impact: This makes it hard to know how many real new attacks are happening. The paper suggests we need better names so we can actually count the threats.

4. What They Did With This

Instead of just pointing out the holes, the researchers released their tools to the public:

  1. The Master Map: A downloadable list of all 507 attack types.
  2. The Audit Report: A clear chart showing exactly which parts of the map are being tested and which are ignored.
  3. A Blueprint for New Tests: They showed how to use their map to build a new test specifically for the "blind spots" (like the Service Disruption district).

The Bottom Line

The paper argues that we are currently "testing for the wrong things." We are very good at testing whether an AI will say something rude, but we are terrible at testing whether an AI can be made to crash, leak data, or get hijacked to do real-world damage.

The Takeaway: Just because a security test says an AI is "safe" doesn't mean it is. It just means it passed the specific, narrow test it was given. To truly be safe, we need to expand our testing to cover the whole city, not just the neighborhood we're currently comfortable with.
