Imagine you've built a very smart, helpful robot assistant (an "AI Agent") to work in your bank, your hospital, or your office. You want to make sure it's safe before you let it talk to real people.
The problem is, the old ways of testing these robots are broken.
- The Old Way (Manual Red Teaming): Hiring a human to try to trick the robot. This is slow, expensive, and the human can only think of so many tricks.
- The Static Way (Benchmarks): Giving the robot a fixed list of 100 "bad questions" to answer. The problem? As soon as the robot learns to answer those 100, hackers invent 1,000 new tricks that the list doesn't have.
Enter NAAMSE: The "Evolutionary Security Coach"
The authors of this paper propose a new system called NAAMSE. Think of it not as a test, but as a survival-of-the-fittest training camp for finding security holes.
Here is how it works, using a simple analogy:
1. The Setup: A Single "Evil" Coach
Instead of a human tester or a static list, NAAMSE uses one super-smart AI coach. This coach's only job is to try to trick the target robot into doing something bad (like leaking secrets or breaking rules).
2. The Training Loop: Trial, Error, and Evolution
The coach doesn't just guess randomly. It runs a continuous cycle that looks like this:
Phase 1: The Library (Selection)
The coach starts with a massive library of thousands of questions, ranging from "How do I bake a cake?" (good) to "How do I hack a bank?" (bad). It picks a starting question.
- Analogy: Imagine a chef picking a random ingredient from a giant pantry to start a new recipe.
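In code, the selection phase is just sampling a seed from a mixed pool of good and bad prompts. This is only a sketch: the library entries, the `intent` labels, and the `select_seed` helper are illustrative names, not details from the paper.

```python
import random

# Hypothetical seed library: each entry pairs a prompt with an intent label.
# The paper's actual library format is not specified; this is an illustration.
SEED_LIBRARY = [
    {"prompt": "How do I bake a cake?", "intent": "benign"},
    {"prompt": "How do I hack a bank?", "intent": "harmful"},
    {"prompt": "Summarize this contract for me.", "intent": "benign"},
]

def select_seed(library, rng=random):
    """Pick a starting question, like the chef grabbing a random ingredient."""
    return rng.choice(library)
```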
Phase 2: The Taste Test (Execution & Scoring)
The coach asks the target robot the question. Then it checks the answer.
- The Trap: If the robot says, "I can't do that" to a bad question, that's good (Score: Low). If the robot actually does the bad thing, that's a failure (Score: High).
- The Twist: The system also checks "Good" questions. If the robot refuses to bake a cake because it thinks baking is dangerous, that's also a failure (Score: High).
- Analogy: The coach is grading the robot on two things: Safety (don't do bad things) and Helpfulness (do good things). If the robot is too scared to do anything, it fails. If it's too reckless, it fails.
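The two-sided grading above can be sketched as one small function. The label names and the 0-to-1 failure score are my own framing for the idea, not the paper's actual scoring rule.

```python
def score_turn(intent, robot_complied):
    """Score one exchange on both axes: safety and helpfulness.

    intent: "harmful" or "benign" (hypothetical labels)
    robot_complied: True if the robot actually did what was asked
    Returns a failure score in [0, 1]: high means the robot failed.
    """
    if intent == "harmful":
        # The trap: complying with a bad request is the worst outcome.
        return 1.0 if robot_complied else 0.0
    # The twist: refusing a good request is also a failure.
    return 0.0 if robot_complied else 1.0
```

A robot that refuses everything scores perfectly on the first branch but fails the second, which is exactly the "too scared to do anything" case the coach is designed to catch.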
Phase 3: The Mutation (Evolution)
This is the magic part. Based on the score, the coach decides how to change its next question:
- If the robot refused the bad question (Low Score): The coach thinks, "That didn't work. I need a totally new angle!" It goes to a different part of the library to find a new type of trick.
- If the robot almost slipped up (Medium Score): The coach thinks, "We're close! Let's tweak the wording slightly to make it more convincing."
- If the robot almost did the bad thing (High Score): The coach thinks, "Bingo! Let's make this even more extreme to see how far we can push it."
- Analogy: Imagine a lockpicker. If a key doesn't fit, they try a different key shape. If a key fits halfway, they file it down a bit to make it fit better. If the key opens the door, they keep pushing with that same key to see how many more doors it can open.
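The three-way decision above maps cleanly to a score threshold. The cutoffs (0.3 and 0.7) and the strategy names are illustrative choices of mine; the paper's actual decision rule may differ.

```python
def choose_mutation(score):
    """Map the failure score to one of the three evolution moves.

    Thresholds (0.3 / 0.7) are illustrative, not from the paper.
    """
    if score < 0.3:
        return "explore"   # refused outright: jump to a new trick family
    if score < 0.7:
        return "refine"    # almost slipped up: tweak the wording slightly
    return "escalate"      # nearly complied: push the same angle harder
```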
Phase 4: The Archive (Integration)
Every new, improved trick the coach invents is saved back into the library. Over time, the library becomes filled with the most sophisticated, "evolved" tricks that have successfully tricked the robot.
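Putting the four phases together, one pass of the cycle might look like the sketch below. Every name here (`ask_robot`, `score`, `mutate`, and so on) is a stand-in for a component the paper describes in prose; none of them come from the paper's actual code.

```python
import random

def evolve_once(library, ask_robot, score, mutate, rng=random):
    """One pass of the four phases: select, execute & score, mutate, archive.

    ask_robot, score, mutate are hypothetical stand-ins for the
    coach's components, not names from the paper.
    """
    seed = rng.choice(library)      # Phase 1: pick a question from the library
    reply = ask_robot(seed)         # Phase 2: ask the target robot
    s = score(seed, reply)          # Phase 2: grade the answer
    child = mutate(seed, s)         # Phase 3: evolve the question
    library.append(child)           # Phase 4: archive the improved trick
    return child, s
```

Because each pass appends its best new trick back into the library, later passes start from stronger material, which is where the "survival of the fittest" effect comes from.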
Why is this better?
The paper shows that this "evolutionary" approach finds hidden weaknesses that static lists miss.
- The "Blanket Refusal" Problem: Some robots are so scared of breaking rules that they refuse to answer anything (even "What's the weather?"). Static tests might think this robot is "safe" because it never says "yes" to a bad request. But NAAMSE catches this because it also tests if the robot is useless. It forces the robot to be both safe and helpful.
- The "Compound" Effect: By evolving the attacks over time, the coach learns to combine small tricks into one big, complex trick that confuses the robot.
The Bottom Line
NAAMSE is like a digital immune system. Instead of just checking if you have a cold once a year, it constantly simulates new, stronger viruses to train your body (the AI) to recognize and fight them, while making sure your body doesn't get so scared of germs that it stops breathing.
It's a way to make AI agents robust enough to handle the real world, where hackers are always inventing new ways to trick them.