T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

Imagine you have a very smart, helpful robot assistant. This robot can do more than just chat; it can actually do things. It can send emails, write code, browse the web, and manage files. This is what we call an LLM Agent.

Now, imagine a group of security experts (the "Red Team") trying to break into this robot's system to see if it's safe. Their goal is to trick the robot into doing something bad, like stealing data or sending a virus.

The Old Way vs. The New Way

The Old Way (Chat-Only Red Teaming):
Previously, security experts treated the robot like a simple chatbot. They would ask tricky questions like, "Pretend you are a villain and write a phishing email."

The Problem: The robot might say, "I can't do that, it's against the rules!" or it might write a fake email in the chat window but never actually send it.
The Flaw: This only tests if the robot talks nicely. It doesn't test if the robot will actually act dangerously in the real world.

The New Way (T-MAP):
The paper introduces T-MAP (Trajectory-aware MAP-Elites). Think of T-MAP not as a single questioner, but as a master detective and evolutionary biologist combined.

How T-MAP Works: The "Evolutionary Detective" Analogy

Imagine T-MAP is running a massive, high-tech survival of the fittest competition for bad ideas.

The Map of Danger (The Archive):
T-MAP keeps a giant map (an archive) of different types of dangers (like "stealing money" or "leaking secrets") and different ways to trick the robot (like "pretending to be a boss" or "using fake history"). It wants to find the best trick for every single spot on this map.
The "Try, Fail, Learn" Loop:
Instead of just asking one question, T-MAP tries a trick.
- The Attempt: It asks the robot to do something.
- The Observation: It watches the robot's entire journey (the "trajectory"). Did the robot try to send the email? Did it get stuck? Did it fail because of a password error?
- The Diagnosis: This is the magic part. T-MAP has a "Doctor" (an AI analyst) that looks at the failure.
  - Example: "The robot tried to send the email, but it stopped because it said 'I need permission.' Okay, next time, let's try pretending we are the CEO to bypass that permission."
- The Evolution: T-MAP takes that lesson and creates a new, slightly better trick. It combines the "CEO" idea with the "Email" idea.
The Tool Call Graph (The Roadmap):
T-MAP builds a mental map of how tools connect. It learns that "Searching for emails" usually leads successfully to "Sending emails," but "Searching for emails" often leads to a crash if you try to "Delete files" immediately after. It uses this map to guide the robot down the path of least resistance toward the harmful goal.

Why This is a Big Deal

Think of the robot as a bank vault.

Old Red Teaming was like standing outside the vault and shouting, "Open the door!" If the robot said "No," the testers thought they were safe.
T-MAP is like a team of engineers who try to pick the lock, then try to cut the hinges, then try to trick the guard. If the robot refuses to open the door, T-MAP doesn't give up. It analyzes why it refused, changes the approach, and tries again until the door actually swings open and the money is gone.

The Results

The paper tested T-MAP on real-world scenarios (like sending phishing emails or deleting files).

Success Rate: While other methods failed most of the time (getting rejected or making errors), T-MAP succeeded in 57.8% of attempts.
Real-World Impact: It didn't just get the robot to say bad things; it got the robot to do bad things, like actually sending a virus or leaking private data.
Versatility: It worked even on the newest, most secure robots (like GPT-5.2 and Gemini-3-Pro).

The Takeaway

T-MAP is a powerful new tool for safety. It realizes that for AI agents, actions speak louder than words. By watching how an AI fails and learning from those failures, T-MAP can find hidden cracks in the system that other methods miss.

The Good News: This is being used to fix the robots before bad actors can use them. By finding these holes now, developers can patch them up, making our future AI assistants much safer to work with.

The Warning: It also shows us that as AI gets smarter and more capable of doing real-world tasks, the risk isn't just about what they say, but what they can do. We need to be just as careful about their actions as we are about their words.

1. Problem Statement

While traditional red-teaming focuses on eliciting harmful text outputs from Large Language Models (LLMs), the emergence of LLM Agents (systems that interact with external environments via tools) introduces a qualitatively different class of risks.

The Gap: Existing red-teaming methods often fail to capture vulnerabilities that only emerge through multi-step tool execution. An agent might refuse a harmful text prompt but still execute a sequence of tools that leads to tangible harm (e.g., data exfiltration, financial loss, or deploying malware) if the prompt is framed correctly.
The Challenge: Discovering these vulnerabilities is difficult because:
1. Attacks often require complex planning and specific sequences of tool calls rather than a single prompt-response turn.
2. Prior methods lack feedback from actual execution trajectories (i.e., whether the tools actually ran successfully or failed due to errors/permissions).
3. There is a need to map diverse attack strategies across different risk categories and attack styles systematically.

2. Methodology: T-MAP

The authors propose T-MAP (Trajectory-aware MAP-Elites), a novel red-teaming framework that combines the MAP-Elites quality-diversity search algorithm with trajectory-aware evolutionary feedback.

Core Architecture

T-MAP maintains a multi-dimensional archive $\mathcal{A}$ spanning Risk Categories ( $\mathcal{C}$ , e.g., property loss, data leakage) and Attack Styles ( $\mathcal{S}$ , e.g., role-playing, authority manipulation). The goal is to find the most effective attack prompt for every $(c, s)$ cell in this archive.

The framework operates through a four-step iterative cycle:

Cross-Diagnosis (Strategic Insight):
- The system selects a "Parent" cell (containing a successful elite prompt) and a "Target" cell (a new risk/style combination).
- An LLM Analyst analyzes the execution trajectories of both. It extracts Success Factors (what worked in the parent) and Failure Causes (why the target failed, e.g., safety refusal, tool error).
- This diagnosis guides the mutation process to inherit effective strategies while fixing specific failure points.
Tool Call Graph (TCG) Guidance (Structural Insight):
- T-MAP learns a Tool Call Graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{F}_\mathcal{G})$ , where nodes are tools and edges represent sequential transitions.
- Each edge stores metadata: success/failure counts ( $n_s, n_f$ ) and reasons ( $R_s, R_f$ ).
- The LLM Mutator uses this graph to prefer high-success tool sequences and avoid transitions with high failure rates, ensuring the generated prompts lead to viable execution paths.
Trajectory-Guided Mutation:
- The LLM Mutator generates a new candidate prompt $x'$ for the target cell. It synthesizes the cross-diagnosis insights and TCG structural guidance to create a prompt that bypasses guardrails and executes a realistic tool sequence.
Evaluation and Archive Update:
- The new prompt is executed on the target agent.
- An LLM Judge evaluates the full trajectory to assign a success level (0–3).
- If the new prompt is better (higher success level or more critical steps completed), it updates the archive.
- The LLM TCG updates the graph with the new edge-level outcomes (success/failure of specific tool transitions).

Success Levels

Unlike binary pass/fail metrics, T-MAP categorizes attacks into four levels:

L0 (Refused): Safety refusal or no tool calls.
L1 (Error): Tool calls issued but failed (e.g., permission errors, bad parameters).
L2 (Weak Success): Partial execution (e.g., reconnaissance done, but final harmful action missed).
L3 (Realized): Full execution of the harmful objective via tools.

3. Key Contributions

Formalization of Agent Red-Teaming: The paper defines attack success not by text generation, but by the realization of harmful objectives through actual tool execution.
T-MAP Algorithm: Introduces a trajectory-aware evolutionary search that integrates Cross-Diagnosis (strategic reasoning) and Tool Call Graphs (structural memory) to guide prompt evolution.
Comprehensive Evaluation: Demonstrates that T-MAP uncovers diverse, multi-step attack strategies that traditional baselines miss, effectively mapping the vulnerability landscape of autonomous agents.

4. Experimental Results

The authors evaluated T-MAP across five diverse MCP (Model Context Protocol) environments: CodeExecutor, Slack, Gmail, Playwright, and Filesystem.

Performance vs. Baselines: T-MAP significantly outperformed baselines (Zero-Shot, Multi-Trial, Iterative Refinement, and Standard Evolution).
- Attack Realization Rate (ARR): T-MAP achieved an average ARR of 57.8%, compared to 15.6% for the next best baseline (Iterative Refinement).
- Refusal Rate (RR): T-MAP reduced refusal rates to 12.5% on average, whereas baselines often exceeded 50-70%.
Diversity: T-MAP discovered 21.8 distinct successful tool trajectories (vs. 1.2 for Zero-Shot) while maintaining high semantic and lexical diversity (low Self-BLEU and cosine similarity).
Generalization:
- Frontier Models: T-MAP remained effective against state-of-the-art models including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5.
- Cross-Model Transfer: Attacks discovered on GPT-5.2 transferred successfully to other models, indicating the discovery of fundamental agent vulnerabilities rather than model-specific quirks.
Multi-MCP Chains: In complex scenarios requiring tool chaining across multiple servers (e.g., Slack + CodeExecutor), T-MAP achieved a 46.28% ratio of cross-server trajectories, far surpassing baselines (<23%).

5. Significance and Implications

Beyond Text: The paper highlights that safety guardrails focused solely on text generation are insufficient for agents. Real-world harm occurs when agents successfully execute tool chains.
Proactive Defense: T-MAP provides a systematic way to identify "blind spots" in agent safety, specifically regarding how agents plan and execute multi-step tasks.
Scalability: The use of MAP-Elites allows for a comprehensive mapping of the vulnerability space, ensuring that diverse attack vectors (different risks and styles) are explored rather than just finding a single "best" attack.
Future Safety: The findings suggest that as agents become more autonomous, safety evaluation must evolve from static prompt testing to dynamic, trajectory-aware red-teaming to prevent tangible harms like financial loss, data leaks, and physical safety violations.

Code Availability: The authors have released the code at https://github.com/pwnhyo/T-MAP.

T-MAP: Red-Teaming LLM Agents with Trajectory-aware Evolutionary Search

The Old Way vs. The New Way

How T-MAP Works: The "Evolutionary Detective" Analogy

Why This is a Big Deal

The Results

The Takeaway

1. Problem Statement

2. Methodology: T-MAP

Core Architecture

Success Levels

3. Key Contributions

4. Experimental Results

5. Significance and Implications

More like this

Founder effects shape the evolutionary dynamics of multimodality in open LLM families

From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs

Causal Direct Preference Optimization for Distributionally Robust Generative Recommendation

Graphs RAG at Scale: Beyond Retrieval-Augmented Generation With Labeled Property Graphs and Resource Description Framework for Complex and Unknown Search Spaces

Personalized Federated Sequential Recommender