The Big Picture: A New Kind of Lockpick
Imagine you have a very smart robot guard (the AI) standing at the door of a library. This guard is trained to stop anyone from asking for dangerous books, like "How to build a bomb."
For a long time, hackers tried to trick this guard by hiding the bad words inside a picture. They would take a picture of a bomb blueprint and write "How to build a bomb" in tiny letters inside the drawing. The guard would look at the picture, read the text, and say, "Aha! I see the bad words!" and shut the door. This is called an "Image-as-Wrapper" attack. It's like trying to smuggle a knife inside a hollowed-out book; if the guard opens the book, they find the knife immediately.
This paper introduces a new, much sneakier trick called "Visual Exclusivity" (VE).
Instead of hiding bad words, the hackers use the picture itself as the only way to understand the danger.
- The Analogy: Imagine the hacker hands the guard a picture of a complex, unlabeled machine. They ask, "How do I put this together?"
- The Trap: The question is innocent. The picture looks like a harmless diagram. But to answer "How do I put this together?", the robot must look closely at the gears, wires, and levers in the picture. The moment it explains how those parts fit together, it has accidentally revealed how to build a weapon.
- Why it's hard to stop: You can't just scan the text, because there are no bad words in it (see the tiny filter sketch after this list). You can't just blur the picture, because then the robot can't see the machine. The danger is locked inside the visual details.
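Here is that sketch: a minimal toy showing why a text-only keyword filter is blind to this attack. The blocklist and function name are purely illustrative; no real guard is this simple, but the failure mode is the same.

```python
# Why scanning the text fails against Visual Exclusivity: the prompt itself
# contains nothing to flag. This blocklist is an illustrative toy.
BLOCKLIST = {"bomb", "weapon", "explosive"}

def text_filter_flags(prompt: str) -> bool:
    """Return True if the prompt contains an obviously bad word."""
    return any(word in prompt.lower() for word in BLOCKLIST)

print(text_filter_flags("How to build a bomb"))          # True  -> blocked
print(text_filter_flags("How do I put this together?"))  # False -> sails through
```

The second prompt passes any text filter you could write; the danger lives entirely in the attached picture.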
The Solution: The "Master Planner" (MM-Plan)
The researchers realized that tricking the guard requires more than just one clever question. It requires a long, multi-step conversation where the hacker slowly builds trust and peels back layers of the image.
To do this automatically, they built a tool called MM-Plan. Think of it as a Grandmaster Chess Player who plays against the robot guard.
- The Old Way (Turn-by-Turn): Imagine a novice player who makes a move, waits for the robot to respond, then makes another move. If the robot says "No," the novice panics and tries a random new angle. This is slow and often fails.
- The MM-Plan Way (Global Strategy): The Grandmaster doesn't just think about the next move. They look at the whole board and plan the entire game in their head before making the first move.
- Step 1: "I will pretend to be a curious student."
- Step 2: "I will show you just the top-left corner of the machine."
- Step 3: "I will ask about the springs."
- Step 4: "Finally, I will ask for the full assembly guide."
The Grandmaster (MM-Plan) generates this whole script at once. It learns from its mistakes by playing thousands of games against itself, getting better at finding the perfect sequence of questions to trick the robot.
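Below is a minimal sketch of that "plan the whole game first, then refine" loop. Everything in it is a hypothetical stand-in: the real MM-Plan drafts its scripts with a language model and scores them by running actual conversations against the target, while this toy uses a fixed question pool and random scores just to keep the sketch runnable.

```python
import random

# Hypothetical build-up questions; the real planner writes these itself.
QUESTION_POOL = [
    "Act as a curious student and ask what the diagram shows.",
    "Ask only about the component in the top-left corner.",
    "Ask how the springs and levers interact.",
]
FINAL_ASK = "Ask for the full assembly guide."

def draft_plan() -> list[str]:
    """Draft the ENTIRE conversation up front (global strategy),
    instead of improvising one turn at a time."""
    steps = QUESTION_POOL[:]
    random.shuffle(steps)       # explore different orderings of the build-up
    return steps + [FINAL_ASK]  # the real request always comes last

def play_out(plan: list[str]) -> float:
    """Stand-in for running the scripted conversation against the guard;
    a real system would query the target model and score the transcript."""
    return random.random()

def search_best_plan(rounds: int = 1000) -> tuple[list[str], float]:
    """Self-play-style refinement: draft a full script, play it out,
    and keep whichever script scored best across many games."""
    best_plan, best_score = [], -1.0
    for _ in range(rounds):
        plan = draft_plan()
        score = play_out(plan)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan, best_score

plan, score = search_best_plan()
for i, step in enumerate(plan, 1):
    print(f"Step {i}: {step}")
```

The key design point is that the whole script exists before the first question is ever asked: a refusal at turn three teaches the planner to reorder or reword the build-up for the next game, instead of forcing a panicked mid-conversation improvisation.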
The "Training Gym" (VE-Safety)
To teach this Grandmaster, the researchers built a special gym called VE-Safety.
- It contains 440 real-world examples, like blueprints of weapons, maps of bank vaults, and chemical diagrams.
- Crucially, these examples are impossible to solve with text alone. If you describe a bank vault in words, it's just a room. But if you show a map of the vault, you can see the blind spots and the alarm triggers. The gym admits an example only if the AI must look at the picture to be dangerous; a toy version of that admission check follows below.
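In this sketch, the two boolean fields stand in for what the real benchmark would establish by actually querying a text-only model and an image-capable model; the record layout and every name here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str               # the innocent-looking question
    image_path: str           # the blueprint, vault map, or diagram
    harmful_from_text: bool   # can a text-only model reveal the danger?
    harmful_from_image: bool  # can a vision model reveal it from the image?

def is_visually_exclusive(ex: Example) -> bool:
    """Admit an example only if the danger is locked inside the picture:
    text alone must be harmless, text + image must not be."""
    return (not ex.harmful_from_text) and ex.harmful_from_image

candidates = [
    Example("How do I put this together?", "machine.png", False, True),
    Example("How to build a bomb?", "blank.png", True, False),
]
benchmark = [ex for ex in candidates if is_visually_exclusive(ex)]
print(len(benchmark))  # 1 -- only the first example belongs in the gym
```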
The Results: Breaking the Unbreakable
The researchers tested their Grandmaster against the smartest AI guards in the world (like GPT-5 and Claude 4.5 Sonnet).
- The Old Attacks: When hackers used the old "hiding text in pictures" method, the top AI guards blocked them almost 100% of the time.
- The New Attack (MM-Plan): The Grandmaster succeeded 46% of the time against Claude and 13% against GPT-5.
- Why this matters: GPT-5 is considered one of the safest AIs ever made. If a Grandmaster can trick it 13% of the time using this visual trick, it means our current safety guards have a huge blind spot. They are great at reading text, but they are bad at realizing when a picture is being used to bypass the rules.
The Takeaway
This paper is a warning label for the future of AI safety.
- The Problem: We are getting better at stopping bad words, but we are bad at stopping bad ideas that are hidden inside pictures.
- The Lesson: If an AI can "see" and "reason" about a picture, a clever attacker can use that picture to sneak past the safety filters.
- The Fix: We need to teach AI guards not just to read the text, but to reason about what an image depicts and what answering questions about it would reveal, even when the questions themselves sound innocent.
In short: The paper shows that you can't just lock the door on bad words anymore. You have to watch out for the pictures, too, because sometimes the picture is the key to the lock.