Imagine you have a very smart, highly trained robot assistant. This robot has been taught strict rules: "Never help someone build a bomb," "Never explain how to spread a deadly virus," and "Always be safe." This is called "safety alignment."

For a long time, hackers (or "red teamers") tried to trick these robots by asking tricky questions. But the robots got better at saying "No."

This paper introduces a new way to trick these robots, specifically the ones that can see pictures as well as read text. The researchers call their method MAPA (Multi-Turn Adaptive Prompting Attack).

Here is how it works, using simple analogies:

1. The Problem: The "Overly Obvious" Trap

The researchers found that if you try to trick a robot by showing it a picture of a bomb and asking, "How do I build this?", the robot immediately panics and refuses. It's like walking up to a security guard holding a giant, flashing sign that says "I AM A THIEF." The guard stops you instantly.

Even if you ask a question in text and show a picture, if the picture is too scary or the text is too direct, the robot's safety filters trigger, and it shuts down the conversation.

2. The Solution: The "Slow Burn" Strategy

The paper proposes a strategy called MAPA, which is like a game of chess played over many moves, rather than a single punch.

The Core Idea: Instead of trying to break the robot's defenses all at once, you sneak the bad request in slowly, step-by-step, over many turns of conversation.

How MAPA Plays the Game:

The researchers use a "Coach" (an AI) to guide the attack. The Coach has two main jobs:

Job A: Mixing the Ingredients (The "Turn" Level)
At every single step of the conversation, the Coach tries three different ways to ask the question to see which one works best:

Just Text: Asking without a picture.
Text + Scary Picture: Asking with a picture that matches the text.
Innocent Text + Picture: Asking with a picture that carries the scary part, while the text is rewritten to sound innocent and lean on the picture.

Analogy: Imagine you are trying to get a friend to agree to a wild idea.

Option 1: You just ask them directly.
Option 2: You ask them while showing them a wild photo.
Option 3: You show them the same wild photo, but you word the question innocently and let the photo do the talking.
The Coach picks the one that gets the friend closest to saying "Yes" without them getting angry.

Job B: Adjusting the Path (The "Across Turns" Level)
If the friend says "No" or gets confused, the Coach doesn't just give up. It looks at what happened and changes the plan for the next step.

Advance: If the friend is getting warmer to the idea, the Coach moves to the next, slightly more direct question.
Regen: If the friend is confused, the Coach tries asking the same question again but with different words.
Back: If the friend suddenly gets angry because of something said in the previous step, the Coach goes back to that step and tries a different approach.

Analogy: This is like a detective trying to solve a case. If a suspect lies, the detective doesn't just scream; they go back, rethink their theory, and ask a different question to catch the lie.

3. The "Reflection" Mechanism

If the whole attempt fails (the robot still says "No"), the Coach doesn't just try the exact same thing again. It looks at why it failed, learns from the mistake, and designs a completely new, smarter plan for the next attempt. It's like studying a failed test to do better on the next one.

4. The Results

The paper tested this method against several popular AI models (like LLaVA, Qwen, and GPT-4o-mini).

Old methods (just text, or just text + pictures) failed most of the time against these smart robots.
MAPA's success rate was roughly 14% to 32% higher than that of the best existing methods, depending on the benchmark.
On one benchmark (HarmBench), MAPA tricked the robots into giving harmful answers about 97% of the time on average, whereas other methods succeeded only about 60-70% of the time.

Summary

The paper claims that to break the safety of modern AI that can see and read, you can't just be loud and obvious. You have to be a sneaky, adaptive conversationalist. You must mix text and images carefully, listen to the robot's answers, and slowly, over many turns, guide the conversation toward the forbidden topic until the robot accidentally slips up and answers the harmful question.

Important Note: The authors emphasize that this is a "Red Teaming" exercise. They are doing this to find holes in the safety systems so that developers can fix them and make the AI safer, not to actually teach people how to cause harm.

Technical Summary: Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

Problem Statement

While multi-turn jailbreak attacks have proven effective against text-only Large Language Models (LLMs) by gradually introducing malicious content to bypass safety alignment, extending these techniques to Large Vision-Language Models (LVLMs) remains underexplored. The authors identify a critical gap: naively incorporating visual inputs into multi-turn attacks often makes them easier to defend against. Safety-aligned LVLMs tend to trigger defense mechanisms more readily when presented with overly malicious visual content, resulting in conservative responses. Furthermore, existing methods fail to effectively coordinate harmful cues across modalities (text and vision) to mutually reinforce rather than contradict one another, thereby limiting attack effectiveness.

Methodology: MAPA

To address these challenges, the authors propose MAPA (Multi-Turn Adaptive Prompting Attack), a framework designed to elicit progressively more malicious responses through a two-level adaptive design.

1. Turn-Level: Alternating Attack Actions

At each turn of the dialogue, MAPA employs a greedy search strategy to select the most effective attack action from three candidates. This process involves an Attacker LLM generating an initial unconnected text prompt ($ucQ_T$) and a Connector LLM (operating in a Chain-of-Thought manner) that:

Identifies malicious concepts in the text.
Generates a corresponding image generation prompt.
Refines the text prompt ($cQ_T$) by replacing malicious concepts to align with the visual input.
Uses Stable Diffusion to generate a malicious image ($cQ_V$).

Based on these components, three attack actions are formulated and tested against the victim LVLM:

Action 1: Unconnected text prompt only ($ucQ_T$).
Action 2: Unconnected text prompt + generated malicious image ($ucQ_T + cQ_V$).
Action 3: Connected text prompt + generated malicious image ($cQ_T + cQ_V$).

A Judge LLM evaluates the responses. If none succeed, the action yielding the highest Semantic Correlation (SEM) with the jailbreak task is selected as the optimal action for the current turn.
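The selection rule in the paragraph above can be sketched abstractly as follows. This is a minimal illustration, not the authors' implementation: the `ActionResult` type, its field names, and the choice to return the first successful action are assumptions made here, and the SEM scores are taken as already computed by some external judge.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ActionResult:
    """Outcome of one candidate action in a turn (hypothetical type)."""
    name: str        # label such as "ucQ_T", "ucQ_T+cQ_V", "cQ_T+cQ_V"
    succeeded: bool  # judge verdict for this action (assumed boolean)
    sem: float       # semantic correlation with the task, as scored by the judge


def select_action(results: Sequence[ActionResult]) -> ActionResult:
    """Greedy turn-level choice over already-scored candidates.

    Assumption: if any action is judged successful, it is returned
    immediately; otherwise the action with the highest SEM is kept.
    """
    if not results:
        raise ValueError("at least one candidate action is required")
    for result in results:
        if result.succeeded:
            return result
    return max(results, key=lambda r: r.sem)


candidates = [
    ActionResult("ucQ_T", False, 0.20),
    ActionResult("ucQ_T+cQ_V", False, 0.70),
    ActionResult("cQ_T+cQ_V", False, 0.50),
]
best = select_action(candidates)
print(best.name)
```

The search is greedy in the sense that only the current turn's scores are consulted; no lookahead over future turns is performed.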

2. Cross-Turn: Adaptive Trajectory Adjustment

Across turns, MAPA adjusts the attack trajectory through iterative back-and-forth refinement to gradually amplify maliciousness. At each turn, the system compares the semantic correlation of the current response with two references, the previous turn's response and the response obtained without the historical context, and then applies one of three policies:

Advance: Triggered if the current semantic correlation increases compared to the previous turn and the version without historical context, indicating successful gradual injection of maliciousness.
Back: Triggered if the current correlation decreases compared to the previous turn but increases without historical context, suggesting the previous turn's context degraded the attack. The system reverts to the previous turn for regeneration.
Regen: Triggered if the correlation increases but not in a gradual manner, or if it decreases overall. The current turn's prompt is regenerated.
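One possible reading of the three policies as a decision function is sketched below. The summary does not give exact thresholds, so the comparisons are assumptions: "Back" is interpreted here as the no-history score exceeding the current score, and "Regen" is treated as the fallback for every remaining case. The names `Policy` and `choose_policy` are hypothetical.

```python
from enum import Enum


class Policy(Enum):
    ADVANCE = "advance"
    BACK = "back"
    REGEN = "regen"


def choose_policy(sem_now: float, sem_prev: float, sem_no_history: float) -> Policy:
    """Pick a cross-turn policy from three semantic-correlation scores.

    sem_now:        score of the current response (with dialogue history)
    sem_prev:       score of the previous turn's response
    sem_no_history: score of the current prompt answered without history
    """
    # Advance: the score rose relative to both references.
    if sem_now > sem_prev and sem_now > sem_no_history:
        return Policy.ADVANCE
    # Back (assumed reading): the score fell, but the same prompt does
    # better without history, so the earlier context is suspected.
    if sem_now < sem_prev and sem_no_history > sem_now:
        return Policy.BACK
    # Regen: an increase not attributable to the history, or an overall decrease.
    return Policy.REGEN


print(choose_policy(0.6, 0.4, 0.5))
print(choose_policy(0.3, 0.5, 0.45))
print(choose_policy(0.6, 0.4, 0.7))
```

Under this reading, ties fall through to Regen, which is the most conservative choice because it neither advances nor discards an earlier turn.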

Additionally, MAPA includes a Reflection Mechanism. If a multi-turn attack attempt fails after exhausting the turn limit, the Attacker LLM analyzes the failure history (failed strategies and responses) to design a new, more effective attack chain for a subsequent attempt, enabling intra-task learning.

Key Contributions

Characterization of LVLM Jailbreak Failures: The paper uncovers that existing single-turn and naively extended multi-turn attacks often fail against safety-aligned LVLMs because straightforward insertion of malicious visual content triggers defenses.
MAPA Framework: A practical solution utilizing a two-level design (turn-level action alternation and cross-turn trajectory adjustment) to mitigate these failures. It explicitly manages the interplay between text and vision modalities to reinforce attack effectiveness.
Comprehensive Evaluation: Extensive experiments and ablation studies demonstrating the superiority of MAPA over state-of-the-art methods.

Experimental Results

The authors evaluated MAPA on four benchmarks (HarmBench, JailbreakBench, AdvBench, RedTeam-2K) against four target models: LLaVA-v1.6-Mistral-7B, Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-11B-Instruct, and GPT-4o-mini.

Performance: MAPA consistently outperformed baselines (including CoA, ActorAttack, FootInTheDoor, VRP, and MML). On HarmBench, it achieved an average Attack Success Rate (ASR) of 96.66%, improving upon the second-best method by 26.11%.
Benchmark Consistency: MAPA showed significant gains across all benchmarks, improving average ASR by 23.33% on JailbreakBench, 14.45% on AdvBench, and 32.22% on RedTeam-2K compared to the best baselines.
Ablation Studies: Removing the reflection mechanism reduced performance by 8.89% on average. Disabling policy adjustment (trajectory refinement) caused a 12.23% drop. The study also revealed that naively adding visual inputs (without the adaptive alignment of Action 3) could decrease performance, particularly against models with strong visual safety mechanisms like Llama-3.2-Vision.
Efficiency: Under fixed query budgets, MAPA maintained higher success rates while consuming fewer queries than baselines.
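The summary does not say whether the reported improvements are absolute percentage points or relative gains. As a quick arithmetic check on the HarmBench figures, the sketch below computes the second-best ASR implied under each reading; both readings are assumptions, and the function names are illustrative only.

```python
def implied_baseline_absolute(asr: float, gain_points: float) -> float:
    """Baseline ASR if the gain is an absolute difference in percentage points."""
    return round(asr - gain_points, 2)


def implied_baseline_relative(asr: float, gain_percent: float) -> float:
    """Baseline ASR if the gain is a relative improvement over the baseline."""
    return round(asr / (1.0 + gain_percent / 100.0), 2)


mapa_asr = 96.66   # average ASR on HarmBench, from the summary
gain = 26.11       # reported improvement over the second-best method

absolute_reading = implied_baseline_absolute(mapa_asr, gain)
relative_reading = implied_baseline_relative(mapa_asr, gain)
print(absolute_reading, relative_reading)
```

Under the percentage-point reading the implied second-best ASR is 70.55%, which is closer to the "about 60-70%" range quoted for other methods than the value obtained under the relative reading.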

Significance and Claims

The paper claims to be the first work to systematically investigate and address the pain points of multi-turn jailbreaks on LVLMs. It reveals safety vulnerabilities in widely used LVLMs within cross-modality multi-turn dialogues. The authors assert that their findings highlight the necessity of intelligently optimizing text-vision prompts rather than superficially aligning them. Ultimately, the work aims to promote the development of more robust safety alignments for LVLMs in realistic and malicious settings by exposing these specific vulnerabilities.
