When Memory Becomes a Vulnerability: Towards Multi-turn Jailbreak Attacks against Text-to-Image Generation Systems

Imagine you are talking to a very talented but strict artist. This artist is an AI that draws pictures based on your words. However, this artist has a strict rulebook: "No drawing bombs, no drawing violence, no drawing anything dangerous." If you ask for a bomb directly, the artist immediately says, "Nope, I can't do that," and refuses to draw.

For a long time, hackers tried to trick this artist by rewriting their request in a single, sneaky sentence (like saying "make a round metal ball with a fuse" instead of "bomb"). But the artist's rulebook is smart; it often catches these tricks. Sometimes the trick works too well, and the artist draws something harmless (like a beach ball) instead of the dangerous thing the hacker wanted. Other times, the artist catches the trick and refuses anyway.

The New Discovery: The "Memory" Loophole

The researchers in this paper found a new way to trick the artist. They realized that modern AI artists have a memory. If you have a long conversation with them, they remember what you said earlier to help you refine your picture.

The Old Way: Trying to sneak the whole "bomb" idea into one sentence.
The New Way (Inception): Breaking the "bomb" idea into tiny, harmless crumbs and feeding them to the artist one by one over several turns.

Think of it like this: You want to build a dangerous machine, but the security guard (the safety filter) checks every box you bring in.

Turn 1: You bring in a box labeled "Iron Sphere." The guard checks it. It's just a metal ball. Safe. The artist remembers this.
Turn 2: You bring in a box labeled "Black Powder." The guard checks it. It's just a spice. Safe. The artist remembers this too.
Turn 3: You bring in a box labeled "Copper Wire." The guard checks it. It's just wire. Safe.

By the time the artist puts all these "safe" boxes together in their memory, they accidentally build a bomb. The guard never saw the whole picture because the pieces arrived separately.

How the Attack Works (The "Inception" Method)

The researchers named their tool Inception (like the movie where ideas are planted deep in the subconscious). It uses two main tricks:

Cutting the Cake (Segmentation):
Instead of asking for "A man making a bomb," the AI breaks that sentence down like a chef chopping ingredients. It separates "man," "making," and "bomb" into different sentences. It uses grammar rules to make sure the pieces still make sense together, just like a puzzle.
The Russian Doll (Recursion):
Sometimes, even a small piece like "bomb" is too dangerous for the guard. So, the AI opens that piece up like a Russian nesting doll.
- "Bomb" is too scary? Let's break it down: "Explosive projectile."
- "Explosive" is still too scary? Let's break that down: "Gunpowder and a fuse."
- "Gunpowder" is still risky? Let's break it down: "Salt, charcoal, and sulfur."
Suddenly, you aren't asking for a bomb anymore; you are asking for "salt, charcoal, and sulfur." The guard sees harmless ingredients, but the AI remembers that when you mix them, they make a bomb.

The "VisionFlow" Playground

To test this, the researchers built a fake art studio called VisionFlow. It's a simulation that acts exactly like real-world AI art tools (like DALL·E 3 or Midjourney). It has:

A memory system that remembers your chat history.
Security guards (filters) that check both your words and the final picture.
A way to test if the "Inception" trick works.

The Results: A Big Wake-Up Call

When they tested this on real-world AI systems:

Success Rate: The old tricks worked maybe 12% of the time. The new "Inception" trick worked 32% to 52% of the time. That's a huge jump!
Real-World Proof: They tried it on actual commercial apps (like DALL·E 3 and Google's Imagen), and it worked there too. The AI drew the dangerous images the researchers wanted, even though the apps have strict safety rules.

Why This Matters

This paper shows that memory is a double-edged sword.

Good: It helps the AI understand you better when you want to fix a drawing ("Make the sky bluer," "Add a cat").
Bad: It allows bad actors to hide dangerous ideas in plain sight by spreading them out over a conversation.

The researchers also tried to build new guards to stop this (like a "Memory Scanner" that looks at the whole conversation history at once). While these new guards helped a little, the "Inception" trick was still very hard to stop.

The Bottom Line

Just because you don't say the "bad word" in one sentence doesn't mean you aren't asking for something bad. If you break a dangerous idea into tiny, safe-looking pieces and feed them to an AI with a good memory, the AI might accidentally build the dangerous thing for you.

This research is a warning to AI companies: We need to check not just what you say right now, but what you've been saying all along.

Here is a detailed technical summary of the paper "When Memory Becomes a Vulnerability: Towards Multi-turn Jailbreak Attacks against Text-to-Image Generation Systems."

1. Problem Statement

Modern Text-to-Image (T2I) generation systems (e.g., DALL·E 3, Midjourney) have integrated memory mechanisms to support multi-turn interactions, allowing users to refine prompts iteratively. While this improves user experience, the paper identifies a critical security gap:

Limitations of Single-Turn Attacks: Existing jailbreak methods (search-based or LLM-based) attempt to convert a malicious target prompt into a single adversarial prompt. This often leads to:
- Under-detoxification: The safety filter still detects the malicious intent.
- Over-detoxification: The filter is bypassed, but the semantic meaning is lost, resulting in safe but irrelevant images.
The Memory Vulnerability: Current security analyses largely ignore the memory mechanism. Attackers can exploit the fact that T2I systems aggregate context from previous turns. By splitting a malicious intent into a sequence of benign-looking sub-prompts, the cumulative effect in the system's memory reconstructs the unsafe intent, bypassing filters that only inspect individual turns.
Challenges: Simply adapting LLM multi-turn attacks fails for T2I because:
1. Semantic Preservation: T2I models rely on cross-attention to map text to images. Aggregating disjointed prompts often causes semantic drift, failing to generate the intended image.
2. Stateless APIs: Many T2I APIs are stateless, making it difficult to test multi-turn vulnerabilities without a simulated environment.

2. Methodology: Inception

The authors propose Inception, the first multi-turn jailbreak framework specifically designed to exploit T2I memory mechanisms. The core intuition is to progressively implant "benign" requests that, when aggregated by the system's memory, encode a malicious intent.

A. The VisionFlow Emulator

To facilitate research, the authors built VisionFlow, a simulated T2I generation system that bridges the gap between stateless APIs and real-world chat-based services.

Features: It integrates three industrial-grade memory mechanisms (Buffer, Summary, and Vector-Store-Retriever), seven safety filters (input and output), and a plug-in generation model (Stable Diffusion 3.5).
Purpose: Allows systematic evaluation of multi-turn attacks under realistic conditions.

B. The Inception Attack Pipeline

Inception operates through two primary modules:

Semantic-Preserving Segmentation:
- Instead of random token splitting, Inception uses Natural Language Processing (NLP) techniques (e.g., SpaCy) to analyze the target prompt's Part-of-Speech (POS) tags and Dependency Trees.
- It decomposes the prompt into grammatically coherent phrases (Main-body and Modifiers) based on sentence structure.
- Goal: Ensure that each sub-prompt is semantically meaningful and grammatically correct, preserving the original intent while dispersing the "malice" across multiple turns.
Self-Correcting Recursion:
- If a segmented sub-prompt is still flagged as unsafe by the safety filter, Inception does not discard it. Instead, it recursively expands the blocked concept.
- It breaks the blocked term down into finer-grained, less malicious components (e.g., decomposing "bomb" $\rightarrow$ "explosive projectile" $\rightarrow$ "gunpowder, detonator" $\rightarrow$ "potassium nitrate, charcoal, sulfur").
- This process continues until all sub-queries pass the safety filter while collectively reconstructing the original unsafe concept.

3. Key Contributions

Threat Discovery: Revealed that memory mechanisms in T2I systems create a new attack surface where multi-turn interactions can bypass safety filters that are effective against single-turn prompts.
VisionFlow: Developed the first open-source, memory-integrated T2I emulation system supporting multi-turn interactions, diverse memory types, and comprehensive safety filtering for vulnerability auditing.
Inception Framework: Proposed a novel attack method that combines NLP-based semantic segmentation and recursive expansion to bypass filters while maintaining high semantic fidelity to the target image.
Comprehensive Evaluation: Validated the attack across 14 safety mechanisms, 3 memory types, and real-world commercial platforms (DALL·E 3, Imagen, Aurora).

4. Experimental Results

The authors evaluated Inception against State-of-the-Art (SOTA) baselines (e.g., SneakyPrompt, Chain-of-Attack) on datasets like VBCDE and UnsafeDiff.

Attack Success Rate (ASR):
- On the simulated system (OpenAI filters), Inception achieved an ASR of 32.3%, surpassing the best baseline (SneakyPrompt at 12.3%) by a 20.0% margin.
- On real-world commercial platforms, Inception achieved ASRs of 48.0% (DALL·E 3), 52.3% (Imagen), and 56.7% (Aurora), significantly outperforming all baselines.
Semantic Fidelity: Inception achieved the highest CLIP scores (measuring image-prompt alignment), proving it successfully preserves the intended unsafe semantics better than single-turn methods.
Efficiency: It required fewer queries on average compared to reinforcement-learning-based baselines.
Robustness: The attack remained effective even when tested against different memory mechanisms (BufferMem performed best for the attacker) and various combinations of input/output filters.

5. Significance and Implications

Security Paradigm Shift: The paper demonstrates that safety filters focusing solely on single-turn inputs are insufficient for modern, conversational T2I systems. The "cumulative effect" of memory must be considered in security design.
Defense Limitations: The authors tested potential defenses, including:
- Perplexity-based Detection: Failed due to high false positives on benign multi-turn refinements.
- Memory Scanners: Reduced ASR slightly but were ineffective against deep recursion because the system cannot easily recombine fragmented benign terms into a high-level malicious concept.
- Enhanced Output Moderators: Provided limited protection.
Call to Action: The results highlight an urgent need for new defense strategies that analyze the aggregated intent across conversation history rather than isolated prompts, and for safety mechanisms that can detect semantic drift in multi-turn generation.

In conclusion, Inception exposes a fundamental vulnerability in the architecture of modern T2I systems, proving that memory, intended to enhance usability, can be weaponized to bypass safety protocols with high efficiency and semantic accuracy.

When Memory Becomes a Vulnerability: Towards Multi-turn Jailbreak Attacks against Text-to-Image Generation Systems

How the Attack Works (The "Inception" Method)

The "VisionFlow" Playground

The Results: A Big Wake-Up Call

Why This Matters

The Bottom Line

1. Problem Statement

2. Methodology: Inception

A. The VisionFlow Emulator

B. The Inception Attack Pipeline

3. Key Contributions

4. Experimental Results

5. Significance and Implications

More like this

Keep Ballots Secret: On the Futility of Social Learning in Decision Making by Voting

Social Teaching: Being Informative vs. Being Right in Sequential Decision Making

Beyond Binomial and Negative Binomial: Adaptation in Bernoulli Parameter Estimation

Homotopy type theory as a language for diagrams of ∞\infty∞-logoses

One is all you need: Second-order Unification without First-order Variables

Homotopy type theory as a language for diagrams of $\infty$ -logoses