When Memory Becomes a Vulnerability: Towards Multi-turn Jailbreak Attacks against Text-to-Image Generation Systems

This paper introduces Inception, the first multi-turn jailbreak attack that exploits the memory mechanisms of text-to-image systems by embedding malicious intent across segmented and recursively expanded conversational turns, achieving a 20% higher success rate than state-of-the-art methods in bypassing safety filters.

Shiqian Zhao, Jiayang Liu, Yiming Li, Runyi Hu, Xiaojun Jia, Wenshu Fan, Xiao Bao, Xinfeng Li, Jie Zhang, Wei Dong, Tianwei Zhang, Luu Anh Tuan

Published 2026-03-05
📖 5 min read🧠 Deep dive

Imagine you are talking to a very talented but strict artist. This artist is an AI that draws pictures based on your words. However, this artist has a strict rulebook: "No drawing bombs, no drawing violence, no drawing anything dangerous." If you ask for a bomb directly, the artist immediately says, "Nope, I can't do that," and refuses to draw.

For a long time, hackers tried to trick this artist by rewriting their request in a single, sneaky sentence (like saying "make a round metal ball with a fuse" instead of "bomb"). But the artist's rulebook is smart; it often catches these tricks. Sometimes the trick works too well, and the artist draws something harmless (like a beach ball) instead of the dangerous thing the hacker wanted. Other times, the artist catches the trick and refuses anyway.

The New Discovery: The "Memory" Loophole

The researchers in this paper found a new way to trick the artist. They realized that modern AI artists have a memory. If you have a long conversation with them, they remember what you said earlier to help you refine your picture.

  • The Old Way: Trying to sneak the whole "bomb" idea into one sentence.
  • The New Way (Inception): Breaking the "bomb" idea into tiny, harmless crumbs and feeding them to the artist one by one over several turns.

Think of it like this: You want to build a dangerous machine, but the security guard (the safety filter) checks every box you bring in.

  1. Turn 1: You bring in a box labeled "Iron Sphere." The guard checks it. It's just a metal ball. Safe. The artist remembers this.
  2. Turn 2: You bring in a box labeled "Black Powder." The guard checks it. It's just a spice. Safe. The artist remembers this too.
  3. Turn 3: You bring in a box labeled "Copper Wire." The guard checks it. It's just wire. Safe.

By the time the artist puts all these "safe" boxes together in their memory, they accidentally build a bomb. The guard never saw the whole picture because the pieces arrived separately.

How the Attack Works (The "Inception" Method)

The researchers named their tool Inception (like the movie where ideas are planted deep in the subconscious). It uses two main tricks:

  1. Cutting the Cake (Segmentation):
    Instead of asking for "A man making a bomb," the AI breaks that sentence down like a chef chopping ingredients. It separates "man," "making," and "bomb" into different sentences. It uses grammar rules to make sure the pieces still make sense together, just like a puzzle.

  2. The Russian Doll (Recursion):
    Sometimes, even a small piece like "bomb" is too dangerous for the guard. So, the AI opens that piece up like a Russian nesting doll.

    • "Bomb" is too scary? Let's break it down: "Explosive projectile."
    • "Explosive" is still too scary? Let's break that down: "Gunpowder and a fuse."
    • "Gunpowder" is still risky? Let's break it down: "Salt, charcoal, and sulfur."

    Suddenly, you aren't asking for a bomb anymore; you are asking for "salt, charcoal, and sulfur." The guard sees harmless ingredients, but the AI remembers that when you mix them, they make a bomb.

The "VisionFlow" Playground

To test this, the researchers built a fake art studio called VisionFlow. It's a simulation that acts exactly like real-world AI art tools (like DALL·E 3 or Midjourney). It has:

  • A memory system that remembers your chat history.
  • Security guards (filters) that check both your words and the final picture.
  • A way to test if the "Inception" trick works.

The Results: A Big Wake-Up Call

When they tested this on real-world AI systems:

  • Success Rate: The old tricks worked maybe 12% of the time. The new "Inception" trick worked 32% to 52% of the time. That's a huge jump!
  • Real-World Proof: They tried it on actual commercial apps (like DALL·E 3 and Google's Imagen), and it worked there too. The AI drew the dangerous images the researchers wanted, even though the apps have strict safety rules.

Why This Matters

This paper shows that memory is a double-edged sword.

  • Good: It helps the AI understand you better when you want to fix a drawing ("Make the sky bluer," "Add a cat").
  • Bad: It allows bad actors to hide dangerous ideas in plain sight by spreading them out over a conversation.

The researchers also tried to build new guards to stop this (like a "Memory Scanner" that looks at the whole conversation history at once). While these new guards helped a little, the "Inception" trick was still very hard to stop.

The Bottom Line

Just because you don't say the "bad word" in one sentence doesn't mean you aren't asking for something bad. If you break a dangerous idea into tiny, safe-looking pieces and feed them to an AI with a good memory, the AI might accidentally build the dangerous thing for you.

This research is a warning to AI companies: We need to check not just what you say right now, but what you've been saying all along.