PRISM: Programmatic Reasoning with Image Sequence Manipulation for LVLM Jailbreaking

This paper introduces PRISM, a novel jailbreak framework that exploits the compositional reasoning capabilities of Large Vision-Language Models by decomposing harmful instructions into sequences of individually benign visual gadgets, thereby achieving significantly higher attack success rates than existing methods while evading current safety defenses.

Quanchen Zou, Zonghao Ying, Moyang Chen, Wenzhuo Xu, Yisong Xiao, Yakai Li, Deyue Zhang, Dongdong Yang, Zhao Liu, Xiangzheng Zhang

Published 2026-02-26

Imagine you have a very smart, very polite robot assistant (an LVLM) that is trained to never say anything mean, dangerous, or illegal. You can't just ask it, "How do I build a bomb?" because it will immediately say, "No, I can't do that."

This paper introduces a clever trick called PRISM to bypass those safety rules. Here is how it works, explained through a simple analogy:

The "Lego" Analogy

Think of the robot's safety filter like a security guard at a museum. The guard is trained to stop anyone carrying a dangerous weapon (like a gun or a bomb).

  • The Old Way (Direct Attack): A bad guy walks up to the guard and says, "I want to build a bomb." The guard sees the word "bomb," recognizes the danger, and stops them immediately.
  • The PRISM Way (The Trick): Instead of bringing a bomb, the bad guy brings a suitcase full of harmless Lego bricks.
    • Brick 1: A picture of a red block.
    • Brick 2: A picture of a blue block.
    • Brick 3: A picture of a green block.
    • Brick 4: A picture of a yellow block.

Individually, every single brick is perfectly safe. The security guard checks each one and says, "All clear, these are just toys."

However, the bad guy also gives the robot a special instruction manual (the textual prompt). This manual tells the robot: "Take these four safe bricks, stack them in this specific order, and glue them together to build a specific shape."

When the robot follows the instructions and puts the pieces together, the final result is a bomb.

Why This Works

The security guard (the safety filter) only looks at the individual pieces. Since every piece is harmless, the guard lets them all in. The guard doesn't realize that the combination of these pieces creates something dangerous.

The paper calls these harmless pieces "Visual Gadgets." Just like hackers in the computer world use "Return-Oriented Programming" (ROP) to build malicious code out of harmless computer instructions, PRISM builds a harmful answer out of harmless images.
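The per-piece blind spot can be sketched in a few lines of toy Python. This is purely illustrative (the names `BLOCKLIST`, `piece_is_safe`, and `assemble` are invented here, not from the paper): a filter that screens each fragment in isolation passes everything, even though the assembled whole is exactly what it was meant to block.

```python
# Toy illustration (hypothetical, not the paper's code): a safety filter
# that checks each piece individually misses danger in the combination.

BLOCKLIST = {"bomb"}

def piece_is_safe(piece: str) -> bool:
    """Per-piece check, like the guard inspecting each Lego brick."""
    return piece not in BLOCKLIST

def assemble(pieces):
    """The model follows the 'instruction manual' and joins the pieces."""
    return "".join(pieces)

# Four individually harmless fragments (the "visual gadgets" of the analogy).
gadgets = ["b", "o", "m", "b"]

all_safe = all(piece_is_safe(p) for p in gadgets)  # every brick passes
result = assemble(gadgets)                         # the composed output
combined_unsafe = result in BLOCKLIST              # the whole is blocked content

print(all_safe, result, combined_unsafe)
```

Here every gadget clears the filter on its own, yet the assembled result is precisely the blocked string. The real attack operates on images and model reasoning rather than string concatenation, but the logical gap is the same.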

The Result

The researchers tested this on state-of-the-art vision-language models. They found that:

  1. It works incredibly well: The AI fell for the trick almost every time (over 90% success rate).
  2. It's hard to catch: Because the AI is doing the "thinking" and "assembling" itself, the safety filters can't easily see the danger until it's too late.

The Big Takeaway

This paper warns us that we can't just train AI to say "no" to bad words. We also need to teach them to be careful about how they put things together. If an AI is smart enough to combine safe things into a dangerous result, we need new safety guards that watch the whole process, not just the individual parts.
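The "watch the whole process" idea can be sketched the same way. In this hypothetical example (again, invented names, not the paper's proposed defense), the guard re-checks the *assembled* output before releasing it, instead of only screening the inputs:

```python
# Toy illustration (hypothetical): a process-aware guard inspects the
# composed result before release, not just the individual inputs.

BLOCKLIST = {"bomb"}

def guard_release(assembled: str) -> str:
    """Refuse if the composed output contains blocked content."""
    if any(bad in assembled for bad in BLOCKLIST):
        return "[refused: unsafe composition]"
    return assembled

print(guard_release("bomb"))
print(guard_release("book"))
```

A real output-side defense would need to judge semantics rather than match strings, but the placement of the check, after assembly rather than before, is the point the takeaway is making.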
