Imagine you have a very smart, very polite robot assistant (an LVLM) that is trained to never say anything mean, dangerous, or illegal. You can't just ask it, "How do I build a bomb?" because it will immediately say, "No, I can't do that."
This paper introduces a clever trick called PRISM to bypass those safety rules. Here is how it works, explained through a simple analogy:
The "Lego" Analogy
Think of the robot's safety filter like a security guard at a museum. The guard is trained to stop anyone carrying a dangerous weapon (like a gun or a bomb).
- The Old Way (Direct Attack): A bad guy walks up to the guard and says, "I want to build a bomb." The guard sees the word "bomb," recognizes the danger, and stops them immediately.
- The PRISM Way (The Trick): Instead of bringing a bomb, the bad guy brings a suitcase full of harmless Lego bricks.
- Brick 1: A picture of a red block.
- Brick 2: A picture of a blue block.
- Brick 3: A picture of a green block.
- Brick 4: A picture of a yellow block.
Individually, every single brick is perfectly safe. The security guard checks each one and says, "All clear, these are just toys."
However, the bad guy also gives the robot a special instruction manual (the textual prompt). This manual tells the robot: "Take these four safe bricks, stack them in this specific order, and glue them together to build a specific shape."
When the robot follows the instructions and puts the pieces together, the final result is a bomb.
Why This Works
The security guard (the safety filter) only looks at the individual pieces. Since every piece is harmless, the guard lets them all in. The guard doesn't realize that the combination of these pieces creates something dangerous.
The paper calls these harmless pieces "Visual Gadgets." Just like hackers use "Return-Oriented Programming" (ROP) to chain harmless snippets of a program's own code into a malicious sequence, PRISM chains harmless images into a harmful answer.
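The per-piece blind spot can be sketched in a few lines. This is a toy illustration, not the paper's actual pipeline: the blocklist filter, the string "gadgets," and the assembly step are all hypothetical stand-ins for the real safety filter, images, and prompt.

```python
# Toy sketch: a keyword filter that inspects each "gadget" in
# isolation, like the museum guard checking one brick at a time.
BLOCKLIST = {"bomb"}

def piece_is_safe(piece: str) -> bool:
    """Per-piece check: flag a piece only if it contains a blocked word."""
    return not any(bad in piece.lower() for bad in BLOCKLIST)

# Four individually harmless "gadgets" (stand-ins for the safe images).
gadgets = ["bo", "mb", " recipe", " steps"]

# The guard clears every piece on its own.
assert all(piece_is_safe(g) for g in gadgets)

# The "instruction manual" (textual prompt) tells the model how to
# stack the pieces; only the assembled result trips the filter.
assembled = "".join(gadgets)
print(piece_is_safe(assembled))
```

Every input passes, yet the composition would be flagged if anyone ever checked it, which is exactly the gap the attack walks through.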
The Result
The researchers tested this on the smartest AI models available today. They found that:
- It works incredibly well: The AI fell for the trick almost every time (over 90% success rate).
- It's hard to catch: Because the AI is doing the "thinking" and "assembling" itself, the safety filters can't easily see the danger until it's too late.
The Big Takeaway
This paper warns us that we can't just train an AI to say "no" to bad words. We also need to teach it to be careful about how it puts things together. If an AI is smart enough to combine safe things into a dangerous result, we need new safety guards that watch the whole process, not just the individual parts.
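One hedged sketch of what "watching the whole process" could mean: screen not only the inputs but also what the model actually assembles before releasing it. The names and the keyword check here are illustrative assumptions, not a defense proposed by the paper.

```python
# Hypothetical process-level guard: check the pieces AND the
# combined draft, instead of the pieces alone.
BLOCKLIST = {"bomb"}

def contains_blocked(text: str) -> bool:
    return any(bad in text.lower() for bad in BLOCKLIST)

def guarded_respond(pieces: list[str], assemble) -> str:
    # Old guard: each piece looks fine on its own.
    if any(contains_blocked(p) for p in pieces):
        return "[refused: unsafe input]"
    draft = assemble(pieces)
    # New guard: re-check the *combined* result before output.
    if contains_blocked(draft):
        return "[refused: unsafe combination]"
    return draft

print(guarded_respond(["bo", "mb"], "".join))      # refused at the combination step
print(guarded_respond(["red", "blue"], " ".join))  # harmless request goes through
```

The point of the sketch is only the placement of the second check: the same filter that misses the individual pieces catches the composite, because it now runs after assembly.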