Imagine you have a very smart, very polite robot assistant (an LVLM) that is trained to never say anything mean, dangerous, or illegal. You can't just ask it, "How do I build a bomb?" because it will immediately say, "No, I can't do that."
This paper introduces a clever trick called PRISM to bypass those safety rules. Here is how it works, explained through a simple analogy:
The "Lego" Analogy
Think of the robot's safety filter like a security guard at a museum. The guard is trained to stop anyone carrying a dangerous weapon (like a gun or a bomb).
- The Old Way (Direct Attack): A bad guy walks up to the guard and says, "I want to build a bomb." The guard sees the word "bomb," recognizes the danger, and stops them immediately.
- The PRISM Way (The Trick): Instead of bringing a bomb, the bad guy brings a suitcase full of harmless Lego bricks.
- Brick 1: A picture of a red block.
- Brick 2: A picture of a blue block.
- Brick 3: A picture of a green block.
- Brick 4: A picture of a yellow block.
Individually, every single brick is perfectly safe. The security guard checks each one and says, "All clear, these are just toys."
However, the bad guy also gives the robot a special instruction manual (the textual prompt). This manual tells the robot: "Take these four safe bricks, stack them in this specific order, and glue them together to build a specific shape."
When the robot follows the instructions and puts the pieces together, the final result is a bomb.
Why This Works
The security guard (the safety filter) only looks at the individual pieces. Since every piece is harmless, the guard lets them all in. The guard doesn't realize that the combination of these pieces creates something dangerous.
The paper calls these harmless pieces "Visual Gadgets." Just like hackers use "Return-Oriented Programming" (ROP) to chain harmless snippets of a program's own code into a malicious sequence, PRISM chains harmless images into a harmful answer.
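The per-piece blind spot can be sketched in a few lines. This is a toy illustration, not the paper's actual pipeline: the blocklist filter, the string "gadgets," and the assembly step are all hypothetical stand-ins for the real safety filter, images, and prompt.

```python
# Toy sketch: a keyword filter that inspects each "gadget" in
# isolation, like the museum guard checking one brick at a time.
BLOCKLIST = {"bomb"}

def piece_is_safe(piece: str) -> bool:
    """Per-piece check: flag a piece only if it contains a blocked word."""
    return not any(bad in piece.lower() for bad in BLOCKLIST)

# Four individually harmless "gadgets" (stand-ins for the safe images).
gadgets = ["bo", "mb", " recipe", " steps"]

# The guard clears every piece on its own.
assert all(piece_is_safe(g) for g in gadgets)

# The "instruction manual" (textual prompt) tells the model how to
# stack the pieces; only the assembled result trips the filter.
assembled = "".join(gadgets)
print(piece_is_safe(assembled))
```

Every input passes, yet the composition would be flagged if anyone ever checked it, which is exactly the gap the attack walks through.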
The Result
The researchers tested this on the smartest AI models available today. They found that:
- It works incredibly well: The AI fell for the trick almost every time (over 90% success rate).
- It's hard to catch: Because the AI is doing the "thinking" and "assembling" itself, the safety filters can't easily see the danger until it's too late.
The Big Takeaway
This paper warns us that we can't just train an AI to say "no" to bad words. We also need to teach it to be careful about how it puts things together. If an AI is smart enough to combine safe things into a dangerous result, we need new safety guards that watch the whole process, not just the individual parts.
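One hedged sketch of what "watching the whole process" could mean: screen not only the inputs but also what the model actually assembles before releasing it. The names and the keyword check here are illustrative assumptions, not a defense proposed by the paper.

```python
# Hypothetical process-level guard: check the pieces AND the
# combined draft, instead of the pieces alone.
BLOCKLIST = {"bomb"}

def contains_blocked(text: str) -> bool:
    return any(bad in text.lower() for bad in BLOCKLIST)

def guarded_respond(pieces: list[str], assemble) -> str:
    # Old guard: each piece looks fine on its own.
    if any(contains_blocked(p) for p in pieces):
        return "[refused: unsafe input]"
    draft = assemble(pieces)
    # New guard: re-check the *combined* result before output.
    if contains_blocked(draft):
        return "[refused: unsafe combination]"
    return draft

print(guarded_respond(["bo", "mb"], "".join))      # refused at the combination step
print(guarded_respond(["red", "blue"], " ".join))  # harmless request goes through
```

The point of the sketch is only the placement of the second check: the same filter that misses the individual pieces catches the composite, because it now runs after assembly.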