SPARK: Jailbreaking T2V Models by Synergistically Prompting Auditory and Recontextualized Knowledge

This paper introduces SPARK, a jailbreak framework that exploits cross-modal associations in text-to-video models by combining neutral scene anchors, latent auditory triggers, and stylistic modulators to generate semantically unsafe videos that bypass safety guardrails while maintaining a benign appearance.

Zonghao Ying, Moyang Chen, Nizhang Li, Zhiqiang Wang, Wenxin Zhang, Quanchen Zou, Zonglei Jing, Aishan Liu, Xianglong Liu

Published Mon, 09 Ma
📖 4 min read☕ Coffee break read

Imagine you have a very advanced, magical movie machine. You type a sentence, and it instantly creates a realistic video. This is what Text-to-Video (T2V) models do.

But, like any powerful tool, these machines have safety guards. If you type "Show me a violent fight," the machine says, "No, that's against the rules," and refuses to make the video.

The Problem:
Researchers found that while the machine is very good at reading the words you type, it's also a "world simulator." It understands how the real world works: how sounds connect to actions, and how a specific mood changes a scene. The old ways of trying to trick the machine (like using fancy synonyms or hiding bad words) are failing because the safety guards are too smart.

The Solution: SPARK
The paper introduces a new method called SPARK. Instead of trying to trick the machine with a "bad word," SPARK tricks the machine by giving it a "safe recipe" that accidentally cooks up a "dangerous meal."

Here is how SPARK works, using a simple analogy:

The Analogy: The "Safe" Movie Director

Imagine you want to direct a movie about a bank robbery, but the censor (the safety guard) will block any script that says "robbery," "guns," or "stealing."

  • The Old Way (Naive Attack): You try to write "The guy is doing a financial transaction with a red liquid." The censor sees "financial transaction" and thinks, "Hmm, that sounds suspicious," and blocks it.
  • The SPARK Way: You write a script that sounds completely innocent, but uses three special ingredients that the movie machine knows go together in the real world.

SPARK mixes three ingredients:

  1. The Anchor (The Safe Setting):

    • What it is: A totally normal, boring description of a place.
    • Analogy: "A dimly lit room with a metal table."
    • Why it works: The censor sees this and thinks, "Okay, a medical room? A workshop? Totally fine."
  2. The Trigger (The Sound Effect):

    • What it is: A description of a sound that implies the bad action without naming it.
    • Analogy: "The sharp clink-clink of metal instruments hitting each other, followed by a scream."
    • Why it works: The machine knows that in the real world, metal clinking + screaming usually means surgery or violence. It doesn't need to see the word "knife" to know what's happening. It infers the action from the sound.
  3. The Modulator (The Vibe):

    • What it is: A style instruction that sets a tense atmosphere.
    • Analogy: "In the style of a gritty, suspenseful crime documentary."
    • Why it works: This tells the machine, "Make this look scary and real." It lowers the machine's guard because it thinks you just want a cool movie style, not because you are asking for something illegal.

The Result

When you feed this "safe" recipe to the machine:

  • The Censor (Text Guard): Reads the words, sees "metal table," "clinking," and "documentary style." It says, "All clear! No bad words here!"
  • The Movie Machine (Video Generator): Reads the whole picture. It hears the "clinking" and "screaming" in a "dimly lit room" with a "crime documentary" vibe. It thinks, "Ah, I get it! They want a scene of a violent struggle or a black-market surgery!" So, it generates the video of the bad thing.

Why is this a big deal?

The researchers tested this on 7 different top-tier video AI models (including commercial ones like Kling and Hailuo).

  • Old methods failed about 70% of the time because the safety guards caught the bad words.
  • SPARK succeeded about 60% of the time, which is a huge jump. It even worked when the safety guards were turned up to "strict mode" (blocking specific keywords).

The Takeaway

The paper reveals a scary but important truth: AI safety is currently too focused on checking the words you type, not understanding the story you are telling.

SPARK proves that you can bypass safety filters by using physics and logic (sound + style + setting) instead of just swapping out bad words. It's like bypassing a metal detector by carrying a bomb made of plastic instead of metal. The detector doesn't beep, but the danger is still there.

The researchers hope this discovery will force AI companies to build smarter safety systems that understand context and causality, not just a list of forbidden words.