Imagine you are a master chef (the AI) who can cook a complex, multi-course meal (a video) just by reading a short recipe (the text prompt).
Usually, if you ask the chef to cook something dangerous or illegal—like "Make a video of a bomb being built"—the chef's safety guard (the filter) sees the word "bomb" and immediately stops you.
But this paper, titled "Two Frames Matter," discovered a clever loophole. It's like tricking the chef by only giving them the start and end of the recipe, and letting their imagination fill in the scary middle parts.
Here is the breakdown of how this works, using simple analogies:
1. The Problem: The "Obvious" Trap
Most previous attempts to trick these AI video makers were like trying to sneak a weapon into a secure building by painting it pink and calling it a "flower."
- The old way: You describe the violent fight in detail but relabel it a "dramatic dance."
- The result: The safety guard is smart. The prompt still has to spell out the fight for the video to come out right, so the filter sees "fight" or "violent" and blocks you anyway. The AI knows exactly what you want, but the bad words give you away.
2. The Discovery: The "Missing Middle"
The researchers found that these AI video models are trained to be storytellers. If you give them a beginning and an ending, they love to invent the middle part to make the story make sense.
- The Analogy: Imagine you ask a child to draw the first and last pictures of a story: a boy at the start, and the same boy at the end.
- If you say nothing else, the child might draw a boring boy sitting still.
- But if you whisper, "Start with a boy holding a match, end with a boy covered in soot," the child's brain automatically fills in the scary part: the explosion.
- The child didn't hear the word "explosion," but their brain knew exactly what happened in between.
The AI video models do the same thing. They have learned from millions of videos, so they know the "temporal trajectory" (how scenes typically unfold over time). If you give them a safe start and a safe end, but the context implies something bad, the AI will happily generate the dangerous middle frames on its own.
3. The Solution: The "Two-Frame" Trick (TFM)
The researchers built a tool called TFM (Two Frames Matter) that exploits this gap-filling behavior in two steps:
Step A: The "Time Traveler" (Temporal Boundary Prompting)
Instead of giving the AI a long, detailed script, TFM strips the prompt down to descriptions of just two frames:
- Frame 1 (The Start): "A person is holding a small object."
- Frame 2 (The End): "The person is covered in smoke."
- The Trick: The prompt says nothing about what happens in between. It leaves a huge gap. The AI, eager to be helpful, fills that gap with the most logical (but dangerous) sequence: lighting the object, the explosion, the smoke. (A rough sketch of this follows below.)
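To make this concrete, here is a minimal sketch of how a boundary-only prompt might be assembled. This is illustrative only: the helper name `build_boundary_prompt` and the exact template wording are assumptions, not the paper's actual implementation.

```python
# Minimal sketch of temporal boundary prompting (illustrative only).
# The template wording below is an assumption; the paper's real prompt
# construction is not reproduced here.

def build_boundary_prompt(first_frame: str, last_frame: str) -> str:
    """Describe only the first and last frames, leaving the middle unstated."""
    return (
        f"First frame: {first_frame} "
        f"Last frame: {last_frame} "
        "Generate a smooth, realistic video transitioning between them."
    )

prompt = build_boundary_prompt(
    first_frame="A person is holding a small object.",
    last_frame="The person is covered in smoke.",
)
print(prompt)
# Each sentence is innocuous on its own, so a text filter has nothing
# to block -- but the model must invent the middle of the story itself.
```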
Step B: The "Code Word" (Covert Substitution)
Even with just two frames, certain words (like "smoke") might still trigger the safety guard if they are too suggestive.
- The Trick: TFM uses a smart assistant (another AI) to swap the dangerous words for "code words" that sound innocent but mean the same thing to the video generator.
- Instead of "Explosion," it might say "A sudden burst of light."
- Instead of "Violence," it might say "A dramatic clash."
- The safety filter sees the code words and thinks, "Oh, that's fine!" But the video AI understands the hidden meaning and generates the bad stuff anyway (a toy version of this substitution is sketched below).
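The paper reportedly uses a helper AI to propose these substitutions on the fly; the fixed lookup table below is a simplified stand-in, just to show the mechanics.

```python
# Toy stand-in for covert substitution. TFM reportedly uses an auxiliary
# AI to invent euphemisms; this fixed table is a simplified illustration.

CODE_WORDS = {
    "explosion": "sudden burst of light",
    "violence": "dramatic clash",
    "smoke": "grey haze",  # hypothetical entry, for illustration
}

def covert_substitute(prompt: str) -> str:
    """Swap filter-triggering words for innocuous-sounding synonyms."""
    for trigger, euphemism in CODE_WORDS.items():
        prompt = prompt.replace(trigger, euphemism)
    return prompt

print(covert_substitute("The video shows smoke rising after the explosion."))
# -> "The video shows grey haze rising after the sudden burst of light."
# A real system would need context-aware rewriting: plain str.replace
# can mangle grammar or hit substrings inside other words.
```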
4. Why This Matters
The researchers tested this on popular commercial video AI tools (like Kling, Hailuo, and PixVerse).
- The Result: By using this "Two Frames" trick, they successfully bypassed safety filters 12% more often than any previous method.
- The Takeaway: It's not just about what you say in the prompt; it's about what the AI imagines in the silence between the words.
The Big Lesson
Current safety guards are like security guards checking a passenger's luggage. They look for "bombs" in the bag.
- The Old Attack: Trying to hide the bomb in the bag.
- The New Attack (TFM): Handing the guard an empty bag with a note saying, "Start here, end here." The guard sees an empty bag and lets it through. But the passenger (the AI) knows that between "Start" and "End," a bomb must have been built, so they build it in their mind and show it to you.
In short: The paper warns that we need new safety guards that don't just check the text prompt, but also check the story the AI tells itself in the middle of the video.
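What might that look like in practice? One hedged sketch (not from the paper itself): run an image-safety model over the frames the generator actually produces, so the "imagined middle" gets inspected too. The per-frame scores below are mock values standing in for a real classifier.

```python
# Sketch of frame-level moderation: check what the model GENERATED,
# not just what the user typed. The scores are mock values; plugging
# in a real image-safety classifier is assumed, not shown.

def is_video_safe(frame_unsafe_scores: list[float], threshold: float = 0.5) -> bool:
    """Reject the video if any generated frame looks unsafe."""
    return max(frame_unsafe_scores) <= threshold

# The prompt was "safe", but the imagined middle frame (0.90) is not.
print(is_video_safe([0.05, 0.10, 0.90, 0.15]))  # -> False
```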