Imagine you are a master chef (the AI Video Generator) creating delicious, high-quality videos. But a problem arises: how do you prove that a specific video came out of your kitchen, without ruining the taste of the dish or making it look fake?
This is the challenge of AI Watermarking.
Here is a simple breakdown of the paper "SIGMark," which introduces a new, smarter way to solve this problem.
1. The Problem: The "Sticky Note" vs. The "Magic Ingredient"
The Old Way (Post-Processing):
Imagine you bake a perfect cake, and then someone tries to write their name on it with a thick marker. It works, but it ruins the frosting. In video terms, this is adding a watermark after the video is made. It often makes the video look grainy or blurry.
The "Non-Blind" In-Generation Way (The Current Tech):
Some newer methods bake the name into the cake batter before it goes in the oven. This is great because the cake looks perfect. However, there's a catch: these methods are "non-blind," meaning the detector needs stored information about each specific video. To read the name later, the baker has to keep a giant, messy filing cabinet with a record of every single cake ever baked, and compare the new cake against every entry in the cabinet.
- The Flaw: If you make a million videos, your filing cabinet becomes impossible to manage. It's too slow and expensive. Also, if someone cuts a few slices out of the cake (removes frames), the baker can't find the right file to match it.
2. The Solution: SIGMark (The "Universal Recipe Card")
The authors propose SIGMark, a system that solves both the "filing cabinet" problem and the "cut cake" problem.
Part A: The "Universal Recipe Card" (Blind Extraction)
Instead of keeping a unique file for every video, SIGMark uses a Global Set of Keys.
- The Analogy: Imagine every cake you bake uses the exact same secret ingredient list (a "Global Key"), but the amount of that ingredient is shuffled randomly for every single cake.
- How it works:
- Baking: When the AI makes a video, it uses this global secret list to hide a tiny, invisible message in the initial random noise that the model gradually refines into the finished video.
- Reading: Later, when you want to check if a video is real, you don't need to look up who made it. You just use the same Global Secret List to decode the message.
- The Benefit: You don't need a giant database. You only need one small list of keys. This makes it scalable—you can check a billion videos as fast as you can check one.
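The embed-and-read loop above can be sketched in code. This is a toy illustration under assumed details (a sign-based code in Gaussian noise, a 16-bit message, the seed 42 standing in for the global key); SIGMark's actual construction differs, but the core property is the same: one shared key both embeds the message and blindly extracts it, with no per-video database.

```python
# Toy sketch of blind, global-key watermarking in the initial noise.
# All specifics (shapes, bit count, sign-based coding) are illustrative
# assumptions, not SIGMark's actual construction.
import numpy as np

NOISE_SHAPE = (4, 64, 64)                 # latent noise for one chunk (assumed)
N_BITS = 16                               # hidden message length (assumed)
N_ELEMS = int(np.prod(NOISE_SHAPE))

key_rng = np.random.default_rng(42)       # the secret seed IS the global key
KEY_SIGNS = key_rng.choice([-1.0, 1.0], size=N_ELEMS)  # shared +/-1 pattern
BIT_OF = key_rng.integers(0, N_BITS, size=N_ELEMS)     # bit carried per position

def embed(bits):
    """Shape fresh Gaussian noise so its signs encode the message."""
    noise = np.random.default_rng().standard_normal(N_ELEMS)
    bit_signs = np.where(np.asarray(bits)[BIT_OF] == 1, 1.0, -1.0)
    # Keep the Gaussian magnitudes (quality preserved); set each sample's
    # sign from key_sign * bit_sign.
    return (np.abs(noise) * KEY_SIGNS * bit_signs).reshape(NOISE_SHAPE)

def extract(noise):
    """Blind extraction: only the global key is needed, no per-video record."""
    votes = np.sign(noise.ravel()) * KEY_SIGNS      # +1 votes for bit = 1
    return [1 if votes[BIT_OF == b].sum() > 0 else 0 for b in range(N_BITS)]

msg = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
recovered = extract(embed(msg))           # round-trips with just the key
```

Note that `extract` never sees which video the noise came from: every video is decoded against the same `KEY_SIGNS`, which is exactly what makes checking a billionth video as cheap as checking the first.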
Part B: The "Time-Traveling Detective" (Handling Disturbances)
Modern video AI compresses frames in time, bundling several frames (say, 4) into one latent "time chunk." If someone edits the video (removing a few frames or shuffling their order), this grouping breaks and the hidden message is lost.
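A tiny example with made-up numbers shows why even one deleted frame is destructive: every chunk boundary after the cut shifts, so later frames get grouped with the wrong neighbors.

```python
# Toy illustration (made-up chunking, not the paper's): frames are bundled
# into temporal chunks of 4, and deleting one frame shifts every later boundary.
CHUNK = 4

def chunked(seq):
    return [seq[i:i + CHUNK] for i in range(0, len(seq), CHUNK)]

frames = list(range(12))            # frame indices 0..11
original = chunked(frames)          # [[0,1,2,3], [4,5,6,7], [8,9,10,11]]

edited = frames[:5] + frames[6:]    # an attacker deletes frame 5
broken = chunked(edited)            # [[0,1,2,3], [4,6,7,8], [9,10,11]]
```

After the cut, the second chunk mixes frames from two original groups and the last chunk is short, so a decoder that assumes fixed 4-frame groups reads garbage.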
SIGMark adds a Segment Group-Ordering (SGO) module.
- The Analogy: Imagine you have a deck of cards that was shuffled and had some cards ripped out. A normal person would be confused. But SIGMark is like a detective who looks at the motion between the cards.
- "Ah, this frame shows a car moving left, and the next one shows it further left. They must belong together!"
- "Wait, this frame shows a bird flying up, but the next one shows it on the ground. That's a jump cut! Let's re-order them."
- How it works: The system analyzes the movement (optical flow) to figure out the original order of the video chunks, even if the video was chopped up. It reassembles the puzzle pieces so the message can be read correctly.
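As a rough sketch of the idea (not the paper's actual SGO module, which operates on real optical flow): if you know the expected motion, the chunk whose first frame best "continues" another chunk's last frame is its likely successor. Here, 1-D frames with a dot drifting one step right per frame stand in for video, and a shifted-difference cost stands in for flow.

```python
# Toy sketch of motion-based chunk reordering. SIGMark's SGO module uses
# optical flow on real video; here 1-D "frames" with a dot moving one step
# right per frame stand in, and a shifted-difference cost stands in for flow.
import numpy as np
from itertools import permutations

WIDTH, CHUNK = 32, 4

def make_frame(pos):
    f = np.zeros(WIDTH)
    f[pos] = 1.0                               # a bright dot at position `pos`
    return f

# Ground truth: three 4-frame chunks; the dot advances one step per frame.
chunks = [np.stack([make_frame(CHUNK * c + t) for t in range(CHUNK)])
          for c in range(3)]
shuffled = [chunks[2], chunks[0], chunks[1]]   # an attacker shuffles the chunks

def transition_cost(a, b):
    """How badly chunk b continues chunk a (0 = perfectly smooth motion)."""
    expected_next = np.roll(a[-1], 1)          # advance the dot one step right
    return np.abs(expected_next - b[0]).sum()

# Brute-force the ordering with the smoothest overall motion (fine for three
# chunks; a real system would need a smarter assembly strategy).
best = min(permutations(range(len(shuffled))),
           key=lambda p: sum(transition_cost(shuffled[p[i]], shuffled[p[i + 1]])
                             for i in range(len(p) - 1)))
recovered = [shuffled[i] for i in best]        # original chunk order restored
```

The "detective" logic from the analogy lives in `transition_cost`: a pairing that matches the expected motion scores near zero, while a jump cut scores high, so the lowest-cost ordering is the original one.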
3. Why This Matters
- No Quality Loss: Because the watermark is baked in from the start, the video looks virtually indistinguishable from an unwatermarked one.
- Super Fast: You don't need to search a database. You just use the universal key. It's like checking a password instead of searching a library.
- Tough as Nails: Even if someone tries to delete parts of the video to hide the watermark, SIGMark can usually figure out the original order and still find the message.
Summary
SIGMark is like putting a universal, invisible fingerprint into AI videos at the moment of creation.
- It doesn't ruin the video quality.
- It doesn't require a massive database to check (it's "blind").
- It can fix itself if the video gets chopped up or edited.
It's a scalable, robust way to ensure that, in a world of infinite AI videos, we can always tell which ones came from a generator and which generator made them.