The Big Picture: The Invisible Ink Problem
Imagine you buy a painting from a famous artist. To prove it's real, the artist uses a special invisible ink to sign the back of the canvas. You can't see the ink with your naked eye, but if you shine a special UV light on it, the signature glows, proving the painting is authentic.
In the world of AI, companies are doing something similar with images. They use Semantic Watermarks to sign AI-generated pictures.
- Old Way (The "Noise" Signature): They hid the signature in the "static" or "fuzz" of the image (like the grain in a photo). If you tried to edit the photo, the static would get messed up, and the signature would vanish.
- New Way (The "Meaning" Signature): To stop people from just erasing the static, researchers created a smarter system (like SEAL). Instead of hiding the signature in the static, they tied it to the meaning of the image.
- The Rule: "If you change the meaning of the image (e.g., turn a dog into a cat), the signature breaks."
- The Goal: This forces attackers to keep the image looking exactly the same, making it hard to forge or remove the watermark without ruining the picture.
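The rule above can be sketched as a toy verifier. This is a hypothetical stand-in, not SEAL's actual mechanism: here the image's "meaning" is modeled as a set of descriptive tags, and the detector accepts an edit only if the tag sets stay similar (real systems compare semantic embeddings instead).

```python
# Toy semantic-watermark check: the signature is tied to the image's
# MEANING, modeled here as a set of descriptive tags. The tag sets and
# the 0.7 threshold are illustrative stand-ins for a real embedding check.

def semantic_match(original_tags: set, edited_tags: set,
                   threshold: float = 0.7) -> bool:
    """Jaccard similarity between meanings; the watermark 'holds' if high."""
    overlap = len(original_tags & edited_tags)
    union = len(original_tags | edited_tags)
    return overlap / union >= threshold

original = {"dog", "sitting", "park", "daylight"}

# Pixel-level edits (cropping, compression) leave the meaning intact,
# so the watermark survives:
compressed = {"dog", "sitting", "park", "daylight"}
print(semantic_match(original, compressed))   # True

# A meaning change (dog -> cat) drops the similarity to 3/5 = 0.6,
# below the threshold, so the signature breaks:
swapped = {"cat", "sitting", "park", "daylight"}
print(semantic_match(original, swapped))      # False
```

This is exactly the trade-off the designers intended: cosmetic edits pass, meaning changes fail.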
The New Threat: The "Smart Editor" (LLM)
The authors of this paper discovered a loophole. They realized that while humans might struggle to change a picture's details without breaking its "meaning," Large Language Models (LLMs), the same AI systems that power chatbots, are experts at exactly this.
Think of an LLM as a Master Chef who knows exactly how to swap ingredients in a recipe without changing the flavor of the dish.
- If you ask a human to "change the dog to a cat but keep the same pose and background," they might struggle to make it look natural.
- If you ask an LLM, it can instantly generate a prompt that says, "A fluffy cat sitting in the exact same pose as the dog, with the same lighting and background."
The Attack: "Coherence-Preserving Semantic Injection" (CSI)
The researchers built a tool called CSI (Coherence-Preserving Semantic Injection). Here is how it works, step-by-step:
- The Setup: They take an AI image that has a "Meaning Signature" (like the SEAL watermark).
- The LLM Guide: They ask the LLM to write a new description (prompt) for the image. The LLM is told: "Change the subject slightly (e.g., change a red ball to a blue ball), but keep the overall story, the background, and the vibe exactly the same."
- The Magic Copy-Paste: Crucially, they don't just generate a new image from scratch. They take the original "noise" (the invisible ink) from the watermarked image and feed it into the AI along with the LLM's new prompt.
- The Result: The AI regenerates the image using the new description but the old invisible ink.
- Visually: The image looks slightly different (the ball is now blue).
- Semantically: The "story" of the image is still coherent (it's still a ball in a park).
- The Watermark: Because the "story" didn't change drastically, the watermark detector thinks, "Hey, this still matches the original meaning!" and fails to flag it as fake.
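The steps above can be simulated in miniature. Everything in this sketch is hypothetical: the "image" is just a (noise seed, prompt) pair, the detector keys the watermark to the coarse "story" of the prompt, and the LLM rewrite is mocked as a hand-written string. It illustrates the logic of the attack, not the paper's actual pipeline.

```python
# Toy CSI sketch: reuse the ORIGINAL noise (the "invisible ink") while an
# LLM rewrites the prompt in a meaning-preserving way. All names and the
# coarse-meaning rule below are illustrative assumptions.

def coarse_meaning(prompt: str) -> str:
    """Keep only the 'story': drop fine attributes like colors (toy rule)."""
    fine_attributes = {"red", "blue", "green", "small", "large"}
    return " ".join(w for w in prompt.split() if w not in fine_attributes)

def detect_watermark(candidate_seed: int, candidate_prompt: str,
                     original_seed: int, original_prompt: str) -> bool:
    """Pass only if the hidden noise matches AND the story is coherent."""
    return (candidate_seed == original_seed and
            coarse_meaning(candidate_prompt) == coarse_meaning(original_prompt))

original_seed = 1234                      # the noise baked into generation
original_prompt = "a red ball in a park"

# CSI step: a (mocked) LLM changes a detail but preserves the story,
# and the attacker regenerates with the ORIGINAL seed.
llm_rewrite = "a blue ball in a park"
print(detect_watermark(original_seed, llm_rewrite,
                       original_seed, original_prompt))   # True: forgery passes

# A naive attacker who resamples the noise is still caught:
print(detect_watermark(9999, llm_rewrite,
                       original_seed, original_prompt))   # False
```

The detector's blind spot is the gap between "same story" and "same image": any edit that stays inside the coarse-meaning bucket while reusing the original noise sails through.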
The Analogy: The "Perfect Forgery"
Imagine a security guard at a museum checking for a specific handshake between the painting and its frame.
- Old Attackers: Tried to break the frame or paint over the handshake. The guard immediately caught them.
- The CSI Attack: The attacker uses a robot (the LLM) to gently swap the painting's subject (a dog for a cat) but keeps the exact same handshake between the new subject and the frame.
- The Guard's Dilemma: The guard looks at the handshake and says, "Everything is perfect! The connection is strong!" So, the guard lets the fake painting pass, even though the dog is now a cat.
Why This Matters
The paper proves that current "smart" watermarks are not smart enough.
- They assumed that changing an image would inevitably break the "meaning" connection.
- They didn't account for AI (LLMs) that are so good at language and logic that they can change the details of an image while keeping the "meaning" intact.
The Takeaway:
Just because a watermark is tied to the "meaning" of an image doesn't mean it's safe. If an AI can rewrite the story of the image without breaking the plot, it can trick the security system. The authors are warning that we need new, stronger security measures that can detect these subtle, AI-guided changes, not just obvious ones.