NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion

Imagine you have a magical artist named Diffusion. This artist is incredibly talented at painting pictures and making videos based on your descriptions. If you say, "Draw a sunny beach with a dog," Diffusion will happily create a perfect sunny beach with a happy dog.

But Diffusion has a major blind spot: Negation.

If you say, "Draw a sunny beach with no people," Diffusion gets confused. It often ignores the "no" and paints a crowded beach anyway, or it gets so scared of the word "no" that it erases the whole beach, leaving you with a blank white canvas. It treats "no people" as if you just forgot to mention people, rather than a strict rule.

This paper introduces a new way to talk to Diffusion, called Constrained Semantic Guidance. Think of it as giving the artist a traffic cop and a bouncer to help them follow your "no" rules.

Here is how it works, broken down into simple analogies:

1. The Problem: The "Over-Correction" Artist

Currently, if you tell Diffusion, "A person holding a phone but not using it," the artist might:

Ignore you: Draw the person using the phone.
Over-correct: Erase the phone entirely (so the person is holding nothing).
Get confused: Draw a person holding a phone, but their hand is in a weird, unnatural position.

The artist doesn't understand that "not using" is a specific state of the phone, not a command to delete the phone.

2. The Solution: The "Bouncer" and the "Traffic Cop"

The authors propose a new method that doesn't require retraining the artist (which would be like teaching a new language to a 10-year-old). Instead, they add a smart filter while the artist is painting.

The Semantic Direction (The Bouncer):
Imagine the artist is walking down a hallway of ideas. When you say "not using the phone," the artist naturally wants to walk toward the "using the phone" idea because those words are in your prompt. The new method identifies that idea and puts a Bouncer in front of it. The Bouncer says, "You can walk near the phone idea (so the phone exists), but you cannot walk into the 'using the phone' idea."
The Projection (The Traffic Cop):
As the artist tries to take a step toward "using the phone," the Traffic Cop gently pushes them back onto a safe path. It's like a rubber band. If the artist tries to stretch too far into the "forbidden" zone, the rubber band snaps them back to the closest safe spot.
- Crucially: This doesn't just delete the phone; it keeps the phone there but stops the action of "using" it.

3. The "Time-Travel" Trick (Scheduling)

One of the smartest parts of this paper is when the Bouncer and Traffic Cop show up.

Early Stage (The Sketch): When the video is just starting to form (like a rough pencil sketch), the rules are loose. The artist is allowed to figure out the general shape of the scene (e.g., "Okay, there's a person and a phone").
Late Stage (The Details): As the video gets clearer and more detailed, the Traffic Cop gets stricter. This ensures the artist doesn't accidentally add a "using phone" gesture while filling in the final details.

This is like baking a cake: You mix the ingredients freely at first (forming the structure), but as the cake rises, you make sure no one sneaks in a bag of salt (the forbidden element).

4. Why This is a Big Deal

The authors tested this on 8 different types of "No" rules, including:

Simple No: "No cars on the highway."
Double No: "A stage that is not unlit" (which means it is lit). Most AI gets this backwards and makes it dark. This method gets it right.
Scope No: "A teacher helping a student who is not paying attention." The AI must know who isn't paying attention (the student), not the teacher.
Action No: "A clock ticking but not moving its hands." (The clock is active, but the hands are frozen).

The Result

The paper shows that this method creates videos where:

The "No" rules are strictly followed.
The video still looks beautiful and realistic (the artist isn't forced to draw a blank wall).
It works on existing AI models without needing to teach them new things.

In summary: This paper teaches AI how to listen to the word "No" without panicking. It gives the AI a set of invisible guardrails that gently steer the creative process away from forbidden ideas, ensuring that when you ask for a world without chaos, you actually get a peaceful world, not a broken one.

1. Problem Statement

Current Vision-Language Models (VLMs) and text-to-video diffusion systems excel at generating complex scenes from affirmative prompts but struggle significantly with linguistic negation (e.g., "no vehicles," "not using," "not unlit").

The Core Issue: Negation is not merely the absence of a concept or a simple semantic inversion. It is a structured linguistic operator involving scope, composition, and logical interaction.
Current Failures: Existing models often violate negation constraints by:
- Generating forbidden content (e.g., showing a person using a phone when the prompt says "not using a phone").
- Misapplying scope (negating the wrong entity).
- Over-correcting to unintended opposites (e.g., interpreting "not unlit" as "pitch black" instead of "lit").
- Failing to maintain temporal consistency in video (forbidden objects appearing later in the trajectory).
Limitations of Prior Work: Previous research focused on representation-level separability (e.g., can the model distinguish embeddings of positive vs. negative captions?). These approaches do not address how negation should constrain the generative process itself, nor do they handle the temporal dynamics of video generation.

2. Methodology

The authors propose a training-free, constraint-based framework that treats linguistic negation as a structured feasibility condition on semantic guidance within diffusion dynamics. Instead of retraining the model, they modify the inference process.

A. Semantic Decomposition

The input prompt $y$ is decomposed into three components via a deterministic linguistic preprocessing stage:

$y^+$ : Affirmed semantic components (what is present).
$y^-$ : The linguistically grounded span subject to negation (what is restricted).
$S$ : Syntactic scope and logical composition structure.

B. Semantic Guidance and Negation Direction

The method utilizes Classifier-Free Guidance (CFG).

Reference Update ( $\delta_{ref}$ ): The standard CFG update that attracts the trajectory toward affirmed semantics ( $y^+$ ).
Negation Direction ( $a_t$ ): A vector derived from the negated component ( $y^-$ ) representing the semantic direction that would increase alignment with the forbidden concept.
- $a_t = \epsilon_{neg} - \epsilon_{uncond}$ , where $\epsilon_{neg}$ is the noise prediction conditioned on $y^-$ .
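The two directions above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: plain lists stand in for noise-prediction tensors, the function name `guidance_directions` is hypothetical, and the form $\delta_{ref} = w(\epsilon_{pos} - \epsilon_{uncond})$ with guidance scale $w$ is the usual CFG convention, assumed here rather than taken from the paper.

```python
def guidance_directions(eps_uncond, eps_pos, eps_neg, scale=7.5):
    """Return (delta_ref, a_t) from three noise predictions.

    eps_uncond: unconditional prediction
    eps_pos:    prediction conditioned on the affirmed part y+
    eps_neg:    prediction conditioned on the negated span y-
    scale:      CFG guidance weight (assumed; not specified in the summary)
    """
    # Reference update: standard CFG pull toward the affirmed semantics.
    delta_ref = [scale * (p - u) for p, u in zip(eps_pos, eps_uncond)]
    # Negation direction: a_t = eps_neg - eps_uncond.
    a_t = [n - u for n, u in zip(eps_neg, eps_uncond)]
    return delta_ref, a_t


# Toy 3-dimensional example.
eps_uncond = [0.0, 0.0, 0.0]
eps_pos = [1.0, 0.5, 0.0]
eps_neg = [0.5, 1.0, 0.0]

delta_ref, a_t = guidance_directions(eps_uncond, eps_pos, eps_neg, scale=2.0)

# How strongly the reference update moves along the forbidden direction.
alignment = sum(a * d for a, d in zip(a_t, delta_ref))
print(delta_ref, a_t, alignment)
```

A positive `alignment` means the unconstrained update would increase similarity to the negated concept, which is the quantity the constraint in the next subsection bounds.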

C. Convex Feasibility Formulation

Negation is enforced by restricting the projection of the semantic update along the negation direction $a_t$ . This is modeled as a half-space constraint in the guidance space:
$\phi_t(\delta_t) = a_t^\top \delta_t \leq b_t$
Where $b_t$ is a time-dependent bound determining the allowable intensity of the negated concept.

D. Minimal-Energy Projection

At each diffusion timestep, the method computes the minimal correction to the reference update $\delta_{ref}$ required to satisfy the constraint. This is solved via a convex optimization problem:
$\delta^*_t = \arg \min_{\delta} \frac{1}{2} \|\delta - \delta_{ref}\|^2_2 \quad \text{s.t.} \quad a_t^\top \delta \leq b_t$
The solution is a closed-form projection:
$\delta^*_t = \delta_{ref} - \lambda_t a_t$
Where $\lambda_t$ is a Lagrange multiplier that is non-zero only if the constraint is violated. This ensures the update is the smallest possible modification to enforce negation, preserving visual fidelity.
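For a single half-space, the multiplier has the standard closed form $\lambda_t = \max\left(0, \frac{a_t^\top \delta_{ref} - b_t}{\|a_t\|_2^2}\right)$, which follows from the KKT conditions of the problem above. A minimal Python sketch of that projection is given below. Plain lists stand in for tensors, and the function name `project_onto_halfspace` and the zero-norm guard are illustrative choices, not details from the paper.

```python
def project_onto_halfspace(delta_ref, a_t, b_t):
    """Project delta_ref onto {delta : a_t . delta <= b_t}.

    Returns (delta_star, lam), where lam is the Lagrange multiplier.
    """
    dot = sum(a * d for a, d in zip(a_t, delta_ref))
    norm_sq = sum(a * a for a in a_t)
    if norm_sq == 0.0:
        # Degenerate direction: nothing to project along (assumed handling).
        return list(delta_ref), 0.0
    # lam is zero when the constraint already holds, positive otherwise.
    lam = max(0.0, (dot - b_t) / norm_sq)
    delta_star = [d - lam * a for d, a in zip(delta_ref, a_t)]
    return delta_star, lam


delta_ref = [2.0, 1.0]
a_t = [1.0, 0.0]

# Constraint violated (a.delta = 2.0 > 0.5): the update is pulled back.
violated, lam_v = project_onto_halfspace(delta_ref, a_t, b_t=0.5)

# Constraint satisfied (2.0 <= 3.0): the update is left untouched.
satisfied, lam_s = project_onto_halfspace(delta_ref, a_t, b_t=3.0)

print(violated, lam_v)
print(satisfied, lam_s)
```

Only the component along `a_t` changes; the component orthogonal to it (here the second coordinate) is preserved, which is what "minimal-energy" means in this setting.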

E. Temporal Scheduling

To prevent interference with early structural formation (which can cause instability), the constraint bound $b_t$ is scheduled over time:

Early timesteps: Loose constraints ( $b_t$ is high) to allow scene structure to form.
Late timesteps: Tight constraints ( $b_t \to 0$ ) to strictly enforce the absence of forbidden concepts.
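The summary does not state the exact schedule, so the following is only one possible instance: a linear decay of $b_t$ from a loose starting value to zero over the denoising steps. The function name `bound_schedule`, the parameters `b_start` and `b_end`, and the convention that step 0 is the first (noisiest) denoising step are all assumptions made for illustration.

```python
def bound_schedule(step, num_steps, b_start=1.0, b_end=0.0):
    """Linearly tighten the bound b_t across denoising steps.

    step:      index of the current denoising step, 0 = first (assumed)
    num_steps: total number of denoising steps
    b_start:   loose bound used early (hypothetical value)
    b_end:     tight bound used late (hypothetical value)
    """
    if num_steps < 2:
        return b_end
    progress = step / (num_steps - 1)  # 0.0 at the start, 1.0 at the end
    return b_start + (b_end - b_start) * progress


bounds = [bound_schedule(s, 5) for s in range(5)]
print(bounds)
```

Other monotone schedules (cosine, step-wise) would fit the same description; the only property the text requires is that the bound is loose early and tight late.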

F. Unified Treatment of Diverse Cases

The framework unifies eight distinct linguistic phenomena under this single convex constraint formulation by parameterizing $(a_t, b_t)$ differently:

Absent Object Consistency (AOC): Strict absence ( $b_t \leq 0$ ).
Late Emergence Negation (LEN): Time-varying bounds to prevent temporal drift.
Implicit Natural-Only Attribute (INA): Constraints on category spaces.
Multi-Negation Composition (MNC): Sequential projections for multiple exclusions.
Structural Functional Negation (SFN): Suppressing actions while preserving objects (e.g., "holding but not using").
Non-Inversion Mitigation (NMI): Bounded moderation (e.g., "not aggressive" $\neq$ "friendly").
Double Negation Sensitivity (DNS): Logical sign composition (e.g., "not unlit").
Scoped Negation Disambiguation (SND): Resolving scope in complex sentences.
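As an illustration of the Multi-Negation Composition case, sequential projection can be sketched as applying one half-space projection per excluded concept. This is a simplified reading of "sequential projections", not the paper's implementation: the helper names are hypothetical, plain lists stand in for tensors, and each constraint is represented as an assumed `(a, b)` pair.

```python
def project(delta, a, b):
    """Project delta onto the half-space {x : a . x <= b}."""
    dot = sum(ai * di for ai, di in zip(a, delta))
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        return list(delta)
    lam = max(0.0, (dot - b) / norm_sq)
    return [di - lam * ai for di, ai in zip(delta, a)]


def project_sequential(delta_ref, constraints):
    """Apply one projection per (a, b) constraint, in order."""
    delta = list(delta_ref)
    for a, b in constraints:
        delta = project(delta, a, b)
    return delta


# Two exclusions with orthogonal negation directions.
constraints = [
    ([1.0, 0.0], 0.0),  # first forbidden concept, strict absence
    ([0.0, 1.0], 0.5),  # second forbidden concept, bounded
]
result = project_sequential([2.0, 1.0], constraints)
print(result)
```

With orthogonal directions, as in this toy example, a single pass satisfies both constraints. When the directions are not orthogonal, a later projection can push the update back across an earlier boundary, so repeated passes or a joint solve may be needed; the summary does not say how the paper handles that case.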

3. Key Contributions

Formal Modeling: The first unified formulation of linguistic negation in VLMs as a structured convex feasibility constraint in semantic guidance space, moving beyond representation-level evaluation.
Training-Free Enforcement: A principled, training-free mechanism that enforces negation via minimal-energy projection, requiring no architectural changes or retraining of the diffusion backbone.
Structured Benchmark: Introduction of a negation-centric benchmark suite with eight categories designed to isolate specific linguistic failure modes (object absence, scope, double negation, etc.) in generative trajectories, specifically for video.

4. Results

The method was evaluated against state-of-the-art baselines (Mochi, HunyuanVideo, CogVideoX) on the proposed benchmark.

Quantitative Performance:
- Negation Compliance: Achieved the highest Negation Compliance Score (NCS) (4.07 vs. ~3.5 for baselines) and lowest Negation Violation Rate (NVR) (0.23 vs. ~0.36).
- Visual Fidelity: Maintained or improved CLIPScore (global alignment) while significantly reducing similarity to negated concepts (CLIP-neg) and object detection confidence for forbidden items (DINO-conf).
- Ablation: Removing the repulsive energy term reverted negation performance to baseline levels. Removing temporal scheduling degraded global alignment and caused structural instability.
Qualitative Performance:
- Successfully handled Structural Functional Negation (e.g., a person holding a phone but not using it), where baselines either removed the phone or showed the person using it.
- Correctly resolved Double Negation (e.g., "not unlit" $\to$ lit stage), whereas baselines often generated dark scenes.
- Maintained Temporal Consistency, preventing forbidden objects from appearing in later video frames.
User Study:
- In a study with 50 participants, the proposed method was preferred 77.5% of the time over baselines.
- It received the highest scores across all criteria: Negation Satisfaction, Constraint Meaning Accuracy, Scene Alignment, and Artifact Avoidance.

5. Significance

Paradigm Shift: This work shifts the focus from "learning to say no" (retraining embeddings) to "constraining the generation process" (geometric control). It bridges formal semantics and neural generative modeling.
Video Applicability: Unlike static image methods, this approach naturally extends to temporally evolving video trajectories, addressing the critical issue of temporal hallucination of forbidden concepts.
Generalizability: The framework is compatible with any pretrained diffusion backbone and suggests a path toward Vision-Language-Action (VLA) systems where language constrains dynamic behaviors, not just static content.
Efficiency: By avoiding retraining and using closed-form projections, it offers a computationally efficient way to improve logical faithfulness in generative AI, a crucial step for safety and reliability.