NEGATE: Constrained Semantic Guidance for Linguistic Negation in Text-to-Video Diffusion

This paper introduces NEGATE, a training-free framework that models linguistic negation in text-to-video diffusion as a structured feasibility constraint, enabling robust and coherent generation of negated concepts by projecting semantic updates onto a convex set derived from linguistic structure without retraining the underlying models.

Taewon Kang, Ming C. Lin

Published 2026-03-09

Imagine you have a magical artist named Diffusion. This artist is incredibly talented at painting pictures and making videos based on your descriptions. If you say, "Draw a sunny beach with a dog," Diffusion will happily create a perfect sunny beach with a happy dog.

But Diffusion has a major blind spot: Negation.

If you say, "Draw a sunny beach with no people," Diffusion gets confused. It often ignores the "no" and paints a crowded beach anyway, or it gets so scared of the word "no" that it erases the whole beach, leaving you with a blank white canvas. It treats "no people" as if you just forgot to mention people, rather than a strict rule.

This paper introduces a new way to talk to Diffusion, called Constrained Semantic Guidance. Think of it as giving the artist a traffic cop and a bouncer to help them follow your "no" rules.

Here is how it works, broken down into simple analogies:

1. The Problem: The "Over-Correction" Artist

Currently, if you tell Diffusion, "A person holding a phone but not using it," the artist might:

  • Ignore you: Draw the person using the phone.
  • Over-correct: Erase the phone entirely (so the person is holding nothing).
  • Get confused: Draw a person holding a phone, but their hand is in a weird, unnatural position.

The artist doesn't understand that "not using" is a specific state of the phone, not a command to delete the phone.

2. The Solution: The "Bouncer" and the "Traffic Cop"

The authors propose a new method that doesn't require retraining the artist (which would be slow and expensive, like sending the artist back to art school). Instead, they add a smart filter that works while the artist is painting.

  • The Semantic Direction (The Bouncer):
    Imagine the artist is walking down a hallway of ideas. When you say "holding a phone but not using it," the artist naturally wants to walk toward the "using the phone" idea because those words are in your prompt. The new method identifies that idea and puts a Bouncer in front of it. The Bouncer says, "You can walk near the phone idea (so the phone exists), but you cannot walk into the 'using the phone' idea."

  • The Projection (The Traffic Cop):
    As the artist tries to take a step toward "using the phone," the Traffic Cop gently pushes them back onto a safe path. It's like a rubber band. If the artist tries to stretch too far into the "forbidden" zone, the rubber band snaps them back to the closest safe spot.

    • Crucially: This doesn't just delete the phone; it keeps the phone there but stops the action of "using" it.
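
For readers who like to see the rubber band in code, here is a minimal sketch of the projection idea. It assumes the simplest possible convex "safe zone": a half-space that limits how far an update may point along one forbidden direction. The function name, the toy two-number vectors, and the threshold tau are all illustrative assumptions; the summary above does not say how the paper actually builds its constraint set.

```python
import math

def project_onto_halfspace(update, direction, tau=0.0):
    """Closest point to `update` whose overlap with `direction` is at most tau.

    Hypothetical helper: the safe set is assumed to be the half-space
    {u : <u, d_hat> <= tau}, where d_hat is `direction` scaled to unit length.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return list(update)  # no forbidden direction, nothing to constrain
    d_hat = [d / norm for d in direction]
    # How far the update points into the forbidden idea.
    overlap = sum(u * d for u, d in zip(update, d_hat))
    # Only the part beyond the threshold is pulled back (the rubber band).
    excess = max(0.0, overlap - tau)
    return [u - excess * d for u, d in zip(update, d_hat)]

# Toy example: axis 0 = "phone is present", axis 1 = "phone is being used".
update = [1.0, 0.8]      # the artist's next step
forbidden = [0.0, 1.0]   # assumed direction of the negated concept
safe = project_onto_halfspace(update, forbidden)
print(safe)  # [1.0, 0.0]
```

In this toy case the "phone is present" part of the step survives untouched and only the "being used" part is removed, which is the behavior the Traffic Cop analogy describes. A step that already lies in the safe zone is returned unchanged.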

3. The "Time-Travel" Trick (Scheduling)

One of the smartest parts of this paper is the timing: when the Bouncer and Traffic Cop show up, and how strict they are at each stage.

  • Early Stage (The Sketch): When the video is just starting to form (like a rough pencil sketch), the rules are loose. The artist is allowed to figure out the general shape of the scene (e.g., "Okay, there's a person and a phone").
  • Late Stage (The Details): As the video gets clearer and more detailed, the Traffic Cop gets stricter. This ensures the artist doesn't accidentally add a "using phone" gesture during the final refinement steps.

This is like baking a cake: You mix the ingredients freely at first (forming the structure), but as the cake rises, you make sure no one sneaks in a bag of salt (the forbidden element).
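
The loose-then-strict timing can be sketched as a simple schedule. This is only an illustration under stated assumptions: the linear shape, the function name, and the values tau_loose and tau_strict are hypothetical, and the summary does not specify which schedule the authors use. Here the threshold stands for how much overlap with a forbidden idea is tolerated at a given generation step.

```python
def constraint_threshold(step, num_steps, tau_loose=1.0, tau_strict=0.0):
    """Allowed overlap with the forbidden idea at a given generation step.

    Assumed linear schedule: starts at tau_loose (early sketch stage)
    and tightens to tau_strict (final detail stage).
    """
    if num_steps < 2:
        return tau_strict
    progress = step / (num_steps - 1)  # 0.0 at the first step, 1.0 at the last
    return tau_loose + (tau_strict - tau_loose) * progress

# Five generation steps, from rough sketch to final details.
thresholds = [constraint_threshold(s, 5) for s in range(5)]
print(thresholds)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

A large threshold early on leaves the artist free to lay out the scene, while a threshold near zero at the end leaves no room for the forbidden gesture to slip in.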

4. Why This is a Big Deal

The authors tested this on 8 different types of "No" rules, including:

  • Simple No: "No cars on the highway."
  • Double No: "A stage that is not unlit" (which means it is lit). Most AI models get this backwards and make the stage dark. This method gets it right.
  • Scope No: "A teacher helping a student who is not paying attention." The AI must know who isn't paying attention (the student), not the teacher.
  • Action No: "A clock ticking but not moving its hands." (The clock is active, but the hands are frozen).

The Result

The paper shows that this method creates videos where:

  1. The "No" rules are strictly followed.
  2. The video still looks beautiful and realistic (the artist isn't forced to draw a blank wall).
  3. It works on existing AI models without needing to teach them new things.

In summary: This paper teaches AI how to listen to the word "No" without panicking. It gives the AI a set of invisible guardrails that gently steer the creative process away from forbidden ideas, ensuring that when you ask for a world without chaos, you actually get a peaceful world, not a broken one.