Conditioned Activation Transport for T2I Safety Steering

To address the trade-off between safety and image quality in text-to-image (T2I) models, the authors propose Conditioned Activation Transport (CAT), a framework that uses a contrastive dataset and geometry-based nonlinear transport maps to steer unsafe activations without degrading benign generations.

Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński, Tomasz Trzciński, Franziska Boenisch, Adam Dziedzic

Published 2026-03-04

Imagine you have a magical artist named AI. This artist is incredibly talented; they can paint any picture you describe, from a "sunny beach" to a "futuristic city." But there's a problem: sometimes, if you ask for something slightly risky (like "a party with a little too much chaos"), the AI gets carried away and paints something dangerous, violent, or inappropriate.

For a long time, people tried to stop this by either:

  1. Training the artist to forget bad things (which is hard and sometimes the artist remembers anyway).
  2. Putting a filter on the final picture (like a censor bar), which is easy to bypass.
  3. Giving the artist a global "nudge" to be safer. This is like telling the artist, "Hey, be nice!" for every single picture. The problem? This often makes the artist clumsy. A request for a "peaceful picnic" might end up looking like a blurry mess because the artist is so worried about being safe that they ruin the good stuff.

This paper introduces a new, smarter way to guide the artist called Conditioned Activation Transport (CAT). Here is how it works, using some simple analogies:

1. The Problem: The "Brute Force" Nudge

Imagine the AI's brain is a giant control room with thousands of dials. Old methods tried to turn all the dials in the "Safety" direction at once, no matter what picture was being made.

  • The Result: If you asked for a "scary monster," the nudge worked. But if you asked for a "cute puppy," the nudge still turned the dials, making the puppy look weird, distorted, or broken. It was like slamming on the brakes at full force for every car on the road, even one rolling slowly through a parking lot.
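In code, this "brute force" nudge corresponds to classic linear activation steering: shift every activation by the same fixed amount along one "safety" direction, risky prompt or not. Here is a minimal toy sketch (all names, sizes, and numbers are illustrative, not the paper's code):

```python
import numpy as np

def linear_steer(activation, safety_direction, strength=2.0):
    """Global linear nudge: shift EVERY activation by the same fixed
    amount along one 'safety' direction, risky prompt or not."""
    return activation - strength * safety_direction

rng = np.random.default_rng(0)
safety_direction = rng.normal(size=8)
safety_direction /= np.linalg.norm(safety_direction)  # unit vector

benign = rng.normal(size=8)  # stand-in activation for "cute puppy"
steered = linear_steer(benign, safety_direction)

# The benign activation is displaced by exactly `strength`, the same
# as any unsafe activation would be - hence the distorted puppies.
displacement = np.linalg.norm(steered - benign)
```

Because the shift is unconditional, harmless prompts pay the same cost as harmful ones, which is exactly the quality loss CAT is designed to avoid.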

2. The New Solution: The "Smart Traffic Cop" (CAT)

The authors realized that safety isn't a straight line; it's a complex, winding shape. You can't just push everything in one direction. They built a system with two main parts:

Part A: The "Shape-Shifter" (Non-Linear Transport)

Think of the AI's brain as a landscape.

  • Safe ideas are like a smooth, green meadow.
  • Unsafe ideas are like a jagged, rocky canyon.
  • Old methods tried to push the canyon flat with a giant bulldozer (a straight line). This crushed the rocks but also flattened the meadow, ruining the view.
  • CAT's "Shape-Shifter" is like a magical sculptor. It looks at the jagged canyon and gently molds only the rocks into the shape of the meadow. It doesn't touch the meadow at all. It understands that the "bad" ideas have a weird, twisted shape that needs a complex, curved path to fix, not a straight shove.
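The "Shape-Shifter" idea can be illustrated with a toy geometry: treat the "safe meadow" as a manifold (here, a unit circle) and define a nonlinear map that moves off-manifold points onto it along a point-dependent path, while leaving on-manifold points exactly where they are. This is a stand-in for the paper's transport maps, not their actual construction:

```python
import numpy as np

def transport_to_manifold(x, radius=1.0):
    """Toy nonlinear transport: project a point radially onto a circle
    of the given radius. Unlike a single straight-line shift, the move
    depends on where the point is - and points already on the circle
    (the 'meadow') are not touched at all."""
    norm = np.linalg.norm(x)
    if np.isclose(norm, radius):
        return x
    return (radius / norm) * x

safe_point = np.array([0.6, 0.8])    # already on the unit circle
unsafe_point = np.array([3.0, 4.0])  # out in the 'rocky canyon'
```

Note how `transport_to_manifold(safe_point)` returns the input unchanged, while `unsafe_point` is pulled onto the circle: the map sculpts only the "rocks" and never flattens the "meadow."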

Part B: The "Smart Gate" (Conditioned Activation)

This is the most important part. The system doesn't just sculpt; it waits for the right moment.

  • Imagine a bouncer at a club.
  • Old methods were like a bouncer who stops everyone from entering, even the people just there for a birthday party.
  • CAT's bouncer is very observant. It checks your ID (the picture idea).
    • If you are asking for a "peaceful picnic," the bouncer says, "You're safe! Go right in," and does nothing. The picture comes out perfect.
    • If you are asking for something that looks like it's heading toward the "rocky canyon" (unsafe), the bouncer says, "Hold on," and then activates the Shape-Shifter to gently guide you back to the meadow.
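Putting the bouncer in code: a hedged sketch of conditional steering, where a detector scores each activation and the transport map is applied only above a threshold. The direction, scores, and transport here are all toy stand-ins, not the paper's learned components:

```python
import numpy as np

unsafe_dir = np.array([1.0, 0.0])  # hypothetical learned 'unsafe' direction

def unsafe_score(activation):
    """Toy detector: how strongly the activation points 'unsafe-ward'."""
    return float(activation @ unsafe_dir)

def remove_unsafe(activation):
    """Stand-in transport map: strip out the unsafe component."""
    return activation - unsafe_score(activation) * unsafe_dir

def conditioned_steer(activation, threshold=0.5):
    """The 'smart gate': benign activations pass through untouched;
    only flagged ones are handed to the transport map."""
    if unsafe_score(activation) > threshold:
        return remove_unsafe(activation)
    return activation

picnic = np.array([0.1, 0.9])  # below threshold -> left alone
riot = np.array([2.0, 0.3])    # above threshold -> steered
```

The key design point is the `if`: the "peaceful picnic" activation passes through bit-for-bit identical, so benign quality cannot degrade, while the "riot" activation gets its unsafe component removed.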

3. The New Map (SafeSteerDataset)

To teach this system how to tell the difference between a "peaceful picnic" and a "dangerous riot," the authors created a special training map called SafeSteerDataset.

  • They didn't just grab random bad pictures. They created pairs of prompts that are almost identical, like twins.
    • Safe: "A person walking a dog in the park."
    • Unsafe: "A person walking a dog in the park, but the dog is biting a child."
  • Because the two prompts are so similar, the AI can learn exactly which tiny dial needs to be turned to stop the bite without changing the fact that it's a dog in a park. This is like finding the exact needle in a haystack that causes the problem, rather than throwing away the whole haystack.
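The twin-prompt idea can be sketched as follows: embed each safe/unsafe pair and average the differences, so that the shared content cancels and only the "unsafe" component survives. The encoder below is a deterministic toy, not the model's real text encoder, and the second prompt pair is an illustrative invention in the same style:

```python
import numpy as np

# Paired prompts in the style of SafeSteerDataset (illustrative):
pairs = [
    ("a person walking a dog in the park",
     "a person walking a dog in the park, but the dog is biting a child"),
    ("a crowded street festival at night",
     "a crowded street festival at night turning into a violent riot"),
]

def embed(prompt, dim=16):
    """Deterministic toy stand-in for the model's text encoder."""
    rng = np.random.default_rng(sum(prompt.encode()))
    return rng.normal(size=dim)

# Because each pair differs only in the unsafe detail, subtracting
# embeddings cancels the shared content and isolates that detail.
diffs = np.stack([embed(unsafe) - embed(safe) for safe, unsafe in pairs])
unsafe_direction = diffs.mean(axis=0)
unsafe_direction /= np.linalg.norm(unsafe_direction)
```

With real model activations in place of the toy encoder, this kind of contrastive differencing is what lets the system find "the exact needle in the haystack" rather than a direction polluted by parks, dogs, and festivals.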

The Result: Better Safety, No Broken Art

When they tested this on two of the most advanced AI art models (Z-Image and Infinity), the results were impressive:

  • Safety: It successfully stopped the AI from making violent or toxic images (like swastikas, gore, or nudity) even when people tried to trick it.
  • Quality: Unlike the old "brute force" methods, the beautiful, safe pictures (like sunsets, puppies, and landscapes) remained high-quality and clear. The "peaceful picnic" didn't turn into a blurry mess.

In a Nutshell

The paper says: "Don't stop the artist from painting; just give them a smart guide who only steps in when they are about to paint something dangerous, and knows exactly how to fix the brushstrokes without ruining the rest of the canvas."

This makes AI art safer for everyone without sacrificing the creativity and beauty we love.