Conditioned Activation Transport for T2I Safety Steering

To address the trade-off between safety and image quality in text-to-image (T2I) models, the authors propose Conditioned Activation Transport (CAT), a framework that uses a contrastive dataset and geometry-based nonlinear transport maps to steer unsafe activations without degrading benign generations.

Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński, Tomasz Trzciński, Franziska Boenisch, Adam Dziedzic

Published 2026-03-04

Imagine you have a magical artist named AI. This artist is incredibly talented; they can paint any picture you describe, from a "sunny beach" to a "futuristic city." But there's a problem: sometimes, if you ask for something slightly risky (like "a party with a little too much chaos"), the AI gets carried away and paints something dangerous, violent, or inappropriate.

For a long time, people tried to stop this by either:

  1. Training the artist to forget bad things (which is hard and sometimes the artist remembers anyway).
  2. Putting a filter on the final picture (like a censor bar), which is easy to bypass.
  3. Giving the artist a global "nudge" to be safer. This is like telling the artist, "Hey, be nice!" for every single picture. The problem? This often makes the artist clumsy. A request for a "peaceful picnic" might end up looking like a blurry mess because the artist is so worried about being safe that they ruin the good stuff.

This paper introduces a new, smarter way to guide the artist called Conditioned Activation Transport (CAT). Here is how it works, using some simple analogies:

1. The Problem: The "Brute Force" Nudge

Imagine the AI's brain is a giant control room with thousands of dials. Old methods tried to turn all the dials in the "Safety" direction at once, no matter what picture was being made.

  • The Result: If you asked for a "scary monster," the nudge worked. But if you asked for a "cute puppy," the nudge still turned the dials, making the puppy look weird, distorted, or broken. It was like slamming on the brakes at full force for every car on the road, even one rolling slowly through a parking lot.
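In code, this "brute force" nudge corresponds to classic linear activation steering: shift every activation by the same fixed amount along one "safety" direction, risky prompt or not. Here is a minimal toy sketch (all names, sizes, and numbers are illustrative, not the paper's code):

```python
import numpy as np

def linear_steer(activation, safety_direction, strength=2.0):
    """Global linear nudge: shift EVERY activation by the same fixed
    amount along one 'safety' direction, risky prompt or not."""
    return activation - strength * safety_direction

rng = np.random.default_rng(0)
safety_direction = rng.normal(size=8)
safety_direction /= np.linalg.norm(safety_direction)  # unit vector

benign = rng.normal(size=8)  # stand-in activation for "cute puppy"
steered = linear_steer(benign, safety_direction)

# The benign activation is displaced by exactly `strength`, the same
# as any unsafe activation would be - hence the distorted puppies.
displacement = np.linalg.norm(steered - benign)
```

Because the shift is unconditional, harmless prompts pay the same cost as harmful ones, which is exactly the quality loss CAT is designed to avoid.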

2. The New Solution: The "Smart Traffic Cop" (CAT)

The authors realized that safety isn't a straight line; it's a complex, winding shape. You can't just push everything in one direction. They built a system with two main parts:

Part A: The "Shape-Shifter" (Non-Linear Transport)

Think of the AI's brain as a landscape.

  • Safe ideas are like a smooth, green meadow.
  • Unsafe ideas are like a jagged, rocky canyon.
  • Old methods tried to push the canyon flat with a giant bulldozer (a straight line). This crushed the rocks but also flattened the meadow, ruining the view.
  • CAT's "Shape-Shifter" is like a magical sculptor. It looks at the jagged canyon and gently molds only the rocks into the shape of the meadow. It doesn't touch the meadow at all. It understands that the "bad" ideas have a weird, twisted shape that needs a complex, curved path to fix, not a straight shove.
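The "Shape-Shifter" idea can be illustrated with a toy geometry: treat the "safe meadow" as a manifold (here, a unit circle) and define a nonlinear map that moves off-manifold points onto it along a point-dependent path, while leaving on-manifold points exactly where they are. This is a stand-in for the paper's transport maps, not their actual construction:

```python
import numpy as np

def transport_to_manifold(x, radius=1.0):
    """Toy nonlinear transport: project a point radially onto a circle
    of the given radius. Unlike a single straight-line shift, the move
    depends on where the point is - and points already on the circle
    (the 'meadow') are not touched at all."""
    norm = np.linalg.norm(x)
    if np.isclose(norm, radius):
        return x
    return (radius / norm) * x

safe_point = np.array([0.6, 0.8])    # already on the unit circle
unsafe_point = np.array([3.0, 4.0])  # out in the 'rocky canyon'
```

Note how `transport_to_manifold(safe_point)` returns the input unchanged, while `unsafe_point` is pulled onto the circle: the map sculpts only the "rocks" and never flattens the "meadow."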

Part B: The "Smart Gate" (Conditioned Activation)

This is the most important part. The system doesn't just sculpt; it waits for the right moment.

  • Imagine a bouncer at a club.
  • Old methods were like a bouncer who stops everyone from entering, even the people just there for a birthday party.
  • CAT's bouncer is very observant. It checks your ID (the picture idea).
    • If you are asking for a "peaceful picnic," the bouncer says, "You're safe! Go right in," and does nothing. The picture comes out perfect.
    • If you are asking for something that looks like it's heading toward the "rocky canyon" (unsafe), the bouncer says, "Hold on," and then activates the Shape-Shifter to gently guide you back to the meadow.
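Putting the bouncer in code: a hedged sketch of conditional steering, where a detector scores each activation and the transport map is applied only above a threshold. The direction, scores, and transport here are all toy stand-ins, not the paper's learned components:

```python
import numpy as np

unsafe_dir = np.array([1.0, 0.0])  # hypothetical learned 'unsafe' direction

def unsafe_score(activation):
    """Toy detector: how strongly the activation points 'unsafe-ward'."""
    return float(activation @ unsafe_dir)

def remove_unsafe(activation):
    """Stand-in transport map: strip out the unsafe component."""
    return activation - unsafe_score(activation) * unsafe_dir

def conditioned_steer(activation, threshold=0.5):
    """The 'smart gate': benign activations pass through untouched;
    only flagged ones are handed to the transport map."""
    if unsafe_score(activation) > threshold:
        return remove_unsafe(activation)
    return activation

picnic = np.array([0.1, 0.9])  # below threshold -> left alone
riot = np.array([2.0, 0.3])    # above threshold -> steered
```

The key design point is the `if`: the "peaceful picnic" activation passes through bit-for-bit identical, so benign quality cannot degrade, while the "riot" activation gets its unsafe component removed.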

3. The New Map (SafeSteerDataset)

To teach this system how to tell the difference between a "peaceful picnic" and a "dangerous riot," the authors created a special training map called SafeSteerDataset.

  • They didn't just grab random bad pictures. They created pairs of prompts that are almost identical, like twins.
    • Safe: "A person walking a dog in the park."
    • Unsafe: "A person walking a dog in the park, but the dog is biting a child."
  • Because the two prompts are so similar, the AI can learn exactly which tiny dial needs to be turned to stop the bite without changing the fact that it's a dog in a park. This is like finding the exact needle in a haystack that causes the problem, rather than throwing away the whole haystack.
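The twin-prompt idea can be sketched as follows: embed each safe/unsafe pair and average the differences, so that the shared content cancels and only the "unsafe" component survives. The encoder below is a deterministic toy, not the model's real text encoder, and the second prompt pair is an illustrative invention in the same style:

```python
import numpy as np

# Paired prompts in the style of SafeSteerDataset (illustrative):
pairs = [
    ("a person walking a dog in the park",
     "a person walking a dog in the park, but the dog is biting a child"),
    ("a crowded street festival at night",
     "a crowded street festival at night turning into a violent riot"),
]

def embed(prompt, dim=16):
    """Deterministic toy stand-in for the model's text encoder."""
    rng = np.random.default_rng(sum(prompt.encode()))
    return rng.normal(size=dim)

# Because each pair differs only in the unsafe detail, subtracting
# embeddings cancels the shared content and isolates that detail.
diffs = np.stack([embed(unsafe) - embed(safe) for safe, unsafe in pairs])
unsafe_direction = diffs.mean(axis=0)
unsafe_direction /= np.linalg.norm(unsafe_direction)
```

With real model activations in place of the toy encoder, this kind of contrastive differencing is what lets the system find "the exact needle in the haystack" rather than a direction polluted by parks, dogs, and festivals.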

The Result: Better Safety, No Broken Art

When they tested this on two of the most advanced AI art models (Z-Image and Infinity), the results were impressive:

  • Safety: It successfully stopped the AI from making violent or toxic images (like swastikas, gore, or nudity) even when people tried to trick it.
  • Quality: Unlike the old "brute force" methods, the beautiful, safe pictures (like sunsets, puppies, and landscapes) remained high-quality and clear. The "peaceful picnic" didn't turn into a blurry mess.

In a Nutshell

The paper says: "Don't stop the artist from painting; just give them a smart guide who only steps in when they are about to paint something dangerous, and knows exactly how to fix the brushstrokes without ruining the rest of the canvas."

This makes AI art safer for everyone without sacrificing the creativity and beauty we love.