CleanStyle: Plug-and-Play Style Conditioning Purification for Text-to-Image Stylization

Imagine you are an artist trying to paint a picture based on a friend's description, but you have to use a specific reference photo to get the style right (e.g., "paint this like a Van Gogh").

In the world of AI image generation, this is called Style Transfer. You tell the AI, "Draw a cat running," and you show it a picture of a Van Gogh painting so the AI knows to use those swirling, thick brushstrokes.

However, current AI models have a messy habit called "Content Leakage."

The Problem: The "Over-Enthusiastic" Assistant

Think of the AI as an assistant who is too eager to please. When you show it the Van Gogh reference to get the style, the assistant doesn't just copy the brushstrokes; it accidentally copies the objects in the reference too.

Your Prompt: "A cat running."
Your Reference: A Van Gogh painting of a sunflower field.
The Result: The AI draws a cat running, but it also puts giant sunflowers growing out of the cat's fur or turns the background into a sunflower field, even though you never asked for that.

The AI is confused. It thinks, "Oh, you want a Van Gogh style? I'll give you everything from that Van Gogh picture!" This ruins the specific image you asked for.

The Solution: CleanStyle

The paper introduces CleanStyle, a new "plug-and-play" tool that acts like a smart filter for the AI. It doesn't require retraining the AI (which is like teaching a dog new tricks from scratch); instead, it just cleans up the instructions the AI receives while it's drawing.

Here is how it works, using two simple analogies:

1. The "SVD Filter" (Cleaning the Signal)

The AI looks at the reference image and turns it into a list of numbers (an "embedding"). The authors discovered that this list has two parts:

The Main Part (The Head): This contains the "vibe" or the style (the swirling colors, the texture).
The Tail Part (The Tail): This contains the specific details of the objects in the reference (the sunflowers, the specific faces).

The Analogy: Imagine the reference image is a radio broadcast. The "Main Part" is the music (the style), and the "Tail Part" is the DJ talking about his lunch (the specific content).

Old Method: The AI listens to the whole broadcast and tries to paint both the music and the DJ's lunch.
CleanStyle (CS-SVD): It uses a mathematical technique called SVD (Singular Value Decomposition) to act like noise-canceling headphones. It isolates the "Tail" (the lunch talk) and turns it down, while keeping the "Main" (the music) loud and clear.
The Twist: It doesn't just mute the tail forever. It mutes it hard at the beginning of the drawing process (when the AI is sketching the outline) so the AI doesn't get confused about what to draw. As the drawing gets more detailed, it lets a little bit of the tail back in, just enough to keep the texture rich without bringing back the unwanted objects.

2. The "Negative Guide" (Teaching by Example of What Not to Do)

Standard AI tools use a "Negative Prompt" (a way to tell the AI what to avoid) that is usually just a blank, empty signal (like a zero vector). It's like telling a student, "Don't draw anything weird," without showing them what "weird" looks like.

CleanStyle (SS-CFG) changes the game.

The Analogy: Instead of saying "Don't draw weird stuff," CleanStyle takes the "Tail" part it just muted (the sunflowers, the lunch talk) and says to the AI: "Here is exactly what the 'weird stuff' looks like. Do the opposite of this."
By showing the AI the specific "bad" content it wants to avoid, the AI can actively push those elements away. It's like a teacher pointing at a messy drawing and saying, "Don't do that," which is much more effective than just saying "Be neat."

Why This Matters

It's Plug-and-Play: You can add this tool to existing AI art generators (like InstantStyle or DEADiff) without needing to retrain the whole model. It's like putting a new filter on a camera lens.
It's Fast: It doesn't slow down the generation process much.
It's Accurate: The result is an image that closely matches the style you wanted (Van Gogh, watercolor, cyberpunk) and features what you asked for (a cat, a car, a house), with far less accidental "leakage" of random objects from the reference photo.

In short: CleanStyle teaches the AI to listen to the music of the style reference without getting distracted by the lyrics (the specific objects) in the background.

1. Problem Statement

Content Leakage in Encoder-Based Style Transfer:
Recent text-to-image (T2I) diffusion models utilize encoder-based methods (e.g., InstantStyle, IP-Adapter) to inject the visual style of a reference image into the generation process. These methods extract style embeddings using pre-trained image encoders and inject them into the cross-attention layers of the U-Net.

The Core Issue: These methods suffer from content leakage, where semantic elements from the style reference image (e.g., specific objects, backgrounds, or textures unrelated to the prompt) undesirably appear in the output.
Consequence: This impairs prompt fidelity (the generated image does not match the text description) and reduces stylistic consistency.
Limitation of Existing Solutions: Current approaches often rely on model fine-tuning, inversion, or complex disentanglement training, which are computationally expensive and lack flexibility. Furthermore, standard Classifier-Free Guidance (CFG) uses generic negative inputs (e.g., zero vectors), which fail to specifically target and suppress style-specific content artifacts.

2. Methodology

The authors propose CleanStyle, a training-free, plug-and-play framework that purifies style embeddings using Singular Value Decomposition (SVD) and a novel guidance mechanism.

A. Empirical Insight: SVD Analysis

The authors analyze style embeddings via SVD and observe a structural separation:

Dominant Components (Head): Encode global, high-variance stylistic attributes (colors, brushstrokes, overall texture).
Tail Components: Encode low-variance, localized, and content-specific artifacts (specific objects or semantic details from the reference image).
Hypothesis: Filtering out the "tail" components of the singular spectrum removes content leakage while preserving style.
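
The head/tail split described above can be illustrated with a small NumPy sketch. This is not the authors' analysis code: the random matrix stands in for real style tokens, and the function name `split_head_tail` and the cutoff `k=4` are assumptions chosen for illustration.

```python
import numpy as np

def split_head_tail(embedding, k):
    """Split a matrix into a rank-k 'head' and the residual 'tail' via SVD."""
    u, s, vt = np.linalg.svd(embedding, full_matrices=False)
    head = (u[:, :k] * s[:k]) @ vt[:k, :]   # top-k singular directions
    tail = (u[:, k:] * s[k:]) @ vt[k:, :]   # everything after the top-k
    return head, tail, s

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 64))  # stand-in for 16 style tokens of width 64 (assumed shape)
head, tail, s = split_head_tail(tokens, k=4)

# Fraction of squared singular-value mass ("energy") carried by the head.
head_energy = float(np.sum(s[:4] ** 2) / np.sum(s ** 2))
```

The two parts add back up to the original matrix, so the hypothesis amounts to a claim about which of the two summands carries style and which carries content.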

B. Core Module 1: CleanStyleSVD (CS-SVD)

This module filters the style embeddings injected into the cross-attention layers (Key $K$ and Value $V$ matrices).

Decomposition: Apply SVD to the projected Key/Value matrices: $K = U\Sigma V^\top$.
Tail Suppression: Attenuate singular values beyond the top-$k$ components using an exponential decay function:
$\sigma'_i = \begin{cases} \sigma_i & \text{if } i \le k \\ e^{-\alpha \sigma_i} \cdot \sigma_i & \text{otherwise} \end{cases}$
Time-Aware Schedule: To balance structure formation and detail preservation, the suppression strength $\alpha_t$ is modulated by a sigmoid function of the denoising timestep $t$:
- Early Steps: Strong suppression to establish a clean global layout and prevent content leakage.
- Late Steps: Progressive relaxation to preserve fine-grained stylistic details (e.g., brush strokes).
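
A minimal sketch of the tail suppression and its schedule, under stated assumptions. The attenuation follows the piecewise formula above; the exact sigmoid is not given in this summary, so `alpha_schedule` and its parameters (`alpha_max`, `steepness`, `midpoint`) are hypothetical, as is the convention that `progress` runs from 0.0 at the first denoising step to 1.0 at the last.

```python
import math
import numpy as np

def alpha_schedule(progress, alpha_max=1.0, steepness=10.0, midpoint=0.5):
    """Hypothetical sigmoid schedule: strong suppression early, relaxed late."""
    return alpha_max / (1.0 + math.exp(steepness * (progress - midpoint)))

def cs_svd_filter(matrix, k, alpha):
    """Keep the top-k singular values; scale the rest by exp(-alpha * sigma_i)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_new = s.copy()
    s_new[k:] = np.exp(-alpha * s[k:]) * s[k:]
    return (u * s_new) @ vt

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 64))  # stand-in for a projected Key matrix (assumed shape)

early = cs_svd_filter(K, k=4, alpha=alpha_schedule(0.0))  # first step: tail nearly muted
late = cs_svd_filter(K, k=4, alpha=alpha_schedule(1.0))   # last step: tail mostly restored
```

As the formula is written, the decay factor depends on $\sigma_i$ itself, so larger tail singular values are damped proportionally more than smaller ones.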

C. Core Module 2: Style-Specific Classifier-Free Guidance (SS-CFG)

Standard CFG contrasts a conditional input against a generic unconditional input (e.g., zero vectors). CleanStyle replaces the generic negative input with a style-aware negative signal.

Conditional Branch ( $\epsilon_{cond}$ ): Uses the filtered embeddings from CS-SVD (dominant components kept, tail attenuated).
Unconditional Branch ( $\epsilon_{uncond}$ ): Uses the isolated tail components (the suppressed content-related signals) as the negative input.
Mechanism: This creates a precise contrastive objective. The model is guided to adhere to the "clean" style while actively diverging from the "content-contaminated" signal, effectively suppressing unwanted visual patterns.
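
The guidance step can be sketched as follows. The combination rule is the standard classifier-free guidance form, $\hat{\epsilon} = \epsilon_{uncond} + w\,(\epsilon_{cond} - \epsilon_{uncond})$. How the tail signal is constructed for the negative branch is an assumption here (the part of the matrix that the filter removed), and `toy_denoiser` is a hypothetical stand-in for the diffusion U-Net.

```python
import numpy as np

def cfg_combine(eps_cond, eps_neg, guidance_scale):
    """Classifier-free guidance: move away from eps_neg, toward eps_cond."""
    return eps_neg + guidance_scale * (eps_cond - eps_neg)

def split_clean_and_tail(matrix, k, alpha):
    """Return the filtered matrix and the residual that the filter removed."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_clean = s.copy()
    s_clean[k:] = np.exp(-alpha * s[k:]) * s[k:]
    clean = (u * s_clean) @ vt
    tail = (u * (s - s_clean)) @ vt  # assumed form of the "isolated tail"
    return clean, tail

def toy_denoiser(condition):
    """Hypothetical stand-in for a noise prediction conditioned on style tokens."""
    return condition.mean(axis=0)

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 64))
clean, tail = split_clean_and_tail(K, k=4, alpha=1.0)

eps_cond = toy_denoiser(clean)  # conditional branch: purified style
eps_neg = toy_denoiser(tail)    # negative branch: suppressed content signal
guided = cfg_combine(eps_cond, eps_neg, guidance_scale=5.0)
```

With a zero-vector negative, the difference term only measures "style versus nothing"; with the tail as the negative, it measures "clean style versus leaked content", which is the contrast the method is after.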

D. Integration

CleanStyle is designed to be plug-and-play. It can be integrated into existing encoder-based pipelines (e.g., InstantStyle, DEADiff, StyleShot) by operating directly on the $K$ and $V$ matrices within the cross-attention layers, requiring no retraining or architectural changes.
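
To illustrate what "operating directly on the $K$ and $V$ matrices" could look like, here is a toy single-head cross-attention with an optional purification hook. Everything in it is an assumption for illustration: the `purify` parameter, the shapes, and `truncate_tail`, which hard-zeroes the tail as a simplified stand-in for the exponential attenuation. It is not the integration code for InstantStyle, DEADiff, or StyleShot.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k_mat, v_mat, purify=None):
    """Scaled dot-product attention; `purify` (if given) is applied to K and V first."""
    if purify is not None:
        k_mat, v_mat = purify(k_mat), purify(v_mat)
    scores = q @ k_mat.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_mat

def truncate_tail(matrix, rank=4):
    """Simplified stand-in for CS-SVD: keep only the top-`rank` singular values."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s = s.copy()
    s[rank:] = 0.0
    return (u * s) @ vt

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 32))   # image-side queries (assumed shape)
keys = rng.normal(size=(16, 32))     # style-token keys
values = rng.normal(size=(16, 32))   # style-token values

baseline = cross_attention(queries, keys, values)
purified = cross_attention(queries, keys, values, purify=truncate_tail)
```

Because the hook only transforms the inputs to attention, the layer's weights and the rest of the pipeline are left untouched, which is what makes a training-free drop-in plausible.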

3. Key Contributions

Empirical Analysis: Identified that content leakage in encoder-based diffusion models is primarily encoded in the tail components of the style embedding's singular spectrum.
CS-SVD: Proposed a training-free filtering scheme that dynamically suppresses tail components using exponential attenuation under a time-aware sigmoid schedule, effectively separating style from content.
SS-CFG: Introduced a novel guidance mechanism that repurposes the suppressed tail components as targeted negative conditions, replacing generic zero vectors to enhance prompt fidelity.
Generalizability: Demonstrated that the method is lightweight, interpretable, and compatible with multiple state-of-the-art (SOTA) stylization frameworks without retraining.

4. Experimental Results

The authors evaluated CleanStyle on benchmarks including StyleBench and a curated CleanStyle dataset (100 style images, 52 prompts).

Qualitative Improvements:
- Reduced Leakage: Significantly mitigated the appearance of reference objects (e.g., preventing a "green tree" from appearing in a prompt for a "house covered in ice").
- Prompt Alignment: Generated images adhered more strictly to text instructions compared to baselines like InstantStyle, CSGO, and DEADiff.
- Visual Quality: Maintained high stylistic fidelity (texture, color) without the structural distortions seen in other methods.
Quantitative Metrics:
- CLIP Text Alignment (TA): Consistently achieved the highest scores, indicating superior prompt adherence.
- Style Similarity (SS): Slightly lower scores on CLIP-SS and DINO-SS compared to some baselines. The authors argue this is a favorable trade-off: existing metrics often conflate content leakage with style similarity (i.e., copying a cat from the reference image boosts the score even if the prompt asked for a dog). CleanStyle prioritizes semantic correctness over raw similarity to the reference image.
User Study: In a study with 43 participants and 2,580 judgments, CleanStyle received the highest preference for text alignment, style retention, and overall image quality.
Efficiency: The method introduces negligible inference overhead, with computation times comparable to the original baselines.
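
Embedding-based metrics such as the text-alignment and style-similarity scores above are commonly computed as cosine similarity between feature vectors. The sketch below shows only that final step. The three vectors are made-up placeholders, not CLIP or DINO outputs, and the paper's exact evaluation protocol is not specified in this summary.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings (hypothetical values for illustration only).
generated_image = np.array([0.9, 0.1, 0.4])
prompt_text = np.array([1.0, 0.0, 0.3])
style_reference = np.array([0.2, 0.8, 0.5])

text_alignment = cosine_similarity(generated_image, prompt_text)
style_similarity = cosine_similarity(generated_image, style_reference)
```

This also shows why leakage can inflate a style score: copying objects from the reference pulls the generated image's embedding toward the reference embedding, whether or not the style itself matches better.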

5. Significance

Paradigm Shift: Moves away from heavy fine-tuning or complex disentanglement training toward analytical signal processing (SVD) for style purification.
Practical Utility: Offers a "free lunch" solution for existing T2I pipelines, allowing developers to instantly improve style transfer quality without retraining models.
Theoretical Insight: Provides a new understanding of how style and content are encoded in diffusion model embeddings, suggesting that "noise" in the tail of the singular spectrum is actually structured semantic leakage.
Future Impact: Sets a precedent for using targeted negative guidance (SS-CFG) to solve specific generation artifacts, potentially applicable to other domains like image editing or identity preservation.

In summary, CleanStyle effectively solves the content leakage problem in text-to-image stylization by mathematically purifying style embeddings and leveraging those purged components to guide the generation process, resulting in images that are both stylistically rich and semantically accurate.