Original authors: Silpa Vadakkeeveetil Sreelatha, Dan Wang, Serge Belongie, Muhammad Awais, Anjan Dutta

Published 2026-06-17

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written or endorsed by the authors. For technical accuracy, refer to the original paper.

Imagine a giant, magical painting machine (a Text-to-Image AI) that has been trained on billions of pictures from the internet. You can tell it, "Draw a doctor," and it will paint one. But, because it learned from the real world, it has picked up on real-world biases. If you ask for a "doctor," it almost always paints a man in a white coat. If you ask for a "nurse," it almost always paints a woman.

The problem is, we usually only check for the big, obvious biases we already know about (like gender or race). But what about the weird, rare, or subtle things the machine could paint but just never chooses to? Maybe it knows how to paint a doctor with curly hair or a doctor in a vintage photo, but it keeps ignoring those ideas.

Enter RAIGen.

Think of the AI's brain as a massive library of "neurons" (little switches that light up when the AI thinks about specific things). Most of the time, the same few switches light up over and over again (the "popular" ideas). The rare ideas are hidden in switches that almost never get turned on.

How RAIGen Works (The "Matryoshka" Analogy)

The Russian Dolls (Matryoshka Sparse Autoencoders):
Imagine the AI's brain is a set of Russian nesting dolls. The outer dolls are big and cover broad ideas (like "a person"). The inner dolls are tiny and cover very specific details (like "a specific type of hat").
RAIGen uses a tool called a Matryoshka Sparse Autoencoder to open these dolls and sort the AI's tangled internal signals into separate, readable switches. It mostly works with the outermost, broadest doll, because the tiny inner dolls tend to split a single idea into fragile fragments.

The "Quiet Neuron" Detector:
RAIGen looks for the switches that are the quietest. It asks two questions:
- How rarely does this switch turn on? (If it only lights up 1% of the time, it's a "rare" idea.)
- Is this switch unique? (Does it represent something totally different from the average, or is it just a noisy glitch?)
If a switch is quiet and unique, RAIGen flags it as a "Rare Attribute."

The Treasure Hunt:
Once RAIGen finds these quiet switches, it looks at the pictures that made them light up. It might find a picture of a "female doctor in a black-and-white portrait" or a "train with huge smoke plumes." These are things the AI knows how to draw, but it usually skips them in favor of the "standard" versions.

What RAIGen Actually Found

The researchers tested this on popular AI models (like Stable Diffusion) and found:

It finds hidden gems: It discovered rare concepts that standard bias checkers missed, like specific hairstyles, cultural symbols, or unusual camera angles.
It works on different models: It found these rare traits in both older, smaller models and newer, massive ones.
It's not just about fairness: It found rare things that aren't about gender or race, but about style and context (like a "doctor holding a medical chart" vs. just a generic doctor).

The "Volume Knob" Effect

The coolest part is what happens next. Once RAIGen identifies these quiet switches, the researchers showed they could "turn up the volume" on them. By slightly changing the text prompt based on what RAIGen found, they could force the AI to draw these rare pictures much more often.

The Bottom Line

RAIGen is like a detective that listens to the AI's internal thoughts to find the ideas it is secretly holding back. It doesn't just tell us what the AI is bad at; it tells us what the AI is ignoring. This helps us understand the full range of what these models can do, not just the most common things they produce.

Important Note: The paper is careful to say this only finds things the AI already learned. If the AI never saw a picture of a specific rare cultural symbol during training, RAIGen won't find it. It only reveals the "hidden" parts of the library that are already there but rarely visited.

Technical Summary: RAIGen – Rare Attribute Identification in Text-to-Image Generative Models

1. Problem Statement

Text-to-image (T2I) diffusion models, such as Stable Diffusion, generate high-fidelity images but inherit and amplify biases from their training data. While prior work addresses this through closed-set approaches (mitigating biases in predefined categories like gender or race) or open-set approaches (identifying dominant majority attributes), a critical gap remains. Existing methods overlook the systematic underrepresentation of rare or minority attributes that are encoded within the model's internal representations but suppressed during generation. These attributes may be social, cultural, or stylistic and are not necessarily known a priori. Suppressing dominant attributes does not automatically amplify these underrepresented features; instead, it often reallocates probability mass unevenly. There is a need for a framework to systematically discover these "suppressed" minority attributes without relying on predefined labels.

2. Methodology: RAIGen Framework

RAIGen is a label-free framework designed to identify rare attributes directly from the internal latent representations of diffusion models. It operates through three primary stages:

2.1. Feature Decomposition via Matryoshka Sparse Autoencoders (MSAEs)

To map entangled internal representations into semantically interpretable features, RAIGen employs Matryoshka Sparse Autoencoders (MSAEs).

Input: The framework extracts bottleneck representations ($h_t$) from a diffusion model during the reverse sampling process (typically at the final timestep, where semantic information is most expressed).
Architecture: MSAEs decompose these representations into a hierarchy of sparse latent features ($z$) at varying levels of granularity.
Level Selection: While MSAEs offer fine-grained levels, RAIGen focuses on the coarsest level ($k_1$). Finer levels often fragment single concepts into localized, brittle part-features, whereas the coarsest level yields semantically and spatially coherent features suitable for identifying global attributes.
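
To make the idea of granularity levels concrete, the following is a minimal NumPy sketch of one common Matryoshka-style formulation, in which a single encoder/decoder pair is evaluated at several TopK sparsity budgets and the reconstruction errors are summed. Everything here is an assumption for illustration: the weights are random rather than trained, the names `topk_mask`, `msae_forward`, and `levels` are hypothetical, treating the smallest budget as the coarsest level $k_1$ is a guess, and the paper's actual architecture and training loss may differ.

```python
import numpy as np

def topk_mask(a, k):
    """Keep the k largest entries in each row of `a` and zero the rest."""
    # argpartition places the indices of the k largest values in the last k slots
    idx = np.argpartition(a, -k, axis=1)[:, -k:]
    out = np.zeros_like(a)
    rows = np.arange(a.shape[0])[:, None]
    out[rows, idx] = a[rows, idx]
    return out

def msae_forward(h, W_enc, b_enc, W_dec, b_dec, levels):
    """Return {k: (z_k, h_hat_k)} for every sparsity budget k in `levels`."""
    # Shared ReLU pre-activations; only the TopK budget changes per level
    pre = np.maximum((h - b_dec) @ W_enc + b_enc, 0.0)
    out = {}
    for k in levels:
        z = topk_mask(pre, k)          # sparse code at this level
        h_hat = z @ W_dec + b_dec      # reconstruction from that code
        out[k] = (z, h_hat)
    return out

# Toy setup (assumption): 8 bottleneck vectors of width 16, 64 latent features
rng = np.random.default_rng(0)
n, d, m = 8, 16, 64
h = rng.normal(size=(n, d))
W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)
b_enc = np.zeros(m)
b_dec = np.zeros(d)

levels = (4, 8, 16)  # hypothetical budgets; the smallest stands in for k_1
outputs = msae_forward(h, W_enc, b_enc, W_dec, b_dec, levels)

# Matryoshka-style objective: sum the reconstruction error over all levels
loss = sum(np.mean((h - h_hat) ** 2) for _, h_hat in outputs.values())
for k, (z, h_hat) in outputs.items():
    print(k, int((z != 0).sum(axis=1).max()), float(np.mean((h - h_hat) ** 2)))
```

Because every level reuses the same pre-activations, the features active at a small budget are a subset of those active at a larger one, which is what gives the nested, doll-like structure in this sketch.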

2.2. The Minority Score

To distinguish rare, meaningful attributes from noise or common features, RAIGen introduces a Minority Score ($s$) that combines two complementary signals:

Activation Frequency ($\nu$): The proportion of samples in a dataset where a specific neuron is active. Rare attributes correspond to neurons with low activation frequency.
Semantic Distinctiveness ($d$): The cosine distance between the neuron's activation-weighted CLIP centroid ($\mu_i$) and the global dataset centroid ($\mu_{D_c}$). This ensures the neuron represents a concept that deviates significantly from the dataset's average semantics.

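The score is calculated as:
$s(z) = d \odot (1 - \nu)$
Here $\odot$ denotes the element-wise product taken across neurons, so each neuron's score is its distinctiveness multiplied by one minus its activation frequency. This formulation prioritizes neurons that are both infrequently active and semantically distinct.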
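
As a rough illustration of the score, the sketch below computes $\nu$, $d$, and $s$ for a toy activation matrix. It assumes that a neuron counts as "active" when its activation is positive and that the CLIP embeddings are already available as an array. The function name `minority_scores`, the toy data, and the choice to zero out neurons that never fire are assumptions for this example, not details taken from the paper.

```python
import numpy as np

def minority_scores(acts, clip_emb, eps=1e-12):
    """acts: (n_samples, n_neurons) non-negative activations.
    clip_emb: (n_samples, dim) embeddings of the corresponding images."""
    # Activation frequency: share of samples on which each neuron fires
    nu = (acts > 0).mean(axis=0)

    # Activation-weighted centroid per neuron (weights sum to 1 per neuron)
    weights = acts / (acts.sum(axis=0, keepdims=True) + eps)
    mu = weights.T @ clip_emb                      # (n_neurons, dim)
    mu_global = clip_emb.mean(axis=0)              # dataset centroid

    # Semantic distinctiveness: cosine distance to the dataset centroid
    cos = (mu @ mu_global) / (
        np.linalg.norm(mu, axis=1) * np.linalg.norm(mu_global) + eps
    )
    d = 1.0 - cos

    # s = d * (1 - nu), element-wise over neurons
    s = d * (1.0 - nu)
    # Assumption: neurons that never fire carry no evidence, so score them 0
    s = np.where(nu > 0, s, 0.0)
    return s, nu, d

# Toy data: five "typical" samples and one outlier in a 2-D embedding space
clip_emb = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]])
acts = np.zeros((6, 3))
acts[:, 0] = 1.0   # neuron 0 fires on every sample (common)
acts[5, 1] = 2.0   # neuron 1 fires only on the outlier (rare and distinct)
                   # neuron 2 never fires

s, nu, d = minority_scores(acts, clip_emb)
print(np.round(s, 3), np.round(nu, 3))
```

In this toy case the always-on neuron scores zero because $1 - \nu = 0$, while the neuron tied to the outlier sample receives the highest score.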

2.3. Discovery and Verification Pipeline

Ranking: Neurons are ranked by their Minority Score.
De-duplication: To ensure a compact and diverse set, redundancy is assessed via pairwise cosine distances between neuron centroids. Neurons with similar semantics are filtered out, retaining only the highest-scoring representative.
Interpretability: The top-activating images for selected neurons are visualized with MSAE activation heatmaps. Human-readable annotations are generated using Multi-Modal Large Language Models (MLLMs) to describe the discovered attributes.
Label-Free Nature: The process requires no predefined minority categories or attribute labels, relying solely on the model's internal representations and a pretrained semantic prior (CLIP) for distinctiveness measurement.
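
The ranking and de-duplication steps above could be sketched as a greedy filter: walk the neurons from highest to lowest score and keep one only if its centroid is far enough, in cosine distance, from every neuron already kept. The threshold `min_dist`, the function names, and the toy inputs are hypothetical; the paper's exact filtering rule is not specified here.

```python
import numpy as np

def select_diverse(scores, centroids, min_dist=0.2, top_n=None):
    """Greedy ranking + de-duplication over neuron centroids."""
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    order = np.argsort(-scores)        # highest Minority Score first
    kept = []
    for i in order:
        # Keep neuron i only if it is distinct from everything kept so far
        if all(1.0 - float(unit[i] @ unit[j]) >= min_dist for j in kept):
            kept.append(int(i))
        if top_n is not None and len(kept) == top_n:
            break
    return kept

def top_activating(acts, neuron, n=5):
    """Indices of the n samples that activate `neuron` most strongly."""
    return np.argsort(-acts[:, neuron])[:n].tolist()

# Toy example: neurons 0 and 1 are near-duplicates, neuron 2 is distinct
scores = np.array([0.9, 0.8, 0.5])
centroids = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
kept = select_diverse(scores, centroids)

# Toy activations for a single neuron across four samples
acts_demo = np.array([[0.1], [0.0], [0.9], [0.5]])
top = top_activating(acts_demo, neuron=0, n=2)
print(kept, top)
```

Because the walk is in score order, the first member of any group of similar neurons is always the highest-scoring one, which matches the "retain only the highest-scoring representative" rule. The samples returned by `top_activating` are the ones that would then be visualized and annotated.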

3. Key Contributions

First Label-Free Framework: RAIGen is the first framework, to the authors' knowledge, for rare attribute identification in diffusion models that does not require predefined minority categories. It shifts the focus from identifying dominant biases to discovering suppressed minority features encoded in the model.
Novel Minority Metric: The paper proposes a metric combining neuron activation frequency and semantic distinctiveness, validated as a reliable proxy for identifying underrepresented concepts.
Systematic Auditing and Amplification: RAIGen demonstrates the ability to reveal attributes beyond fixed fairness categories across multiple architectures (Stable Diffusion v1.4, Stable Diffusion 2, SDXL, and FLUX.1-schnell). It further enables the targeted amplification of these rare attributes through lightweight prompt interventions.

4. Experimental Results

The authors evaluated RAIGen on Stable Diffusion v1.4, SDXL, and FLUX.1-schnell using WinoBias (occupation-based) and COCO (diverse scenes) prompts.

Validation of Heuristic: In a controlled toy setting with known ground-truth rare factors, the authors demonstrated that the least frequently activated latents in MSAEs disproportionately map to rare features (Spearman correlation $\rho \approx 0.991$), validating activation frequency as a proxy for rarity.
Quantitative Performance: On WinoBias and COCO, attributes discovered by RAIGen showed significantly lower "Attribute Presence" (approx. 0.20) compared to majority attributes identified by OpenBias (approx. 0.94), confirming that RAIGen isolates suppressed features.
Qualitative Findings: RAIGen successfully uncovered diverse minority attributes, including:
- Social/Fairness: Female doctors, female sheriffs.
- Contextual/Stylistic: Doctors in framed portraits, doctors with medical charts, trains with large smoke plumes, side-view trains with motion blur, and women with afro-curly textured hair.
- Cross-Architecture: The method generalized to transformer-based models (FLUX.1-schnell), though with slightly more diffuse heatmaps compared to U-Net architectures.
User Study: A study with 25 participants confirmed that RAIGen-discovered attributes are perceptually rare. For five professions, participants reported the presence of these attributes in fewer than 3 out of 10 generated images on average, validating that these features are encoded but underexpressed in standard sampling.
Amplification: Using RAIGen's annotations to revise prompts resulted in a substantial reduction in attribute-presence deviation (e.g., from 0.50 to 0.22 in SD v1.4), demonstrating that reintroducing minority descriptors effectively increases the coverage of underrepresented modes.
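
The prompt-level amplification described above can be pictured with a small sketch. It assumes the intervention is as simple as appending a discovered descriptor to the base prompt, and that attribute presence is the fraction of generated images in which some detector reports the attribute. Both functions, the detector flags, and the toy values are hypothetical and are not the paper's metric or data.

```python
def revise_prompt(prompt, descriptor):
    """Append a discovered rare-attribute descriptor to a base prompt."""
    base = prompt.rstrip().rstrip(".")
    return f"{base}, {descriptor}"

def attribute_presence(detections):
    """Fraction of images in which the attribute was detected.

    `detections` is a sequence of truthy/falsy flags, one per generated
    image, assumed to come from some external classifier or VQA model.
    """
    if not detections:
        raise ValueError("need at least one detection flag")
    return sum(bool(x) for x in detections) / len(detections)

# Toy usage with made-up flags (not results from the paper)
revised = revise_prompt("a photo of a doctor.", "holding a medical chart")
before = attribute_presence([1, 0, 0, 0, 1])
after = attribute_presence([1, 1, 0, 1, 1])
print(revised, before, after)
```

The point of the sketch is only that the intervention touches the prompt text and leaves the model weights unchanged, so its effect can be checked by regenerating images and recomputing presence.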

5. Significance and Claims

The paper positions RAIGen as a foundational tool for comprehensive auditing of generative models. Its significance lies in:

Bridging the Gap: It addresses the limitation of current bias mitigation tools that only handle known categories or majority trends, enabling the discovery of "unknown unknowns" in the model's latent space.
Representation-Grounded Discovery: By grounding discovery in internal model representations rather than external world models, RAIGen identifies features that the model has actually learned but chooses not to express.
Practical Utility: It provides a mechanism for targeted amplification, allowing developers to systematically improve the diversity of generated outputs without retraining the model.

The authors maintain a modest stance, noting that RAIGen can only surface attributes already encoded in the model; it cannot discover socially salient minorities that the model has failed to learn entirely. They suggest that RAIGen should be viewed as a component of a hybrid auditing framework, complementing external semantic priors (like LLM-based tools) to provide a complete view of underrepresentation. They also acknowledge potential misuse, such as the deliberate generation of sensitive or stereotyped imagery, emphasizing the need for governance and safeguards.

RAIGen: Rare Attribute Identification in Text-to-Image Generative Models