CompDiff: Hierarchical Compositional Diffusion for Fair and Zero-Shot Intersectional Medical Image Generation

Imagine you are a chef trying to cook a massive banquet for a diverse group of people. You have a recipe book (your AI model) and a pantry full of ingredients (medical images).

The problem? Your pantry is imbalanced. You have thousands of photos of "young, white, male" patients, but only a handful of "elderly, Asian, female" patients. In fact, for some specific combinations, you have zero photos at all.

When you try to teach your AI chef to generate new images of these missing groups, it struggles. It's like asking the chef to cook a dish they've never seen using ingredients they don't have. The result? The AI makes great pictures for the common groups, but the pictures for the rare groups look blurry, weird, or just wrong. This is what the authors call the "Imbalanced Generator Problem."

The Old Way: "Just Try Harder"

Previous attempts to fix this were like telling the chef: "Hey, when you cook for the rare groups, try really hard! Don't worry about the common groups, focus on the rare ones!"

This is called loss reweighting. It's an optimization trick. But it has a fatal flaw: You can't teach someone to cook a dish they've never seen, no matter how much you yell at them. If the AI has never seen an "elderly Asian female" in the training data, no amount of "trying harder" will magically create that image.

The New Way: CompDiff (The "Lego" Approach)

The authors propose a new framework called CompDiff. Instead of just telling the AI to try harder, they change how the AI understands the ingredients.

They realized that demographic identity is compositional, just like building with Lego bricks.

You know what a "red brick" looks like (Age).
You know what a "blue brick" looks like (Race).
You know what a "square brick" looks like (Sex).

Even if you've never built a "Red-Blue-Square" tower before, you can still build it because you understand the individual pieces and how they fit together.

How CompDiff Works (The "Specialized Architect")

In standard AI models, the chef tries to remember everything in one giant, messy list. If the list is too long, the rare items get forgotten.

CompDiff introduces a Hierarchical Conditioner Network (HCN). Think of this as a specialized architect who helps the chef:

Break it Down: Instead of treating "80-year-old Asian Female" as one giant, confusing concept, the architect breaks it into three simple parts: Age, Race, and Sex.
Learn the Parts: The AI learns these parts separately. It gets really good at recognizing "Asian" and really good at recognizing "Female."
Build the Interactions: The architect then teaches the AI how these parts interact. "Okay, when 'Asian' and 'Female' are together, here's how they look."
Compose the New: When the AI needs to generate an image for a group it has never seen (e.g., "80-year-old Asian Female"), it doesn't panic. It simply grabs the "Asian" brick, the "Female" brick, and the "80-year-old" brick it already knows, and snaps them together to build the new image.

Why This Matters

The paper tested this on two types of medical images: Chest X-rays and Eye (Fundus) images.

Better Quality: The images generated by CompDiff looked much sharper and more realistic than previous methods, especially for the rare groups.
Fairness: The quality gap between the rare groups and the common groups shrank considerably, although it did not disappear entirely.
Zero-Shot Magic: The most impressive part? The AI was tested on groups it was completely forbidden from seeing during training. Because it understood the "Lego bricks" (the individual traits), it could still build the tower correctly. It was like a child who had built red-and-blue towers and blue-and-square towers, and then successfully built a red-blue-square tower they had never seen before, just by understanding how the bricks fit together.

The Bottom Line

The authors show that the secret to fair AI isn't just about giving the AI more data or punishing it for mistakes. It's about teaching it how to think.

By teaching the AI to break down complex human identities into simple, understandable parts and reassemble them, CompDiff aims to ensure that medical AI can serve everyone, not just the people who show up most often in the data. This means better diagnostic tools for underrepresented populations, leading to a healthier, fairer world.

1. Problem Statement: The Imbalanced Generator Problem

The paper addresses a critical gap in medical AI: while generative models (specifically diffusion models) are used to augment datasets for fairness, the models themselves often fail to generate high-quality images for rare or unseen demographic intersections.

The Core Issue: Standard diffusion models encode demographics implicitly within text prompts (e.g., "80-year-old Asian female"). In these models, demographic tokens compete with clinical tokens for a limited context budget (e.g., the 77-token context length of CLIP's text encoder).
The Limitation: When training data lacks specific intersections (e.g., no examples of "80+ Asian females with a specific pathology"), standard models cannot learn these combinations.
Failure of Existing Remedies: Optimization-level fixes like FairDiffusion (which uses loss reweighting) fail because they cannot generate learning signals for combinations that do not exist in the training data. They rely on implicit encoding, which struggles with rare intersections.
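To see why reweighting alone cannot help an absent subgroup, consider a minimal sketch of generic inverse-frequency reweighting. This is an illustrative stand-in, not FairDiffusion's actual scheme; the subgroup labels and the constant per-sample losses are made up for the example.

```python
from collections import Counter

def inverse_frequency_weights(groups):
    """Weight each observed subgroup by n / (k * count); unseen groups get no entry."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return {g: n / (k * c) for g, c in counts.items()}

def weighted_share(groups, losses, weights):
    """Fraction of the reweighted loss contributed by each subgroup."""
    totals = Counter()
    for g, loss in zip(groups, losses):
        totals[g] += weights[g] * loss
    grand = sum(totals.values())
    return {g: t / grand for g, t in totals.items()}

# Hypothetical training set: two observed subgroups, one subgroup entirely absent.
train_groups = ["young/white/male"] * 8 + ["old/white/female"] * 2
losses = [1.0] * len(train_groups)  # placeholder per-sample diffusion losses
unseen = "80+/asian/female"

weights = inverse_frequency_weights(train_groups)
shares = weighted_share(train_groups, losses, weights)

print(shares)             # the two observed groups now contribute equally
print(unseen in weights)  # the absent group has no samples, hence no term in the loss
```

Reweighting balances the groups that are present, but a subgroup with zero samples contributes zero terms to the sum, whatever weight one might wish to assign it.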

2. Methodology: CompDiff Framework

The authors propose CompDiff, a framework that shifts the solution from the optimization level to the representation level. The core insight is that demographic identity is compositional: a rare intersection can be constructed from well-learned single attributes and pairwise interactions.

Key Architectural Components

Hierarchical Conditioner Network (HCN):
Instead of relying solely on text prompts, CompDiff processes demographic attributes (Age, Sex, Race) through a dedicated HCN.
- Single-Attribute Embeddings ("Grandparents"): Individual attributes are embedded into a shared latent space ($e_{age}, e_{sex}, e_{race}$).
- Pairwise Interactions ("Parents"): Dedicated MLPs model non-additive relationships between pairs of attributes (e.g., $f_{age,sex}$), capturing interactions that simple addition misses.
- Full Composition ("Child"): A final MLP combines these pairwise interactions to produce a holistic demographic representation ($h_{demo}$).
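The three levels can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions, not the authors' implementation: the category lists, the embedding width `D`, the single tanh layer standing in for each MLP, and the random (untrained) weights are all hypothetical.

```python
import math
import random

random.seed(0)
D = 4  # embedding width (illustrative; the source does not give it)

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def make_mlp(d_in, d_out):
    """One tanh layer standing in for an MLP; weights are random, not trained."""
    return [[random.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]

def apply_mlp(weights, x):
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

# Level 1 ("grandparents"): one embedding table per attribute.
tables = {
    "age": {c: rand_vec(D) for c in ["<40", "40-60", "60-80", "80+"]},
    "sex": {c: rand_vec(D) for c in ["female", "male"]},
    "race": {c: rand_vec(D) for c in ["asian", "black", "white"]},
}
# Level 2 ("parents"): one MLP per attribute pair.
PAIRS = [("age", "sex"), ("age", "race"), ("sex", "race")]
pair_mlps = {p: make_mlp(2 * D, D) for p in PAIRS}
# Level 3 ("child"): one MLP over the concatenated pairwise features.
full_mlp = make_mlp(len(PAIRS) * D, D)

def hcn(age, sex, race):
    """Return the single-attribute embeddings and the composed h_demo."""
    e = {"age": tables["age"][age], "sex": tables["sex"][sex], "race": tables["race"][race]}
    pair_feats = []
    for a, b in PAIRS:
        # e[a] + e[b] is list concatenation here, i.e. the pair MLP sees both embeddings.
        pair_feats += apply_mlp(pair_mlps[(a, b)], e[a] + e[b])
    h_demo = apply_mlp(full_mlp, pair_feats)
    return e, h_demo

e, h_demo = hcn("80+", "female", "asian")
print(len(h_demo))
```

Because every intersection is routed through the same embedding tables and pair MLPs, a combination that never appears in training still reuses parameters learned from other subgroups.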
Structured Factorization & Latent Projection:
- The final representation $h_{demo}$ is mapped to the parameters $(\mu, \log \sigma)$ of a diagonal Gaussian distribution.
- A latent vector $z$ is sampled (using reparameterization) and projected into the cross-attention dimension to produce a demographic token $c$, which is concatenated with the clinical text embeddings.
- This structured factorization encourages parameter sharing across subgroups, improving data efficiency for rare intersections.
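As a minimal sketch of the sampling and projection step: the widths `LATENT` and `CROSS`, the random projection matrix `W`, the example values of `mu` and `log_sigma`, and the choice to append $c$ as one extra token along the sequence axis are all assumptions made for illustration; the source only states that $c$ is concatenated with the text embeddings.

```python
import math
import random

def reparameterize(mu, log_sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), so gradients can flow through mu and log_sigma."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]

def project(weights, z):
    """Linear map from the latent width to the cross-attention width."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

rng = random.Random(0)
LATENT, CROSS = 4, 6  # illustrative widths

mu = [0.5, -0.2, 0.0, 1.0]             # stand-in for the mean predicted from h_demo
log_sigma = [-1.0, -1.0, -2.0, -0.5]   # stand-in for the predicted log std
W = [[rng.gauss(0.0, 0.1) for _ in range(LATENT)] for _ in range(CROSS)]

z = reparameterize(mu, log_sigma, rng)
c = project(W, z)  # the demographic token the UNet attends to

# Stand-in for the text-encoder output: 77 token embeddings of width CROSS.
text_tokens = [[0.0] * CROSS for _ in range(77)]
context = text_tokens + [c]  # assumed: concatenation along the sequence axis

print(len(context), len(context[-1]))
```

Keeping the demographic signal in its own token means it no longer has to share the text prompt's token budget with the clinical description.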
Training Objective:
The model is trained end-to-end with a composite loss function:
$L = L_{diff} + \lambda_{comp}L_{comp} + \lambda_{aux}L_{aux} + \lambda_{KL}L_{KL}$
- $L_{diff}$ : Standard diffusion loss.
- $L_{comp}$ (Compositional Consistency): A soft anchor term, $1 - \cos(h_{demo}, e_{age}+e_{sex}+e_{race})$, that stabilizes training toward an additive baseline while allowing non-additive interactions.
- $L_{aux}$ (Auxiliary Classification): Crucially, this loss is applied to the projected token $c$ (the input the UNet actually sees), not the pre-projection latent $\mu$. This ensures the demographic information survives the projection and remains informative to the diffusion model.
- $L_{KL}$ : Regularizes the variational latent toward a standard normal distribution.
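The individual terms can be written out as a short sketch. The cosine anchor follows the formula given above; the KL term uses the standard closed form for a diagonal Gaussian against $N(0, I)$; treating $L_{aux}$ as a softmax cross-entropy over attribute logits computed from $c$ is an assumption, and the default $\lambda$ values are placeholders rather than the paper's settings.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def comp_loss(h_demo, e_age, e_sex, e_race):
    """L_comp = 1 - cos(h_demo, e_age + e_sex + e_race)."""
    additive = [a + s + r for a, s, r in zip(e_age, e_sex, e_race)]
    return 1.0 - cosine(h_demo, additive)

def kl_loss(mu, log_sigma):
    """KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(2.0 * ls) - 1.0 - 2.0 * ls
                     for m, ls in zip(mu, log_sigma))

def cross_entropy(logits, target):
    """Assumed form of L_aux: softmax cross-entropy on logits predicted from token c."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def total_loss(l_diff, l_comp, l_aux, l_kl, lam_comp=0.1, lam_aux=0.1, lam_kl=1e-3):
    """L = L_diff + lam_comp * L_comp + lam_aux * L_aux + lam_KL * L_KL (lambdas are placeholders)."""
    return l_diff + lam_comp * l_comp + lam_aux * l_aux + lam_kl * l_kl

# h_demo parallel to the additive baseline gives a consistency loss of (numerically) zero.
print(comp_loss([2.0, 4.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]))
# A latent that already matches N(0, I) incurs no KL penalty.
print(kl_loss([0.0, 0.0], [0.0, 0.0]))
```

Since $L_{comp}$ depends only on the angle between $h_{demo}$ and the additive sum, it leaves the magnitude free and only penalizes directional drift, which is one way to read the "soft anchor" description.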

3. Key Contributions

Representation-Level Solution: Proposes the first framework to address the "imbalanced generator problem" by explicitly modeling demographic composition rather than just reweighting training samples.
Hierarchical Conditioner Network (HCN): Introduces a novel architecture that decomposes demographics into single attributes and pairwise interactions, enabling zero-shot generalization to unseen intersections (e.g., generating "80+ Asian females" even if that specific group was absent in training).
Auxiliary Supervision Strategy: Demonstrates that auxiliary classification loss must be applied to the final projected token ($c$) rather than the latent mean ($\mu$) to effectively guide the UNet.
Zero-Shot Intersectional Generalization: Validates that the model can compose representations for unseen demographic combinations using learned single-attribute and pairwise embeddings.
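The premise behind zero-shot composition can be checked with a small sketch: a held-out intersection is composable only if its single attributes and attribute pairs all occur somewhere in the training set. The attribute values and the `coverage` helper below are hypothetical and serve only to illustrate that condition.

```python
from itertools import combinations, product

ATTRS = ("age", "sex", "race")

def parts(combo):
    """All single attributes and attribute pairs contained in one full intersection."""
    items = list(zip(ATTRS, combo))
    singles = {(item,) for item in items}
    pairs = set(combinations(items, 2))
    return singles | pairs

def coverage(held_out, train_combos):
    """Fraction of the held-out intersection's singles and pairs seen in training."""
    seen = set()
    for combo in train_combos:
        seen |= parts(combo)
    needed = parts(held_out)
    return len(needed & seen) / len(needed)

# Hypothetical attribute values.
all_combos = list(product(["<80", "80+"], ["female", "male"], ["asian", "white"]))
held_out = ("80+", "female", "asian")
train = [c for c in all_combos if c != held_out]

print(coverage(held_out, train))                     # every building block is still observed
print(coverage(held_out, [("<80", "male", "white")]))  # nothing to compose from
```

When coverage is complete, the embeddings and pairwise interactions needed for the unseen group have all received training signal from other subgroups; when it is not, compositional conditioning has less to build on.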

4. Experimental Results

The method was evaluated on Chest X-rays (MIMIC-CXR) and Fundus images (FairGenMed), comparing against standard fine-tuning and FairDiffusion.

Image Quality: CompDiff achieved superior Fréchet Inception Distance (FID) scores (64.3 for Chest X-ray vs. 75.1 for FairDiffusion) and better disease classification AUROC (0.82 vs. 0.74), indicating better alignment with clinical features.
Fairness (Equity-Scaled FID): CompDiff significantly reduced quality disparities across subgroups. It achieved the lowest ES-FID for sex, race, and age; together with the better overall FID, this suggests that the fairness improvements did not come at the cost of majority-group performance.
Zero-Shot Performance: On held-out intersectional subgroups (completely removed from training), CompDiff improved FID by up to 21% compared to baselines. Notably, FairDiffusion performed worse than the baseline on some rare intersections, confirming that loss reweighting fails without structural inductive bias.
Downstream Utility: Classifiers trained on CompDiff-generated data showed higher AUROC and reduced demographic bias (lower Equalized Odds Difference and underdiagnosis rates) when tested on real data.
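For readers unfamiliar with the downstream fairness metric, here is a minimal sketch of Equalized Odds Difference using its common definition: the larger of the between-group gaps in true-positive rate and false-positive rate. The labels, predictions, and group names are toy values, and the paper's exact evaluation protocol may differ.

```python
def tpr_fpr(y_true, y_pred):
    """True-positive and false-positive rates; assumes both classes are present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

def equalized_odds_difference(y_true, y_pred, groups):
    """Max over {TPR gap, FPR gap} between the best and worst subgroup."""
    per_group = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        per_group[g] = tpr_fpr([y_true[i] for i in idx], [y_pred[i] for i in idx])
    tprs = [r[0] for r in per_group.values()]
    fprs = [r[1] for r in per_group.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

# Toy data: the classifier is perfect on group A and noisy on group B.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

eod = equalized_odds_difference(y_true, y_pred, groups)
print(eod)
```

A value of 0 means the classifier's error rates are identical across subgroups; larger values indicate that some group is systematically missed or over-flagged.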

5. Significance and Conclusion

Architectural Inductive Bias: The paper establishes that the structure of how demographics are conditioned is more critical for fairness than the optimization strategy. By explicitly modeling compositionality, the model can generalize to data distributions it has never seen.
Clinical Impact: The ability to generate high-quality, fair synthetic data for rare demographic intersections allows for the training of more equitable diagnostic AI systems, addressing a major bottleneck in medical AI deployment.
Limitations: The approach assumes structured demographic attributes (discrete categories) and does not yet extend to continuous or unstructured attributes. Additionally, while it improves equity, it does not fully eliminate the performance gap relative to well-represented groups.

In summary, CompDiff reframes fair medical image generation, moving from optimization-level reweighting to representation-level compositional modeling in order to address the fundamental challenge of generating data for unseen demographic intersections.
