The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models

This paper investigates how prompt complexity influences the quality, diversity, and consistency of synthetic data from text-to-image models. It finds that while increased complexity reduces diversity and consistency, it narrows the distribution shift from real data; prompt expansion emerges as a superior intervention, enhancing both image diversity and aesthetics beyond real-world benchmarks.

Zhang Xiaofeng, Aaron Courville, Michal Drozdzal, Adriana Romero-Soriano

Published 2026-02-24

Imagine you have a magical artist named AI. This artist is incredibly talented at painting pictures based on your descriptions (prompts). You can tell them, "Draw a cat," and they will. You can also say, "Draw a fluffy orange cat sitting on a red velvet cushion in a sunlit room," and they will try to do that too.

This paper is like a giant study on how the complexity of your instructions affects the quality, variety, and accuracy of the paintings this AI artist creates. The researchers wanted to know: Does giving the AI more details make it a better artist, or does it confuse it?

Here is the breakdown of their findings using simple analogies:

1. The Three Rules of a Good Painting

The researchers judged the AI's work on three main things:

  • Quality (Beauty): Does the picture look good? Is it pretty and realistic?
  • Diversity (Variety): If you ask for "a cat" 10 times, do you get 10 different cats, or 10 identical clones?
  • Consistency (Listening): If you asked for a "blue cat," did it actually draw a blue cat, or did it ignore you and draw a red dog?
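In practice, these three axes are usually scored automatically rather than by human judges. Here is a minimal sketch of the diversity and consistency measures, assuming the prompt and each image have already been mapped to embedding vectors by some encoder such as CLIP (the paper's exact metrics may differ):

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def diversity(image_embs):
    """Average pairwise dissimilarity among images: higher = more varied."""
    n = len(image_embs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(1 - cosine(image_embs[i], image_embs[j]) for i, j in pairs) / len(pairs)

def consistency(prompt_emb, image_embs):
    """Average prompt-image similarity: higher = closer to the prompt."""
    return sum(cosine(prompt_emb, e) for e in image_embs) / len(image_embs)
```

With these definitions, ten identical clones of the same cat would score zero diversity, while images that ignore the prompt would score low consistency.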

2. The "General vs. Specific" Trap

The biggest discovery was about how hard it is for the AI to understand different types of instructions.

  • The "AND" Problem (Specific Instructions): If you tell the AI, "Draw a black dog," you are asking it to combine two ideas (Black + Dog). The paper found that the AI is actually pretty good at this. It's like asking a chef to add salt and pepper to a soup: combining specific ingredients is easy.
  • The "OR" Problem (General Instructions): If you tell the AI, "Draw a dog" (without saying what kind), the AI has to imagine any dog. The paper found this is surprisingly hard for the AI.
    • The Analogy: Imagine the AI has a library of photos of "Black Dogs" and "White Dogs." If you ask for a "Dog," the AI tries to mash those two photos together. Instead of picking one specific dog, it often creates a blurry, average-looking dog that looks like a mix of all possibilities. It struggles to pick one specific path when the path isn't clearly defined.

Key Takeaway: It is harder for the AI to generalize (be vague) than to be specific. When you give it a vague prompt, it tends to produce "average" or "boring" results.
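The mode-averaging intuition can be made concrete with a toy calculation: averaging two distinct "mode" embeddings yields a point that matches neither mode. The vectors below are made up purely for illustration:

```python
def mean_vector(vectors):
    """Element-wise mean: the 'average dog' the model falls back to."""
    return [sum(vals) / len(vals) for vals in zip(*vectors)]

# Toy 2-D embeddings for two distinct modes the model has learned.
black_dog = [1.0, 0.0]
white_dog = [0.0, 1.0]

avg = mean_vector([black_dog, white_dog])  # lands between the two modes
```

The averaged point `[0.5, 0.5]` is equidistant from both modes but identical to neither, which is the vector-space analogue of the blurry, average-looking dog.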

3. The "Detail Dilemma"

The researchers tested what happens when you keep adding more and more details to the prompt.

  • The Sweet Spot: When you give a prompt with just a few details (e.g., "A cat on a mat"), the AI creates beautiful, varied, and accurate images.
  • The Overload: When you give a very long, complex prompt (e.g., "A cat on a mat, wearing a tiny hat, with a red collar, looking left, in a Victorian room with a chandelier..."), two things happen:
    1. Variety Drops: The AI gets so focused on following every single rule that it stops being creative. It stops making different kinds of cats and just makes the exact same cat over and over.
    2. Listening Drops: The AI starts to forget parts of your long list. It might draw the hat but forget the red collar.

4. The "Magic Expander" (Prompt Expansion)

The researchers found a clever trick to fix the "boring" problem. They used a second AI (a language model) to act as a creative assistant.

  • How it works: You tell the assistant, "The user wants a 'dog'." The assistant thinks, "Okay, let's give the artist more ideas!" and expands that into "A golden retriever playing fetch," "A poodle in a park," "A husky in the snow," etc.
  • The Result: By feeding these expanded, specific ideas to the image AI, they got much more variety (diversity) and better-looking pictures (quality) than if they had just asked for "a dog" directly.
  • The Catch: Sometimes, this "Magic Expander" gets too creative. It might add details the user didn't want, making the picture less faithful to the original simple request.
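A minimal sketch of this idea, with a hard-coded dictionary standing in for the language model (the names `SUBJECTS` and `expand_prompt` are illustrative, not from the paper; a real expander would query an LLM):

```python
import random

# Hypothetical stand-in for the LLM "creative assistant": in practice
# a language model would generate these specific variations on the fly.
SUBJECTS = {
    "dog": [
        "a golden retriever playing fetch",
        "a poodle trotting through a park",
        "a husky standing in fresh snow",
    ],
}

def expand_prompt(prompt, n=3, seed=0):
    """Turn one vague prompt into n specific prompts by sampling variations."""
    rng = random.Random(seed)
    options = SUBJECTS.get(prompt, [prompt])  # unknown prompts pass through unchanged
    return [rng.choice(options) for _ in range(n)]
```

Each expanded prompt is then sent to the image model separately, so the variety now lives in the prompts themselves rather than being left to the image model's imagination.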

5. The "Newer Models" Paradox

The paper looked at older AI models vs. newer, fancier ones.

  • Newer Models: They make incredibly beautiful, high-definition pictures (High Quality). However, they are sometimes too obedient. If you ask for a "dog," they might only draw a Golden Retriever because that's what they think is the "perfect" dog. They have lost some of their wild variety.
  • Older Models: They were a bit messier and less pretty, but they were more willing to try weird, different types of dogs.

The Big Conclusion

The paper suggests that Prompt Complexity is a dial you need to tune carefully.

  • If you want variety, you need to be specific (or use the "Magic Expander" to give the AI specific ideas).
  • If you want the AI to be creative, you can't just give it a vague command like "draw something cool." You have to guide it with enough detail to stop it from getting confused, but not so much detail that it gets stuck in a rut.

In short: The AI artist is amazing, but it needs clear instructions to be its best. If you are too vague, it gets confused and averages everything out. If you are too specific, it gets rigid. The secret sauce is finding the right balance, or letting a "creative assistant" help you write the perfect instructions.
