Imagine you have a magical paintbrush (like Stable Diffusion) that can turn your words into stunning pictures. But there's a catch: this paintbrush speaks a very specific, fancy language. If you say, "Draw a tree," it might give you a stick figure. But if you say, "A majestic oak tree with golden leaves, painted in the style of Van Gogh, with dramatic lighting and 8k resolution," it creates a masterpiece.
The problem is that most of us (the "novice users") only know how to say, "Draw a tree." We don't know the secret code the paintbrush loves.
This paper introduces a solution called UF-FGTG (User-Friendly Fine-Grained Text Generation). Think of it as a super-smart translator that sits between you and the magical paintbrush.
Here is how it works, broken down into simple concepts:
1. The Problem: The "Language Barrier"
The researchers noticed a big gap.
- You (The User): Speak in short, simple sentences ("A green tree").
- The AI Model: Was trained on long, detailed, fancy descriptions ("A green tree with moss, in a forest, impressionist style...").
Because of this mismatch, when you ask for a tree, the AI gets confused or gives you a boring result. It's like trying to order a complex meal at a fancy restaurant by just saying, "I'm hungry." The chef (the AI) doesn't know exactly what you want.
2. The Solution: A New Dictionary (The CFP Dataset)
To fix this, the team built a new "dictionary" called the CFP Dataset.
- The Analogy: Imagine they took thousands of beautiful, detailed paintings and their fancy descriptions. Then, they used a summarizer to turn those fancy descriptions back into simple sentences.
- The Result: They now have pairs of "Simple Request" + "Fancy Description" + "The Picture." This teaches the AI how to translate your simple words into the fancy language it loves.
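To make the idea of these "pairs" concrete, here is a minimal sketch of what one CFP-style record might look like. The field names, the `summarize` heuristic (just keeping the first clause), and the file name are all illustrative assumptions, not the paper's actual schema or summarization model:

```python
from dataclasses import dataclass

@dataclass
class CFPExample:
    simple_prompt: str    # what a novice user would type
    detailed_prompt: str  # the "fancy" prompt paired with the image
    image_path: str       # the picture that goes with both prompts

def summarize(detailed: str) -> str:
    # Toy stand-in for the real summarizer: keep only the first
    # comma-separated clause, dropping the style/quality modifiers.
    return detailed.split(",")[0].strip()

detailed = ("A green tree with moss growing on the ground, "
            "in a forest, impressionist painting style, 8k")
example = CFPExample(summarize(detailed), detailed, "tree_001.png")
print(example.simple_prompt)  # → A green tree with moss growing on the ground
```

Training on triples like this is what teaches the model the mapping from the simple request to the fancy description.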
3. The Translator: UF-FGTG Framework
This is the main invention. It's a system that takes your simple prompt and upgrades it. It has three special tools:
A. The Prompt Refiner (The "Translator")
This is the brain of the operation. You type "A green tree," and the Refiner rewrites it into "A green tree with moss growing on the ground, in a forest, impressionist painting style..."
- How it learns: It doesn't just guess words. It looks at the picture the AI is trying to make. If the picture looks like a cartoon, the Refiner knows to add words that make it look realistic. It's like a chef tasting the soup while cooking and adding salt until it's perfect.
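The "taste and adjust" idea can be sketched as a greedy refinement loop: keep appending whichever modifier most improves a scoring signal. In the paper that signal comes from the image being generated; here `score` is a hypothetical keyword-based stand-in, and the modifier list is invented for illustration:

```python
def refine(prompt, modifiers, score, steps=3):
    """Greedily append the modifier that most improves score(prompt)."""
    for _ in range(steps):
        if not modifiers:
            break
        best = max(modifiers, key=lambda m: score(prompt + ", " + m))
        if score(prompt + ", " + best) <= score(prompt):
            break  # no remaining modifier helps; stop "adding salt"
        prompt = prompt + ", " + best
        modifiers = [m for m in modifiers if m != best]
    return prompt

# Hypothetical score: a crude proxy that rewards specific visual detail.
def score(p):
    return sum(kw in p for kw in ("lighting", "style", "moss"))

print(refine("A green tree",
             ["dramatic lighting", "impressionist style", "moss on the bark"],
             score))
```

The real system learns this behavior end to end rather than searching over a fixed modifier list, but the feedback-driven structure is the same.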
B. The Image-Feedback Loop (The "Quality Control")
Usually, text generators only look at other text. But this system looks at images too.
- The Analogy: Imagine a student writing an essay. A normal teacher just checks the grammar. This system is like a teacher who also checks if the essay matches the picture the student is trying to describe. If the text says "sunny day" but the picture is dark, the system fixes the text. This ensures the final prompt actually creates a good image.
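One common way to wire such an image check into training is to add an image-alignment term to the usual text loss. The sketch below uses cosine similarity between a text embedding and an image embedding as that term; the function names, the weighting `lam`, and the choice of cosine similarity are assumptions for illustration, not the paper's exact objective:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def combined_loss(text_loss, text_emb, image_emb, lam=0.5):
    # Alignment term is 0 when text and image embeddings point the
    # same way, and grows as they disagree ("sunny day" vs. dark image).
    alignment_loss = 1.0 - cosine(text_emb, image_emb)
    return text_loss + lam * alignment_loss

# Identical embeddings: the alignment term vanishes, leaving text loss only.
print(combined_loss(2.0, [1.0, 0.0], [1.0, 0.0]))  # → 2.0
```

A mismatched pair raises the loss, which pushes the text generator toward prompts that actually match the picture.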
C. The Adaptive Feature Extractor (The "Creativity Spark")
There's a risk that the translator gets too repetitive. If you ask for "a tree" ten times, it might give you the exact same "tree" description every time.
- The Analogy: This module is like a DJ who takes a single beat (your simple prompt) and remixes it into different genres (jazz, rock, classical) so you get variety. It looks at the image features and says, "Okay, let's make this tree look like a fantasy painting this time, and a photo-realistic one the next time." This keeps the results fresh and diverse.
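The "DJ remix" behavior can be sketched as sampling different style variants of the same simple prompt. The real module derives this variation from image features; here a seeded random choice over an invented style list stands in for it:

```python
import random

# Illustrative style pool; the actual system is not limited to a fixed list.
STYLES = ["fantasy painting", "photo-realistic, 8k",
          "watercolor sketch", "in the style of Van Gogh"]

def remix(prompt, k=3, seed=None):
    """Return k distinct stylistic variants of one simple prompt."""
    rng = random.Random(seed)
    return [f"{prompt}, {style}" for style in rng.sample(STYLES, k)]

for variant in remix("A tree", seed=0):
    print(variant)
```

Because each call can draw a different style mix, ten requests for "a tree" yield ten different refined prompts instead of one repeated answer.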
4. The Results
When they tested this system:
- Better Pictures: The generated images scored about 5% higher on image-quality and aesthetics measures than those from competing methods.
- More Variety: Instead of getting the same boring tree every time, you get a forest, a bonsai, a giant oak, or a glowing magical tree, all from the same simple input.
- User-Friendly: You don't need to be an expert. You just type what you want, and the system does the heavy lifting of writing the "magic spell" for the AI.
Summary
Think of this paper as building a universal remote control for AI art. Before, you had to manually program every button (write complex prompts). Now, you just press "Play" (type a simple sentence), and the remote automatically translates your command into the perfect code to get the exact picture you imagined.