DEIG: Detail-Enhanced Instance Generation with Fine-Grained Semantic Control

Imagine you are an art director for a busy movie set. You have a script that says: "In the center, a man in a red hat and blue jacket stands next to a woman in a yellow dress holding a green umbrella. Behind them, a striped dog chases a polka-dotted ball."

In the past, if you asked an AI artist to draw this, it might get the people right, but the colors would get mixed up. The man might end up with a yellow hat, the woman might have a red dress, and the dog might be solid black. The AI struggles to keep track of which details belong to which person, especially when there are many of them.

This paper introduces DEIG (Detail-Enhanced Instance Generation), a new "super-assistant" for AI artists that solves this mess. Here is how it works, broken down into simple concepts:

1. The Problem: The "Color Bleed" Effect

Think of current AI image generators like a group of painters working on a single canvas. If you tell them, "Paint a red hat here and a blue shirt there," they often get confused. The red paint might accidentally smear onto the blue shirt, or the AI might forget the hat entirely and just paint a generic person. They lack a system to say, "This detail belongs only to Person A, and this detail belongs only to Person B."

2. The Solution: DEIG's Two-Step Magic

DEIG acts like a strict project manager who organizes the painters before they even pick up a brush. It uses two main tools:

Tool A: The "Detail Extractor" (IDE)

The Analogy: Imagine the AI's brain is a giant library with millions of books (words). When you give a complex description like "a man in a red hat," the AI usually grabs a vague summary.
What DEIG does: The Instance Detail Extractor is like a super-organized librarian. It takes your long, messy sentence and breaks it down into tiny, specific "index cards." It creates a compact, high-quality summary for each person or object.
- Card 1: "Man" + "Red Hat" + "Blue Jacket."
- Card 2: "Woman" + "Yellow Dress" + "Green Umbrella."
- It ensures the AI understands exactly what each card means before it starts drawing.

Tool B: The "Detail Fusion" (DFM)

The Analogy: Now imagine the painters are back at the canvas. Without a manager, they might shout over each other, mixing their instructions.
What DEIG does: The Detail Fusion Module acts like a set of invisible, magical walls. It tells the AI: "Okay, the 'Red Hat' instruction can only touch the 'Man' area. It cannot cross the invisible line to the 'Woman' area."
This prevents "attribute leakage" (where colors or textures spill over into the wrong object). It forces the AI to keep the details strictly contained within their own "zones."

3. The Training: Learning from a Better Teacher

To teach this new system, the authors didn't just use old, simple descriptions like "a dog." They used a smart robot (a Vision-Language Model) to look at real photos and write rich, detailed descriptions of every single object.

Instead of "a car," the new training data says: "A metallic, striped red car with shiny wheels."
They also built a new test suite called DEIG-Bench. Think of this as a final exam where the AI has to draw complex scenes with many people and objects, all with specific, mixed-up colors and textures.

4. The Results: Why It Matters

When they tested DEIG against other AI models:

Old AI: Drew a scene where the man had the woman's dress, or the dog had the ball's pattern.
DEIG: Drew the scene exactly as described. The man kept his red hat, the woman kept her yellow dress, and the dog kept its stripes.

The Best Part: DEIG is "plug-and-play." You don't need to rebuild the entire AI artist from scratch. You can just snap this new "Project Manager" module onto existing AI tools, and suddenly, they become much better at following complex instructions.

Summary

DEIG is like giving an AI artist a pair of labeled folders and a set of dividers.

Labeled Folders (IDE): It sorts your complex instructions so the AI knows exactly what to do for each specific item.
Dividers (DFM): It puts up walls so the instructions for one item don't accidentally mess up the instructions for another.

The result? AI can finally draw complex, crowded scenes where every single person and object looks exactly how you described them, without the details getting mixed up.

1. Problem Statement

Multi-Instance Generation (MIG) aims to generate images containing multiple semantically distinct objects at user-specified spatial locations. While recent diffusion-based approaches have improved spatial placement and basic attribute binding, they face significant limitations in fine-grained semantic understanding:

Attribute Leakage: Existing methods often fail to prevent attributes (e.g., color, texture) from "leaking" between adjacent instances, leading to visual incoherence.
Lack of Detail: Current models struggle with complex, compositional prompts involving multiple attributes (e.g., "a red plaid shirt with gold buttons"). They typically rely on coarse-grained templates or single-attribute prompts.
Data Limitations: Training datasets often lack detailed, instance-level descriptions, preventing models from learning rich semantic-visual mappings, particularly for human-centric scenes with complex clothing combinations.

2. Methodology: DEIG Framework

The authors propose DEIG, a novel framework designed for fine-grained, controllable multi-instance generation. It functions as a plug-and-play module compatible with standard diffusion pipelines (specifically built upon GLIGEN/UNet architectures). The core components are:

A. Instance Detail Extractor (IDE)

Purpose: To transform high-dimensional text encoder embeddings into compact, instance-aware representations that capture fine-grained details.
Mechanism:
- Replaces traditional multi-modal encoders with a frozen large text encoder (e.g., T5-XL) to better capture semantic nuances.
- Uses learnable queries to distill high-dimensional text features into a compact "Aggregated Semantic Dimension" ($S$).
- Employs stacked self-attention and cross-attention layers conditioned on diffusion timesteps (via TimeMLP and AdaLN). This allows the model to align specific textual tokens with visual regions dynamically during the generation process.
- Output: Compact embeddings that represent specific attributes (color, material, texture) for each instance without overwhelming the model.
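
The query-based distillation step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it shows a single cross-attention layer only, omits the self-attention layers and the timestep conditioning (TimeMLP, AdaLN), and assumes that $S$ is the number of learnable queries. All sizes, weight matrices, and function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distill_with_queries(text_feats, queries, w_q, w_k, w_v):
    """Single-head cross-attention: queries read from text token features.

    text_feats: (T, D) token features from a frozen text encoder
    queries:    (S, d) learnable query vectors (assumption: S of them)
    Returns the (S, d) compact embedding and the (S, T) attention weights.
    """
    q = queries @ w_q        # (S, d)
    k = text_feats @ w_k     # (T, d)
    v = text_feats @ w_v     # (T, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (S, T)
    return attn @ v, attn

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
T, D, S, d = 24, 64, 16, 32
text_feats = rng.normal(size=(T, D))
queries = rng.normal(size=(S, d))
w_q = rng.normal(size=(d, d)) / np.sqrt(d)
w_k = rng.normal(size=(D, d)) / np.sqrt(D)
w_v = rng.normal(size=(D, d)) / np.sqrt(D)

compact, attn = distill_with_queries(text_feats, queries, w_q, w_k, w_v)
print(compact.shape)  # (16, 32)
```

Whatever the caption length T, the output keeps a fixed size set by the number of queries, which is what makes the representation compact.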

B. Detail Fusion Module (DFM)

Purpose: To integrate the refined instance embeddings into the generation process while preventing semantic leakage.
Mechanism:
- Grounding Embeddings Broadcast: Spatial coordinates (bounding boxes) are encoded via Fourier features and broadcast to align with the semantic dimension, fusing spatial and semantic cues.
- Instance-based Masked Attention: This is the core innovation for preventing leakage. The self-attention mechanism in the UNet is modified with a binary mask $M$:
  - Visual-Visual: Unmasked (allows global coherence).
  - Instance-Visual: Symmetric masking; an instance embedding can only attend to visual features inside its own bounding box, and those visual features can attend back to it.
  - Instance-Instance: Masked; instances cannot attend to each other's embeddings.
- This ensures that attributes defined for "Instance A" do not influence the generation of "Instance B."
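
The three masking rules above can be sketched as a boolean attention mask over a token sequence of visual tokens followed by instance tokens. This is a minimal NumPy illustration, not the paper's code. The token ordering, the use of grid-cell centers to decide box membership, and the choice to let each instance token attend to itself are all assumptions, as are the function names.

```python
import numpy as np

def build_instance_mask(boxes, grid_h, grid_w):
    """Return a (V+N, V+N) boolean matrix; True means attention is allowed.

    boxes: list of (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    Tokens are ordered as V = grid_h * grid_w visual tokens (row-major),
    then N instance tokens (assumed layout).
    """
    n_vis = grid_h * grid_w
    total = n_vis + len(boxes)
    allowed = np.zeros((total, total), dtype=bool)

    # Visual-visual: unmasked, so the image stays globally coherent.
    allowed[:n_vis, :n_vis] = True

    # Center of every grid cell, flattened in row-major order.
    ys = (np.arange(grid_h) + 0.5) / grid_h
    xs = (np.arange(grid_w) + 0.5) / grid_w
    cy, cx = np.meshgrid(ys, xs, indexing="ij")
    cy, cx = cy.ravel(), cx.ravel()

    for i, (x0, y0, x1, y1) in enumerate(boxes):
        inside = (cx >= x0) & (cx <= x1) & (cy >= y0) & (cy <= y1)
        tok = n_vis + i
        allowed[tok, :n_vis] = inside   # instance -> its own region only
        allowed[:n_vis, tok] = inside   # symmetric: region -> its instance
        allowed[tok, tok] = True        # self-attention kept (assumption)
        # Instance-instance entries for other instances stay False.
    return allowed

def masked_softmax(scores, allowed):
    # Disallowed pairs get -inf, hence zero weight after the softmax.
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Two side-by-side instances on a 4x4 latent grid.
boxes = [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)]
mask = build_instance_mask(boxes, grid_h=4, grid_w=4)
scores = np.random.default_rng(0).normal(size=mask.shape)
weights = masked_softmax(scores, mask)
```

In this toy layout the first instance token places zero attention weight on every cell of the right half and on the second instance token, which is the containment property the masking is meant to enforce.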

C. Detail-Enriched Dataset Construction

To support fine-grained training, the authors curated a high-quality dataset from MS-COCO.
Process: Used a Vision-Language Model (VLM, Qwen2.5-VL) to generate natural, compositional captions (20–30 words) for cropped instances.
Filtering: Applied CLIP scoring and human verification to ensure consistency and remove low-fidelity or hallucinated descriptions.
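
As a rough illustration of the automatic part of this filtering, the sketch below keeps only captions whose precomputed image-text similarity clears a threshold. The record fields, the example captions and scores, and the threshold value are all hypothetical; the source does not state the cutoff used, and the human verification step is not modeled.

```python
from typing import Dict, List

def filter_captions(records: List[Dict], min_score: float) -> List[Dict]:
    """Keep records with a non-empty caption and a CLIP score >= min_score.

    Each record is assumed to carry a precomputed 'clip_score' between the
    cropped instance and its generated caption (hypothetical field name).
    """
    return [
        r for r in records
        if r["caption"].strip() and r["clip_score"] >= min_score
    ]

# Made-up records for illustration only.
records = [
    {"instance_id": 1, "caption": "a metallic red car with shiny wheels", "clip_score": 0.31},
    {"instance_id": 2, "caption": "a wooden bench beside a lake", "clip_score": 0.12},
    {"instance_id": 3, "caption": "", "clip_score": 0.40},
]

kept = filter_captions(records, min_score=0.25)  # threshold is an assumption
print([r["instance_id"] for r in kept])  # [1]
```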

3. Key Contributions

DEIG Framework: A novel architecture integrating an Instance Detail Extractor (IDE) and a Detail Fusion Module (DFM) that enables precise, attribute-consistent generation of multiple instances.
DEIG-Bench: A new benchmark specifically designed to evaluate fine-grained, multi-attribute generation.
- Includes human-centric tasks (evaluating color combinations across clothing regions) and object-centric tasks (evaluating color, material, and texture combinations).
- Features compositional prompts and structured difficulty levels (C1–C3 for humans, L1–L4 for objects).
State-of-the-Art Performance: Demonstrated significant improvements over existing methods (GLIGEN, MIGC, InstanceDiffusion, ROICtrl) in spatial consistency, semantic accuracy, and compositional generalization.
Plug-and-Play Design: The method can be integrated into existing diffusion pipelines with minimal overhead, requiring no retraining of the base model.

4. Experimental Results

The paper evaluates DEIG on DEIG-Bench, MIG-Bench, and InstDiff-Bench.

Quantitative Performance:
- DEIG-Bench: DEIG achieved a Multi-Attribute Accuracy (MAA) of 0.75 for humans and 0.44 for objects (evaluated with Qwen2.5-VL), compared with 0.31 and 0.39 for the next-best method, ROICtrl. The gains were largest on complex color combinations.
- MIG-Bench: Achieved an Instance Success Rate of 72.25% (vs. ~65% for baselines) and high mIoU scores, indicating superior spatial and attribute alignment.
- InstDiff-Bench: Showed strong performance in accuracy and CLIP alignment, though spatial precision (AP) was slightly lower in extremely dense scenes due to the strict masking strategy.
Ablation Studies:
- Removing the IDE caused a drop in semantic alignment.
- Removing the DFM (masking) led to attribute leakage and reduced accuracy.
- Removing detailed captions caused the largest performance drop, highlighting the importance of fine-grained supervision.
Efficiency: The aggregated semantic dimension $S$ was found to have an optimal range of 16–32, balancing precision and GPU memory usage.
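
For readers unfamiliar with the mIoU figure mentioned for MIG-Bench, the sketch below shows the standard intersection-over-union computation for axis-aligned boxes and its mean over paired boxes. How the benchmark obtains and matches detected boxes is not described here, so the pairing of targets with detections is an assumption and the function names are illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    # Overlap width and height, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(targets, detections):
    """Average IoU over already-paired target and detected boxes."""
    if not targets:
        return 0.0
    pairs = zip(targets, detections)
    return sum(box_iou(t, d) for t, d in pairs) / len(targets)

# Two unit-overlap squares: intersection 1, union 7.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```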

5. Significance and Impact

Advancing Controllable Generation: DEIG bridges the gap between simple spatial control and complex semantic understanding, enabling the generation of scenes with rich, localized details (e.g., specific clothing patterns on multiple people).
Benchmarking: By introducing DEIG-Bench, the paper addresses the lack of rigorous evaluation standards for fine-grained, multi-attribute generation, particularly for human subjects.
Practical Application: The plug-and-play nature of DEIG makes it immediately applicable to industries requiring high-fidelity visual synthesis, such as fashion design, advertising, and animation, where precise control over multiple elements is crucial.
Future Directions: The authors note limitations in handling extremely dense scenes with heavy overlap and small objects, suggesting future work in improving spatial disentanglement in cluttered environments.

In summary, DEIG represents a significant step forward in text-to-image generation: it substantially reduces "attribute leakage" and enables the synthesis of complex, multi-instance scenes with high semantic fidelity, with dense, heavily overlapping scenes remaining an open challenge.