Dual Diffusion Models for Multi-modal Guided 3D Avatar Generation

Imagine you want to create a custom 3D character for a video game or a movie, but you don't have a camera crew, a studio, or a team of artists. You just have a few sentences describing the person or a single photo.

This paper introduces PromptAvatar, a new "magic machine" that turns those simple descriptions or photos into a high-quality, ready-to-use 3D character in under 10 seconds.

Here is how it works, explained through everyday analogies:

1. The Problem: The Old Ways Were Slow or Clunky

Before this, creating 3D characters from text or photos was like trying to sculpt a statue by guessing.

The "Guessing Game" (Text-to-3D): Old methods would take a text description (e.g., "a man with a beard") and try to sculpt a 3D face by constantly checking a dictionary (CLIP) and adjusting the clay over and over again. It was slow, often resulted in smooth, boring faces, and couldn't handle tiny details like a specific type of eyebrow or a scar.
The "Scarcity" Problem (Image-to-3D): Other methods tried to learn from photos, but they needed thousands of perfect, studio-lit photos of real people. Getting these photos is expensive and hard, so the AI didn't learn enough to make good characters for everyone.

2. The Solution: Building a Massive "Recipe Book"

The authors realized the AI needed a better teacher. So, they built a giant, automated kitchen to create a massive dataset (a "recipe book") of more than 100,000 unique characters.

The Ingredients: They didn't just take photos. They used AI to generate four things for every character:
1. The Face Shape: The 3D geometry of the face, stored as a compact set of shape numbers.
2. The Skin Texture: A flat, evenly lit map of the skin (like a wallpaper for the face) with no shadows or highlights baked in by the original lighting.
3. The Photo: A realistic photo of the face in the wild (with messy hair, sunglasses, or weird lighting).
4. The Description: A detailed text description written by a super-smart AI (Qwen) that notices tiny details like "olive skin," "crow's feet," or "a rounded chin."

Think of this dataset as a library where every book contains a photo, a 3D model, a flat skin map, and a detailed biography, all perfectly matched.

3. The Engine: The "Dual Diffusion" Chef

With this massive library, they built PromptAvatar, which uses two specialized "chefs" (Diffusion Models) working together:

Chef 1: The Skin Artist (Texture Diffusion Model)
- What it does: It paints the skin.
- How it works: If you give it a text prompt ("a woman with freckles"), it paints the skin map. If you give it a photo, it looks at the photo, peels off the skin texture, and uses that as a guide to paint a perfect, shadow-free skin map. It can even take a photo and a text prompt together to get the best of both worlds.
- The Magic: It doesn't just guess; it learns the direct link between "words/photos" and "skin patterns."
Chef 2: The Sculptor (Geometry Diffusion Model)
- What it does: It shapes the face.
- How it works: It takes a text prompt (e.g., "a square jaw") and directly sculpts the 3D shape of the head. It doesn't need to guess and check; it has learned how words map to a 3D mesh.

4. The Result: Instant, High-Fidelity Avatars

Because these chefs learned directly from the massive dataset, they don't need to "guess and check" anymore.

Speed: It takes less than 10 seconds to generate a character.
Quality: The characters have high-frequency details (pores, wrinkles, specific beard shapes) that older methods smoothed out.
Flexibility: You can change the character's age or gender, or add a beard, just by changing a few words in the prompt, and the AI regenerates the 3D model in seconds.

The Bottom Line

Think of PromptAvatar as a 3D printer for human faces. Instead of needing a factory full of scanners and artists, you just type a description or upload a photo, and within seconds the machine prints a detailed, animatable 3D character ready for movies, games, or VR. It solves the old problems of being too slow, too expensive, or too blurry by learning from a massive, self-made library of perfect examples.

1. Problem Statement

Generating high-fidelity, animatable 3D avatars from text or image prompts is a critical challenge for VR, AR, and human-computer interaction. Existing methods face significant limitations:

Text-to-3D Methods: Rely on iterative Score Distillation Sampling (SDS) or CLIP optimization. These approaches suffer from slow inference (minutes to hours), produce overly smooth results that lack fine-grained detail, and struggle with precise semantic control over facial attributes.
Image-to-3D Methods: Are bottlenecked by the scarcity and high cost of acquiring high-quality 3D facial scans. Existing datasets often lack "light-normalized" textures (free from shadows and occlusions), making them unsuitable for rendering under new lighting conditions. Furthermore, many methods rely on linear coefficients that fail to capture high-frequency facial details (e.g., pores, wrinkles).
Data Scarcity: There is a lack of large-scale datasets pairing fine-grained textual descriptions with high-quality 3D geometry and light-normalized texture UV maps.

2. Methodology: PromptAvatar

The authors propose PromptAvatar, a framework that eliminates iterative optimization by learning a direct mapping from multi-modal prompts to 3D representations using Dual Diffusion Models.

A. Large-Scale Dataset Construction

To train the models, the authors constructed a novel dataset of 100,000+ samples, each comprising four modalities:

Fine-grained Text Descriptions: Generated automatically using Qwen2.5-VL-32B-Instruct and structured into three categories: demographics ($P_{base}$), texture details ($P_{tex}$), and geometry ($P_{geo}$).
In-the-Wild Face Images: Diverse images with complex lighting and poses.
Light-Normalized Texture UV Maps: High-quality, evenly illuminated textures free of occlusions.
3D Geometric Identity Coefficients: Parameters representing facial shape.
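
One way to picture a single dataset sample is as a small record holding the four modalities. The sketch below is illustrative only: the class name, field names, and file paths are hypothetical, and routing $P_{base}$ plus $P_{tex}$ to the texture model is an assumption (the summary only states that $P_{geo}$ feeds the geometry model).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AvatarSample:
    """One dataset record. All field names here are hypothetical."""
    p_base: str                   # demographics text (P_base)
    p_tex: str                    # texture-detail text (P_tex)
    p_geo: str                    # geometry text (P_geo)
    image_path: str               # in-the-wild face image
    uv_texture_path: str          # light-normalized UV texture map
    identity_coeffs: List[float]  # 3D geometric identity coefficients

    def texture_prompt(self) -> str:
        # Assumption: the texture model sees demographics plus texture details.
        return f"{self.p_base} {self.p_tex}".strip()

    def geometry_prompt(self) -> str:
        # Per the summary, the geometry model is driven by P_geo.
        return self.p_geo


sample = AvatarSample(
    p_base="A middle-aged man with olive skin.",
    p_tex="Crow's feet and a short beard.",
    p_geo="A rounded chin and a square jaw.",
    image_path="images/000001.png",
    uv_texture_path="uv/000001.png",
    identity_coeffs=[0.1, -0.3, 0.7],
)
print(sample.texture_prompt())
print(sample.geometry_prompt())
```

Keeping the three text categories as separate fields makes it easy to send each model only the part of the description it is meant to use.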

Pipeline:

De/Relighting: Uses NeRFFaceLighting to synthesize de-lighted frontal/side views and relit images under varied spherical harmonics lighting.
Texture Correction: Unwraps images into UV maps, applies blending and color correction using predefined masks and templates (inspired by FFHQ-UV) to create clean, normalized textures.
Filtering: A dual-stage filter removes artifacts (using an MLP aesthetic classifier) and ensures semantic alignment (using CLIP cosine similarity between images and text).
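
The dual-stage filter can be sketched as two checks applied in order. This is a minimal sketch under stated assumptions: the embeddings and the aesthetic score are taken as precomputed inputs (in practice they would come from CLIP and the MLP classifier), and both threshold values are hypothetical, not figures from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as "no similarity"
    return dot / (norm_a * norm_b)


def keep_sample(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    aesthetic_score: float,
    min_aesthetic: float = 0.5,   # hypothetical threshold
    min_similarity: float = 0.25,  # hypothetical threshold
) -> bool:
    """Stage 1 rejects artifacts; stage 2 rejects image/text mismatches."""
    if aesthetic_score < min_aesthetic:
        return False
    return cosine_similarity(image_emb, text_emb) >= min_similarity


# Toy 2-D "embeddings" stand in for real CLIP features.
print(keep_sample([1.0, 0.0], [1.0, 0.0], aesthetic_score=0.9))
print(keep_sample([1.0, 0.0], [0.0, 1.0], aesthetic_score=0.9))
```

Running the cheap aesthetic check first means the similarity is only computed for samples that already look clean.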

B. Dual Diffusion Architecture

The framework consists of two specialized diffusion models operating in latent spaces:

Texture Diffusion Model (TDM):
- Goal: Generates high-resolution, light-normalized UV texture maps.
- Input: Supports Text Prompts (via CLIP text encoder) and/or Image Prompts (wild face images).
- Mechanism: Uses a Latent Diffusion Model (LDM) in UV space. For image prompts, the system extracts an incomplete UV texture from the input image, encodes it, and concatenates it with the noisy latent variable. This ensures identity consistency.
- Training: Employs classifier-free guidance with random dropping of text/image prompts to support single or multi-modal conditioning.
Geometry Diffusion Model (GDM):
- Goal: Generates neutral facial geometry (3DMM identity coefficients).
- Input: Text prompts describing facial shape ($P_{geo}$).
- Mechanism: Operates in 1D coefficient space using a custom 1D UNet (ID-UNet). It integrates cross-attention layers to inject textual semantic guidance directly into the geometric denoising process.
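
The random prompt dropping used to train the TDM, and the way classifier-free guidance combines predictions at sampling time, can be sketched as follows. This is a generic illustration of the standard technique, not the authors' code: the function names, drop probabilities, and guidance scale are all hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def drop_conditions(
    text: Optional[str],
    image: Optional[object],
    p_drop_text: float = 0.1,   # hypothetical drop probability
    p_drop_image: float = 0.1,  # hypothetical drop probability
    rng: random.Random = random.Random(0),
) -> Tuple[Optional[str], Optional[object]]:
    """Training time: independently blank out each prompt so the model
    also learns text-only, image-only, and unconditional denoising."""
    text_out = None if rng.random() < p_drop_text else text
    image_out = None if rng.random() < p_drop_image else image
    return text_out, image_out


def guided_noise(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Sampling time: standard classifier-free guidance,
    eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


print(drop_conditions("a woman with freckles", "partial_uv", 1.0, 1.0))
print(guided_noise([0.0, 1.0], [1.0, 1.0], scale=2.0))
```

With a scale of 1 the guided prediction equals the conditional one; larger scales push the sample further toward the prompt.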

Inference: The system generates the geometry and texture in parallel (or sequentially) and combines them to produce a final 3D avatar compatible with rendering engines like Blender. The entire process takes under 10 seconds.
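
To turn generated identity coefficients into a mesh, a 3DMM typically adds a weighted sum of basis shapes to a mean face. The summary does not say which 3DMM is used, so the sketch below assumes the standard linear form with toy numbers; the function name and data are hypothetical.

```python
from typing import List, Sequence


def reconstruct_vertices(
    mean_shape: Sequence[float],
    basis: Sequence[Sequence[float]],
    coeffs: Sequence[float],
) -> List[float]:
    """Linear 3DMM: vertices = mean_shape + sum_k coeffs[k] * basis[k].

    Shapes are flattened (x, y, z, x, y, z, ...) coordinate lists.
    """
    if len(basis) != len(coeffs):
        raise ValueError("need exactly one coefficient per basis shape")
    vertices = list(mean_shape)
    for weight, direction in zip(coeffs, basis):
        if len(direction) != len(mean_shape):
            raise ValueError("basis shape does not match mean shape length")
        for i, delta in enumerate(direction):
            vertices[i] += weight * delta
    return vertices


# Toy face with two vertices and two identity directions.
mean = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
basis = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],  # shifts both vertices along x
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],  # lifts the first vertex along y
]
print(reconstruct_vertices(mean, basis, [0.5, 2.0]))
```

The resulting vertices, paired with the generated UV texture, are what a renderer such as Blender would consume.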

3. Key Contributions

Novel Dataset: Creation of a low-cost, reproducible, large-scale dataset (100k+ samples) linking fine-grained text, wild images, normalized UV textures, and 3D geometry.
PromptAvatar Framework: A dual-diffusion architecture that enables direct, non-iterative generation of high-fidelity 3D avatars from text and/or images.
Efficiency & Quality: Achieves generation in <10 seconds while preserving high-frequency details (pores, wrinkles) and fine-grained semantic control, outperforming iterative SDS/CLIP methods.
Light Normalization: The generated UV maps are explicitly light-normalized, allowing for realistic relighting in downstream applications.

4. Experimental Results

The authors conducted extensive quantitative and qualitative evaluations:

Dataset Quality: Compared to FFHQ-UV, the new dataset shows significantly higher Identity Similarity across multi-view images and lower Brightness Symmetry (BS) Error, indicating better illumination normalization.
Text-to-Avatar:
- CLIP Score: Achieved 21.14, outperforming DreamFusion (20.16), Describe3D (19.81), and DreamFace (20.56).
- Inference Time: roughly 10 s vs. 300 s (DreamFace) and 2400 s (DreamFusion).
- Qualitative: Demonstrated superior alignment with fine-grained prompts (e.g., specific beard shapes, eyebrow styles) compared to the smooth, averaged results of SDS-based methods.
Image-to-Avatar:
- Outperformed FFHQ-UV and FlameTex in identity similarity on FFHQ and CelebAMask-HQ datasets.
- Preserved high-frequency details (crow's feet, skin texture) better than GAN-based or linear model approaches.
Ablation Studies:
- Confirmed that conditioning on ArcFace identity embeddings alone is insufficient; injecting the incomplete UV texture directly into the diffusion latent space yields better identity preservation.
- Demonstrated that fine-tuning the VAE on UV maps (rather than using a pre-trained natural image VAE) is crucial for recovering realistic high-frequency details.
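
The inference times reported for the text-to-avatar comparison translate into speedup factors with simple arithmetic. The sketch below uses only the three timings quoted above and treats 10 s as PromptAvatar's time; the variable names are illustrative.

```python
# Reported inference times in seconds, taken from the comparison above.
inference_seconds = {
    "PromptAvatar": 10.0,
    "DreamFace": 300.0,
    "DreamFusion": 2400.0,
}


def speedup(baseline_seconds: float, ours_seconds: float) -> float:
    """How many times faster 'ours' is than the baseline."""
    return baseline_seconds / ours_seconds


ours = inference_seconds["PromptAvatar"]
speedups = {
    name: speedup(seconds, ours)
    for name, seconds in inference_seconds.items()
    if name != "PromptAvatar"
}
print(speedups)  # {'DreamFace': 30.0, 'DreamFusion': 240.0}
```

Since generation is described elsewhere as taking under 10 seconds, these factors are best read as lower bounds.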

5. Significance

This work shifts how 3D avatars are generated in several ways:

From Iterative to Direct: It moves away from computationally expensive iterative optimization (SDS) to a direct, feed-forward diffusion approach.
Democratization: By reducing generation time to seconds and removing the need for expensive 3D scanning, it lowers the barrier for creating personalized digital assets.
Versatility: The ability to handle both text and image prompts, along with the production of light-normalized assets, makes the output immediately usable in physically-based rendering (PBR) pipelines for games, film, and VR.
Open Science: The authors plan to release the dataset and model, fostering further research in multi-modal 3D generation.
