Dual Diffusion Models for Multi-modal Guided 3D Avatar Generation

The paper introduces PromptAvatar, a framework utilizing dual diffusion models trained on a novel large-scale multi-modal dataset to generate high-fidelity, shading-free 3D avatars from text or image prompts in under 10 seconds, overcoming the slow inference and limited generalization of existing methods.

Hong Li, Yutang Feng, Minqi Meng, Yichen Yang, Xuhui Liu, Baochang Zhang

Published 2026-03-05

Imagine you want to create a custom 3D character for a video game or a movie, but you don't have a camera crew, a studio, or a team of artists. You just have a few sentences describing the person or a single photo.

This paper introduces PromptAvatar, a new "magic machine" that turns those simple descriptions or photos into a high-quality, ready-to-use 3D character in under 10 seconds.

Here is how it works, explained through everyday analogies:

1. The Problem: The Old Ways Were Slow or Clunky

Before this, creating 3D characters from text or photos was like trying to sculpt a statue by guessing.

  • The "Guessing Game" (Text-to-3D): Old methods would take a text description (e.g., "a man with a beard") and try to sculpt a 3D face by constantly checking a dictionary (CLIP) and adjusting the clay over and over again. It was slow, often resulted in smooth, boring faces, and couldn't handle tiny details like a specific type of eyebrow or a scar.
  • The "Scarcity" Problem (Image-to-3D): Other methods tried to learn from photos, but they needed thousands of perfect, studio-lit photos of real people. Getting these photos is expensive and hard, so the AI didn't learn enough to make good characters for everyone.

2. The Solution: Building a Massive "Recipe Book"

The authors realized the AI needed a better teacher. So, they built a giant, automated kitchen to create a massive dataset (a "recipe book") of 100,000 unique characters.

  • The Ingredients: They didn't just take photos. They used AI to generate four things for every character:
    1. The Face Shape: The 3D geometry of the head (the mesh that defines its shape).
    2. The Skin Texture: A flat, evenly lit map of the skin (like a wallpaper for the face) with no shadows or highlights baked in by the lighting.
    3. The Photo: A realistic photo of the face in the wild (with messy hair, sunglasses, or weird lighting).
    4. The Description: A detailed text description written by a super-smart AI (Qwen) that notices tiny details like "olive skin," "crow's feet," or "a rounded chin."

Think of this dataset as a library where every book contains a photo, a 3D model, a flat skin map, and a detailed biography, all perfectly matched.
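To make the "library book" idea concrete, here is a minimal sketch of what one entry might look like in code. The class name, field names, file paths, and the completeness check are all hypothetical and chosen for illustration; the paper's actual storage format is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AvatarSample:
    """One hypothetical dataset entry: four matched pieces for a single character."""
    geometry: List[float]  # face shape, e.g. mesh coordinates (representation assumed)
    texture_path: str      # flat, shading-free skin map
    photo_path: str        # realistic in-the-wild photo
    caption: str           # detailed text description


def is_complete(sample: AvatarSample) -> bool:
    """Return True only if all four pieces are present for this character."""
    return (
        bool(sample.geometry)
        and bool(sample.texture_path)
        and bool(sample.photo_path)
        and bool(sample.caption.strip())
    )


# Illustrative values only; these are not taken from the real dataset.
sample = AvatarSample(
    geometry=[0.12, -0.40, 0.95],
    texture_path="textures/000001.png",
    photo_path="photos/000001.jpg",
    caption="olive skin, crow's feet, a rounded chin",
)
```

Keeping the four pieces in one record is what lets a model learn that a given description, photo, shape, and skin map all belong to the same face.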

3. The Engine: The "Dual Diffusion" Chef

With this massive library, they built PromptAvatar, which uses two specialized "chefs" (Diffusion Models) working together:

  • Chef 1: The Skin Artist (Texture Diffusion Model)

    • What it does: It paints the skin.
    • How it works: If you give it a text prompt ("a woman with freckles"), it paints the skin map. If you give it a photo, it looks at the photo, peels off the skin texture, and uses that as a guide to paint a perfect, shadow-free skin map. It can even take a photo and a text prompt together to get the best of both worlds.
    • The Magic: It doesn't just guess; it learns the direct link between "words/photos" and "skin patterns."
  • Chef 2: The Sculptor (Geometry Diffusion Model)

    • What it does: It shapes the face.
    • How it works: It takes a text prompt (e.g., "a square jaw") and generates the 3D shape of the head directly. Because it learned the link between words and shapes from the dataset, it does not have to refine a guess over and over.
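The division of labor between the two chefs can be sketched roughly as follows. This is a toy outline, not the authors' implementation: the two model functions are stand-ins that return labeled strings instead of running real diffusion models, and every name here is hypothetical. The summary above does not say how the face shape is obtained from a photo alone, so the sketch simply leaves the geometry empty in that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Avatar:
    texture: str             # placeholder for a shading-free skin map
    geometry: Optional[str]  # placeholder for a 3D mesh; None if no text was given


def texture_model(text: Optional[str], image: Optional[str]) -> str:
    """Stand-in for Chef 1: accepts a text prompt, a photo, or both."""
    if text is None and image is None:
        raise ValueError("provide a text prompt, a photo, or both")
    conditions = [c for c in (text, image) if c is not None]
    return "texture[" + " + ".join(conditions) + "]"


def geometry_model(text: str) -> str:
    """Stand-in for Chef 2: turns a text prompt into a face shape."""
    return "mesh[" + text + "]"


def generate_avatar(text: Optional[str] = None, image: Optional[str] = None) -> Avatar:
    """Run both stand-in models and bundle their outputs into one avatar."""
    texture = texture_model(text, image)
    # Assumption: geometry is only produced when a text prompt is available.
    geometry = geometry_model(text) if text is not None else None
    return Avatar(texture=texture, geometry=geometry)


avatar = generate_avatar(text="a woman with freckles and a square jaw", image="selfie.jpg")
```

The point of the split is that each model handles one job: the skin map and the head shape are produced separately and then combined into a single character.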

4. The Result: Instant, High-Fidelity Avatars

Because these chefs learned directly from the massive dataset, they don't need to "guess and check" anymore.

  • Speed: It takes less than 10 seconds to generate a character.
  • Quality: The characters have high-frequency details (pores, wrinkles, specific beard shapes) that older methods smoothed out.
  • Flexibility: You can change the character's age, gender, or add a beard just by changing a few words in the prompt, and the AI regenerates the 3D model in seconds.

The Bottom Line

Think of PromptAvatar as a 3D printer for human faces. Instead of needing a factory full of scanners and artists, you just type a description or upload a photo, and in under 10 seconds the machine prints a detailed, animated 3D character ready for movies, games, or VR. It addresses the weaknesses of older methods (too slow, too expensive, too blurry) by learning from a massive, self-made library of perfect examples.