InstructHumans: Editing Animated 3D Human Textures with Instructions

Imagine you have a digital 3D puppet of a person. You want to change their outfit or give them a new hairstyle just by typing a sentence like, "Put him in a traditional Japanese kimono."

This is the goal of InstructHumans, a new technology that lets you edit 3D characters using simple text instructions. But here's the catch: previous attempts to do this were like trying to repaint a portrait while accidentally erasing the person's face or turning their clothes into a blurry mess.

Here is how the paper solves this problem, explained through simple analogies.

The Problem: The "Over-Eager" Painter

To understand the breakthrough, we first need to understand the old method.

Imagine you have a 3D character (let's call him Bob). You want to change his shirt to a suit.

The Old Way (Standard SDS): You hire a painter who is an expert at creating new art from scratch. You tell him, "Paint a man in a suit."
- The Result: The painter ignores Bob entirely. He paints a new man in a suit. Bob's original face, his specific nose shape, and his original pants might get wiped out or turned into a blurry, unrecognizable mess. The painter is too focused on the instruction and forgets to respect the original subject.

In technical terms, the old method used a technique called Score Distillation Sampling (SDS). It was designed for generating new things, not editing existing ones. When applied to editing, it destroyed the consistency of the original character.

The Solution: The "Smart Editor" (InstructHumans)

The authors created a new framework called InstructHumans. Instead of hiring a painter who starts from a blank canvas, they hired a Smart Editor who knows exactly what to keep and what to change.

Here are the three main "tricks" they used to make this work:

1. The "Time-Traveling" Editor (SDS-E)

The old painter tried to do everything at once, which caused chaos. The new editor, SDS-E, realizes that editing is a process that happens in stages, like developing a photo.

Early Stage (The Rough Draft): When you first start editing, you need to make big changes. The editor focuses on the "big picture" instructions (e.g., "Change the clothes").
Late Stage (The Details): As the edit gets closer to the final look, the editor stops making big changes and focuses on fine details (e.g., "Make the fabric texture look like silk").
The Magic: The paper shows that if you use the "big picture" rules during the "detail" phase, you ruin the image. SDS-E acts like a conductor, telling the editor which rules to follow at which time. This ensures the character's face stays recognizable while the clothes change perfectly.

2. The "Spotlight" Strategy (Gradient-Aware Sampling)

Imagine you are editing a photo of a person.

If you tell the computer, "Put on sunglasses," it should focus 100% of its energy on the face.
If you say, "Put on a kimono," it should focus on the body.

Old methods treated every part of the body equally, wasting time trying to edit the feet when you only wanted to change the face.
InstructHumans uses a Spotlight Strategy. It works out which part of the body the instruction is actually changing and shines a "spotlight" on that area. It spends more time rendering and editing that specific spot, making the process faster and the result much sharper.

3. The "Smoothie" Blender (Smoothness Regularizer)

Sometimes, when you edit a 3D model, the texture can look "spotty" or noisy, like a TV with bad static.

The Fix: The authors added a Smoothness Blender. Think of it like a blender that takes the texture at one point on the character's surface and mixes it slightly with its neighbors. This ensures that if you paint a red shirt, the red is smooth and consistent, not a patchwork of red and white specks. It keeps the texture looking high-quality and realistic.

The Result: A Faithful Transformation

When you put all these pieces together, you get InstructHumans.

Before: You type "Turn him into a clown," and the result is a blurry, scary mess that doesn't look like the original person.
After: You type "Turn him into a clown," and the result is the exact same person (same face, same body shape) but now with a clown nose and red hair. The animation still works perfectly; he can still dance and jump without his clothes glitching out.

Why This Matters

This technology is a bridge between generative AI (making new things) and editing AI (changing existing things). It proves that you don't have to sacrifice the identity of a character to change their look. Whether you are a game developer wanting to quickly swap a character's outfit, or a filmmaker needing to age a character by 20 years, this tool allows for intuitive, high-quality, and consistent 3D editing.

In short: It's like having a magical tailor who can completely re-dress your favorite 3D character based on a text message, without ever losing the character's unique personality or face.

What follows is a more detailed technical summary of the paper "InstructHumans: Editing Animated 3D Human Textures with Instructions".

1. Problem Statement

The paper addresses the challenge of text-guided editing of animatable 3D human avatars. While recent advancements in 2D diffusion models (like InstructPix2Pix) allow for intuitive image editing, applying these to 3D human textures presents unique difficulties:

Consistency vs. Change: Unlike 3D generation (where a model starts from random noise), 3D editing starts with an existing source avatar. The goal is to modify specific features (e.g., clothing, makeup) based on text instructions while strictly preserving the original identity, geometry, and unedited textures.
Failure of Standard SDS: Existing methods often apply Score Distillation Sampling (SDS) directly. SDS was designed for generation, not editing. When naively applied to editing, it causes:
- Loss of Consistency: The avatar's identity (face, original clothes) is distorted or destroyed.
- Blurriness and Artifacts: The resulting textures often become blurry, spotty, or over-saturated.
- Conflict of Objectives: The guidance signal in standard SDS conflicts with the requirement to maintain the source structure, leading to poor-quality edits.
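
For reference, the standard SDS update behind these failure modes can be written as follows. This is a sketch in the notation of the original DreamFusion formulation; adding the source image $I$ as a second condition is an assumption made here to match the dual-conditioning setup, and the symbols $\theta$, $x$, $x_t$, $w(t)$, and $\hat{\epsilon}_\phi$ are introduced only for illustration.

$$
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[\, w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, I,\, y,\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right]
$$

Here $x$ is an image rendered from the 3D parameters $\theta$, $x_t$ is that image with Gaussian noise $\epsilon$ added at timestep $t$, $\hat{\epsilon}_\phi$ is the diffusion model's noise prediction, and $w(t)$ is a timestep-dependent weight. Nothing in this expression ties the optimization to the source avatar beyond the conditioning itself, which is consistent with the loss of consistency described above.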

2. Methodology: InstructHumans Framework

The authors propose InstructHumans, a framework that integrates a modified distillation process with a hybrid 3D human representation.

A. Modified Score Distillation Sampling for Editing (SDS-E)

The core innovation is SDS-E, a customized version of SDS tailored for editing. The authors decompose the standard SDS guidance signal into individual terms and analyze their impact across different diffusion timesteps:

Decomposition: The guidance is broken down into four terms ($m_1, m_2, m_3, m_4$) based on dual conditioning (source image $I$ + text instruction $y$).
- $m_1$ (Baseline-shift): Shifts away from the source image distribution.
- $m_3$ (Condition-divergence): Adjusts from image-only to text+image.
- $m_4$ (Full-condition): Guides toward the joint text-image mode.
Temporal Staging: The authors found that these terms have conflicting effects depending on the diffusion timestep ($t$):
- Large Timesteps: The guidance terms are harmful here because they disrupt the original structure. Action: Removed entirely.
- Middle Timesteps: $m_4$ alone causes "intermediate mode traps" (over-smoothing). Action: Combine $m_3$ and $m_4$ to escape traps while balancing conditions.
- Small Timesteps: $m_1$ causes saturation; $m_3$ may distance the result too far from the source. Action: Selectively apply terms to ensure high-fidelity details without losing identity.
Result: SDS-E selectively applies these terms based on a non-increasing timestep schedule, ensuring the model moves toward the desired edit while anchoring to the source avatar.
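
The staging logic above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the linear annealing, the thresholds `mid_lo` and `mid_hi`, and the choice of which terms to keep at small timesteps (`small_terms`) are all hypothetical, since the summary specifies neither the boundaries nor the exact small-timestep selection.

```python
import numpy as np

def timestep_schedule(num_iters, t_start, t_end):
    """Non-increasing diffusion timesteps, one per optimization iteration.

    Linear annealing is an assumption; the text only says "non-increasing".
    """
    return np.linspace(t_start, t_end, num_iters).round().astype(int)

def active_terms(t, mid_lo, mid_hi, small_terms):
    """Names of the guidance terms applied at timestep t.

    mid_lo and mid_hi are hypothetical stage boundaries.
    """
    if t > mid_hi:                      # large timesteps: every term removed
        return frozenset()
    if t >= mid_lo:                     # middle timesteps: combine m3 and m4
        return frozenset({"m3", "m4"})
    return frozenset(small_terms)       # small timesteps: caller-chosen subset

def combine(terms, active):
    """Sum the active guidance terms (arrays that share one shape)."""
    total = np.zeros_like(next(iter(terms.values())))
    for name in active:
        total = total + terms[name]
    return total

# Toy usage with stand-in arrays in place of real noise-prediction terms.
terms = {"m1": np.ones(2), "m3": 2 * np.ones(2), "m4": 3 * np.ones(2)}
for t in timestep_schedule(5, t_start=600, t_end=100):
    guidance = combine(terms, active_terms(t, 300, 700, small_terms={"m4"}))
```

Keeping the stage boundaries and the small-timestep subset as explicit arguments makes the parts that this summary leaves unspecified visible at the call site.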

B. Hybrid 3D Human Representation

The framework utilizes the EditableHumans representation:

Structure: Combines an explicit SMPL-X mesh (for animation and pose) with an implicit NeRF (for texture and geometry).
Latent Codes: Each mesh vertex is associated with local geometry and texture latent codes. This allows for localized texture edits while maintaining the ability to animate the avatar by changing pose parameters.
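
One way to picture the per-vertex latent codes is as two arrays indexed by mesh vertex, where a texture edit touches only one of them. The sketch below is an assumption-laden illustration: the class `AvatarLatents`, the function `edit_texture`, and the array sizes are hypothetical and do not come from the EditableHumans code.

```python
from dataclasses import dataclass, replace
import numpy as np

@dataclass(frozen=True)
class AvatarLatents:
    geometry_codes: np.ndarray  # shape (num_vertices, geo_dim)
    texture_codes: np.ndarray   # shape (num_vertices, tex_dim)

def edit_texture(avatar, delta, vertex_mask):
    """Add delta to the texture codes at masked vertices only.

    Geometry codes are passed through unchanged, so shape and pose
    handling are unaffected by the edit.
    """
    new_texture = avatar.texture_codes + delta * vertex_mask[:, None]
    return replace(avatar, texture_codes=new_texture)

# Toy usage: 4 vertices, edit only vertices 0 and 3.
avatar = AvatarLatents(np.zeros((4, 3)), np.ones((4, 2)))
mask = np.array([True, False, False, True])
edited = edit_texture(avatar, np.full((4, 2), 0.5), mask)
```

Because the codes live on mesh vertices, they move with the mesh when the pose parameters change, which is what keeps an edited avatar animatable.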

C. Optimization Enhancements

To further improve quality and efficiency, two additional components are introduced:

Laplacian Smoothness Regularization:
- Problem: SDS optimization often introduces high-frequency noise and "spotting" due to inconsistent multi-view supervision.
- Solution: A Laplacian constraint is applied to the latent codes to enforce spatial coherence between neighboring vertices, reducing artifacts and flickering.
Gradient-Aware Viewpoint Sampling:
- Problem: Uniform camera sampling wastes computation on regions not requiring edits (e.g., editing a face while sampling the back of the body).
- Solution: The system calculates the gradient magnitude for different body regions (Face, Head, Body, Arms) and dynamically allocates more camera views to regions with high "editing strength." This accelerates convergence and improves editing specificity.
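
Both enhancements can be sketched with NumPy. This is a hedged illustration rather than the paper's formulation: it assumes a uniform (unweighted) graph Laplacian over mesh edges, and it assumes that sampling probabilities are simply proportional to per-region gradient magnitudes with a small floor. The function names and the `floor` parameter are hypothetical.

```python
import numpy as np

def laplacian_smoothness(codes, edges):
    """Mean squared distance between each vertex code and its neighbor mean.

    codes: (num_vertices, dim) latent codes; edges: undirected (i, j) pairs,
    each listed once. Vertices without neighbors are ignored.
    """
    codes = np.asarray(codes, dtype=float)
    neighbor_sum = np.zeros_like(codes)
    degree = np.zeros(codes.shape[0])
    for i, j in edges:
        neighbor_sum[i] += codes[j]
        neighbor_sum[j] += codes[i]
        degree[i] += 1
        degree[j] += 1
    has_nb = degree > 0
    if not has_nb.any():
        return 0.0
    diff = codes[has_nb] - neighbor_sum[has_nb] / degree[has_nb][:, None]
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def region_sampling_probs(grad_magnitudes, floor=0.05):
    """Map per-region gradient magnitudes to camera-sampling probabilities.

    The floor (an assumption) keeps every region sampled occasionally.
    """
    names = list(grad_magnitudes)
    g = np.array([grad_magnitudes[n] for n in names], dtype=float)
    if g.sum() > 0:
        probs = g / g.sum()
    else:
        probs = np.full(len(names), 1.0 / len(names))
    probs = np.maximum(probs, floor)
    probs = probs / probs.sum()
    return dict(zip(names, probs))

# Toy usage: a 3-vertex chain, and a face-heavy edit.
loss = laplacian_smoothness([[0.0], [1.0], [0.0]], [(0, 1), (1, 2)])
probs = region_sampling_probs({"face": 3.0, "head": 1.0, "body": 0.5, "arms": 0.5})
rng = np.random.default_rng(0)
region = rng.choice(list(probs), p=list(probs.values()))
```

In an optimization loop, the smoothness value would be added to the editing loss with some weight, and the region probabilities would be refreshed periodically from recent gradients; both of those details are left open here.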

3. Key Contributions

Theoretical Analysis of SDS for Editing: The paper provides the first in-depth decomposition of SDS terms specifically for the dual-condition (image + text) editing scenario, revealing why standard SDS fails for editing and how to fix it.
SDS-E (Score Distillation Sampling for Editing): A novel loss function that selectively applies SDS terms across timesteps to balance instruction adherence with source preservation.
InstructHumans Framework: A complete pipeline integrating SDS-E with a hybrid human representation, enabling high-fidelity, animatable 3D edits.
Efficiency and Quality Innovations: Introduction of Gradient-Aware Viewpoint Sampling (for speed and focus) and Laplacian Smoothness (for texture quality).

4. Experimental Results

The authors evaluated InstructHumans against state-of-the-art methods, including IN2N (NeRF editing), AvatarCLIP, TADA (Generation), and standard SDS variants (SSD, NFSD).

Qualitative Results:
- Fidelity: InstructHumans produces sharper, more photorealistic textures compared to the blurry or spotty results of SDS-based baselines.
- Consistency: Unlike generation methods (TADA, AvatarCLIP) that often alter the subject's identity, InstructHumans preserves the original avatar's face and unedited body parts while applying the requested changes (e.g., changing clothes to a kimono or adding makeup).
- Animation: The edited avatars remain fully animatable and driveable via SMPL-X pose parameters without artifacts.
Quantitative Results:
- CLIP-Direc (Text Alignment): Outperforms IN2N and SDS variants.
- CLIP-Img (Image Consistency): Achieves the highest scores, indicating better preservation of the original avatar compared to other editing methods.
- LPIPS (Perceptual Quality): Significantly lower (better) than baselines, indicating fewer artifacts.
- User Study: In a study with 315 participants, InstructHumans was significantly preferred over all baselines in Visual Quality, Image Consistency, and Text Consistency.
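
The two CLIP-based scores can be illustrated on precomputed embeddings. This sketch assumes that CLIP-Direc is the usual CLIP directional similarity (cosine between the image-embedding change and the text-embedding change) and that CLIP-Img is a cosine similarity between source and edited image embeddings; the summary does not spell out either definition, and the function names are hypothetical. The vectors below are toy stand-ins, not real CLIP outputs.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_directional_similarity(img_src, img_edit, txt_src, txt_tgt):
    """Does the image change point the same way as the text change?"""
    d_img = np.asarray(img_edit, dtype=float) - np.asarray(img_src, dtype=float)
    d_txt = np.asarray(txt_tgt, dtype=float) - np.asarray(txt_src, dtype=float)
    return cosine(d_img, d_txt)

def clip_image_similarity(img_src, img_edit):
    """How close does the edited render stay to the source render?"""
    return cosine(img_src, img_edit)

# Toy 2-D embeddings standing in for CLIP features.
direc = clip_directional_similarity([1, 0], [1, 1], [1, 0], [1, 2])
consistency = clip_image_similarity([1, 0], [1, 1])
```

Higher is better for both scores, which pull in opposite directions: an edit that changes nothing maximizes image similarity but leaves the directional score undefined, because the image change is then a zero vector.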

5. Significance

Bridging Generation and Editing: The paper clarifies the fundamental differences between 3D generation and 3D editing, demonstrating that "generation" tools cannot be naively repurposed for "editing" tasks without modification.
Practical Application: It enables intuitive, text-driven customization of 3D human assets for applications in gaming, virtual reality, and digital fashion, where maintaining character identity is crucial.
Generalizability: While focused on humans, the SDS-E decomposition and temporal staging strategy offer a generalizable approach for improving text-guided 3D editing in other domains (e.g., objects, scenes).

In summary, InstructHumans tackles identity loss in text-driven 3D editing by re-engineering the distillation process (SDS-E) and optimizing the rendering pipeline. The result is high-fidelity, animatable 3D human edits that follow user instructions while preserving the source avatar's identity.