WildGHand: Learning Anti-Perturbation Gaussian Hand Avatars from Monocular In-the-Wild Videos

WildGHand is an optimization-based framework that reconstructs high-fidelity 3D Gaussian hand avatars from monocular in-the-wild videos. It uses dynamic perturbation disentanglement and perturbation-aware optimization strategies to overcome challenges such as hand-object interactions, extreme poses, and motion blur.

Hanhui Li, Xuan Huang, Wanquan Liu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang, Chenqiang Gao

Published 2026-02-25

Imagine you are trying to take a perfect, high-definition photo of your hand to create a 3D digital twin (an "avatar") that you can use in video games or virtual reality.

In a perfect world, you would do this in a studio with perfect lighting, a steady camera, and no distractions. But in the real world ("in the wild"), things go wrong. Your hand might be holding a coffee cup (occlusion), the light might suddenly change from bright sun to a dark room, your hand might move so fast it gets blurry, or you might be doing a weird, twisted pose.

Most existing 3D hand technologies are like perfectionist chefs who can only cook if the kitchen is spotless and the ingredients are perfect. If you give them a messy kitchen, they burn the food or give up.

WildGHand is like a master chef who can cook a gourmet meal even in a chaotic, stormy kitchen.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Noise" vs. The "Signal"

When you record a video of your hand in the real world, the camera sees two things mixed together:

  • The Signal: Your actual hand (the shape, the skin texture, the wrinkles).
  • The Noise: The messiness (blur, shadows, objects blocking the view, weird lighting).

Old methods try to learn everything they see. So, if your hand is blurry, the 3D model learns to be blurry. If a coffee cup blocks your finger, the model learns that your finger is actually a coffee cup. This results in a weird, distorted digital hand.

2. The Solution: Two Special Tools

WildGHand uses two clever tricks to separate the "Signal" from the "Noise."

Trick #1: The "Time-Traveling Filter" (Dynamic Perturbation Disentanglement)

Imagine you are watching a movie of your hand, but every few seconds, a ghost appears and changes the color of the screen or blurs the image.

  • Old Method: The computer tries to memorize the ghost and the hand together.
  • WildGHand: It has a special "Time-Traveling Filter." It knows that the ghost (the noise) only shows up at specific times.
    • It creates a mental note: "At second 5, the image is blurry. At second 12, the light is too bright."
    • It learns to add a correction to the 3D model to cancel out these specific moments.
    • The Magic: When it's time to show the final 3D hand to the user, it simply turns off the filter. The ghost disappears, and you are left with a clean, perfect hand, even though the original video was messy.
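The "mental note per moment in time" idea above can be sketched in code. This is a hypothetical illustration, not the paper's actual implementation: a small learnable code per video frame predicts an additive correction (here simplified to a global color residual) that is applied to the rendered image only during training, and switched off at test time. All class and variable names are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class PerturbationField(nn.Module):
    """Hypothetical sketch of dynamic perturbation disentanglement:
    a per-frame latent code predicts an additive 'noise' correction that is
    applied ONLY during training, so the clean hand and the frame-specific
    perturbation (blur, lighting shift) are modeled separately."""

    def __init__(self, num_frames: int, latent_dim: int = 16):
        super().__init__()
        # one learnable code per frame ("at second 5 it's blurry,
        # at second 12 the light is too bright")
        self.frame_codes = nn.Embedding(num_frames, latent_dim)
        # tiny decoder: frame code -> RGB residual (a single global color
        # shift here, just to keep the sketch small)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 3)
        )

    def forward(self, rendered: torch.Tensor, frame_idx: torch.Tensor,
                training: bool = True) -> torch.Tensor:
        if not training:
            # "turn off the filter": at test time, return the clean render
            return rendered
        code = self.frame_codes(frame_idx)   # (B, latent_dim)
        residual = self.decoder(code)        # (B, 3)
        # the reconstruction loss is then explained by render + residual,
        # so the render itself is free to stay clean
        return rendered + residual.view(-1, 3, 1, 1)
```

Because the per-frame residual absorbs the messiness during optimization, simply skipping it at inference yields the clean avatar.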

Trick #2: The "Smart Spotlight" (Perturbation-Aware Optimization)

Imagine you are trying to paint a picture of your hand, but someone keeps throwing mud on the canvas.

  • Old Method: The painter tries to paint over the mud, making the whole picture muddy.
  • WildGHand: It uses a Smart Spotlight.
    • When the computer sees a part of the image that looks "weird" (like a blurry finger or a coffee cup blocking the view), the spotlight dims that area. It says, "I don't trust this part of the image. Don't let the painter learn from this spot."
    • It shines a bright light only on the clear, trustworthy parts of the hand.
    • This ensures the 3D model only learns from the good parts of the video, ignoring the "mud."
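The "spotlight" can be read as a per-pixel confidence map that weights the training loss. The sketch below is an assumption about how such a perturbation-aware objective could look (the function name and the use of an L1 photometric error are illustrative, not taken from the paper): untrusted pixels contribute nothing to the gradient, so the model never learns from the "mud."

```python
import torch

def perturbation_aware_loss(rendered: torch.Tensor,
                            target: torch.Tensor,
                            confidence: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of the 'smart spotlight': a per-pixel confidence
    map in [0, 1] dims the loss wherever the frame looks unreliable
    (blur, occluding objects), so only trusted regions drive learning.

    rendered, target: (B, 3, H, W) images; confidence: (B, 1, H, W)."""
    per_pixel = (rendered - target).abs()        # L1 photometric error
    weighted = confidence * per_pixel            # occluded pixels -> ~0
    # normalize by total confidence so heavily dimmed frames
    # don't shrink the overall loss scale
    denom = confidence.expand_as(per_pixel).sum().clamp(min=1e-6)
    return weighted.sum() / denom
```

A useful property of this weighting: corrupting a region whose confidence is zero (say, a coffee cup covering a finger) leaves the loss unchanged, which is exactly the "don't learn from this spot" behavior described above.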

3. The New Playground (The HWP Dataset)

To prove their method works, the researchers realized that existing test videos were too easy (like practicing in a quiet library). So, they built a new "gym" called the HWP Dataset.

  • This dataset is full of "chaos": people spinning pens, shuffling cards, applying lotion, and moving their hands in crazy ways, all while the camera shakes or the lights flicker.
  • It's like a "stress test" for their 3D hand model.

The Result

When they tested WildGHand against other top methods:

  • Other methods produced hands that looked like melted wax, had missing fingers, or looked like they were made of plastic.
  • WildGHand produced hands that looked real, with detailed skin texture, veins, and nails, even when the input video was terrible.

In summary: WildGHand is a smart system that doesn't just "look" at a messy video; it actively figures out what is wrong with the video, ignores the bad parts, and mathematically cleans up the image to build a perfect 3D hand avatar. It's the difference between trying to see through a dirty window and having a magical wiper that cleans the glass just for you.
