The Problem: The "Motion Blur" Mystery
Imagine you are trying to take a photo of a fast-moving dancer. If you leave your camera shutter open for too long, the dancer doesn't look like a sharp, clear person. Instead, they look like a smear of paint or a ghostly streak. This is called motion blur.
In the world of computer vision, scientists want to turn videos of people into 3D digital avatars (like video game characters that you can rotate and watch from any angle). However, existing 3D models are very picky: they only work well if the input video is crystal clear. If the video is blurry because the person was moving fast, the 3D model gets confused. It might build a twisted, distorted avatar because it can't tell if the blur is caused by the person moving or if the person is just a weird shape.
The Core Challenge: It's like trying to solve a jigsaw puzzle where the pieces are all melted together. You can't tell where one piece ends and the next begins.
The Solution: "MAD-Avatar" (The Time-Traveling Chef)
The authors of this paper created a new method called MAD-Avatar. Instead of trying to fix the blurry video first (like sharpening a photo) and then building the 3D model, they do both at the same time.
Think of the process like a Time-Traveling Chef:
- The Blurry Dish: You have a blurry photo (the "dish") that looks like a smoothie of a person.
- The Secret Recipe (Physics): The Chef knows the laws of physics. They know that a smoothie is just many distinct ingredients blended together.
- The Reverse Blend: Instead of guessing the shape, the Chef works backward. They ask: "If I had a super-sharp version of this person at 100 different tiny moments in time, and I blended them all together, would I get this blurry photo?"
How It Works: The Three Magic Tricks
1. The "Virtual Time-Slice" Camera
Normally, a camera takes one picture in, say, 1/50th of a second. During that tiny moment, a person moves a little bit, creating a blur.
MAD-Avatar imagines that inside that 1/50th of a second, the camera actually took 100 tiny, super-fast snapshots (virtual sharp images).
- Analogy: Imagine a fan spinning so fast it looks like a solid disk. MAD-Avatar imagines stopping the fan 100 times in a split second to see every single blade clearly, then mathematically "blending" them back together to see if it matches the real blurry photo.
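The "blend the time-slices back together" idea can be sketched in a few lines. This is a minimal toy model, not the paper's actual renderer: it assumes a blurry frame is simply the average of many sharp images rendered at evenly spaced moments within one exposure, and the `render_sharp` function here is a made-up stand-in for a real 3D renderer.

```python
import numpy as np

def simulate_blur(render_sharp, num_slices=100):
    """Average `num_slices` sharp renders taken at evenly spaced
    times within one exposure to synthesize the blurry frame.

    `render_sharp(t)` returns a sharp image for a normalized
    exposure time t in [0, 1].
    """
    times = np.linspace(0.0, 1.0, num_slices)
    frames = [render_sharp(t) for t in times]
    return np.mean(frames, axis=0)

# Toy scene: a single bright pixel (the "person") sliding across
# a 1 x 10 strip of pixels during the exposure.
def render_sharp(t):
    img = np.zeros(10)
    img[int(t * 9.99)] = 1.0  # position moves with time
    return img

blurry = simulate_blur(render_sharp, num_slices=100)
# The lone bright pixel is now smeared across the whole strip:
# each position holds only a fraction of the original brightness.
```

The optimization then runs this forward model in a loop: adjust the sharp 3D model until the simulated blur matches the real blurry photo.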
2. The "Skeleton Puppeteer" (SMPL)
To keep the 3D model from turning into a blob, the system uses a digital skeleton (called SMPL).
- Analogy: Think of a marionette puppet. Even if the puppet is moving so fast it's a blur, the puppeteer knows exactly how the strings are pulling the joints. The system uses this "puppeteer logic" to guess how the joints moved during the blur. It forces the 3D model to stay in a human shape, even when the video is messy.
3. The "Consistency Check" (Regularization)
Sometimes, the math gets confused: a left-to-right motion and a right-to-left motion can produce exactly the same blur, so the blurry photo alone can't tell them apart.
- Analogy: Imagine watching a movie where the character suddenly teleports from the left side of the room to the right side in one frame. It looks weird and unnatural.
- The system adds a rule: "Motion must be smooth." It checks that the movement from one frame to the next makes sense. If the math suggests the person teleported, the system says, "No, that's wrong," and corrects the direction.
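The "motion must be smooth" rule can be written as a penalty term added to the optimization. This is a minimal sketch of the general idea (a standard temporal smoothness loss), not the paper's exact formulation: it sums the squared frame-to-frame changes in joint positions, so a "teleport" costs far more than the same travel spread over many frames.

```python
import numpy as np

def smoothness_penalty(joint_positions):
    """Penalize sudden jumps in joint positions across consecutive
    time steps: the sum of squared frame-to-frame differences.

    `joint_positions` has shape (T, J, 3): T time steps, J joints,
    3D coordinates. Smooth motion -> small penalty; teleport -> large.
    """
    diffs = joint_positions[1:] - joint_positions[:-1]
    return float(np.sum(diffs ** 2))

# Smooth motion: one joint gliding 1 unit over 10 equal steps.
smooth = np.zeros((10, 1, 3))
smooth[:, 0, 0] = np.linspace(0.0, 1.0, 10)

# Teleport: the same joint covering the whole distance in one step.
teleport = np.zeros((10, 1, 3))
teleport[5:, 0, 0] = 1.0

# The teleport is penalized much more heavily, so the optimizer
# prefers the smooth interpretation of the blur.
```

Squaring the differences is what makes one big jump cost more than many small steps covering the same distance.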
Why This Is a Big Deal
Previous methods tried to fix the blur first (like using Photoshop to sharpen a photo) and then build the 3D model.
- The Old Way: Like trying to sharpen a blurry photo of a crowd, then building a 3D model of one specific person. The sharpening step knows nothing about 3D, so it can invent details or smear edges, and those artifacts then confuse the 3D builder.
- The New Way (MAD-Avatar): It builds the 3D model while understanding the blur. It treats the blur as a clue, not just a mistake.
The Results
The team tested this on:
- Fake Data: They took clear videos of dancers, artificially blurred them, and saw if the AI could recover the sharp 3D version.
- Real Data: They built a special camera rig with 12 cameras spinning around a person. Some cameras took blurry photos (simulating a fast shutter), and others took sharp ones.
- The iPhone Demo: They even made it work with a video taken on a regular iPhone 16 Pro.
The Verdict: The new method creates much sharper, more detailed 3D avatars than previous methods, even when the input video is very blurry. It successfully recovers the "ghost" of the person and turns them back into a solid, rotatable 3D character.
Summary
MAD-Avatar is a smart system that doesn't just "erase" blur. Instead, it understands how blur happens. It acts like a detective who looks at a smeared fingerprint and reconstructs the exact shape of the finger that made it, allowing us to create perfect 3D digital twins from messy, real-world videos.