Egocentric Visibility-Aware Human Pose Estimation

This paper addresses the challenge of keypoint invisibility in egocentric human pose estimation by introducing the large-scale, visibility-annotated Eva-3M dataset and the novel EvaPose method, which leverages explicit visibility information to achieve state-of-the-art performance.

Peng Dai, Yu Zhang, Yiqiang Feng, Zhen Fan, Yang Zhang

Published 2026-03-02

Imagine you are wearing a high-tech pair of VR goggles (like a headset for a video game). You want the computer to know exactly where your hands, elbows, knees, and feet are in 3D space so it can move your virtual avatar accordingly.

This is called Egocentric Human Pose Estimation. "Egocentric" just means "from your own eyes."

The problem? Your own body gets in the way.

The Core Problem: The "Blind Spot"

When you wear a headset, the cameras are stuck on your forehead.

  • If you reach your hands out in front of you, the cameras see them.
  • But if you kick your leg back, or if your arm swings behind your back, the camera cannot see it. It's blocked by your own body or simply out of the camera's view.

Previous computer programs tried to guess where these "invisible" body parts were, but they treated them exactly like the "visible" ones. It's like a detective trying to solve a crime where half the witnesses are missing, yet still writing down "testimony" for them as if they were in the room. The result? The detective gets confused, and even the guesses for the visible parts (the ones the detective can actually see) become sloppy and inaccurate.

The Solution: A New Dataset and a New "Brain"

The authors of this paper, working with ByteDance (the company behind TikTok), decided to fix this by doing two main things:

1. The New Map: "Eva-3M"

To teach computers how to handle these blind spots, you need a massive library of examples.

  • The Old Way: Previous datasets were either fake (computer-generated) or didn't tell the computer which body parts were hidden.
  • The New Way (Eva-3M): They recorded over 3 million frames of real people wearing a Pico VR headset doing 24 different daily activities (like kicking, stretching, sitting).
  • The Secret Sauce: They manually labeled 435,000 of these frames to tell the computer exactly: "This hand is visible. That knee is hidden behind the body." It's like giving the detective a map that says, "Witness A is here, but Witness B is hiding in the closet."
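The paper doesn't spell out Eva-3M's file format, but the idea of per-joint visibility labels can be sketched roughly like this (all field names, the joint count, and the loss function are illustrative assumptions, not the dataset's real schema):

```python
# Hypothetical sketch of one visibility-annotated training sample.
sample = {
    "frame_id": 1042,
    "activity": "kicking",            # one of the 24 daily activities
    "joints_3d": [[0.1, 1.4, 0.3],    # (x, y, z) per joint, truncated for brevity
                  [0.2, 1.1, 0.4]],
    "visibility": [1, 0],             # 1 = seen by the headset cameras, 0 = hidden
}

def masked_error(pred, target, visible):
    """Average position error over visible joints only, so hidden
    joints can't drag down the accuracy of the visible ones."""
    errs = [abs(p - t) for p, t, v in zip(pred, target, visible) if v]
    return sum(errs) / max(len(errs), 1)
```

The point of the label is exactly this kind of masking: once the computer knows which "witnesses" are in the closet, it can stop grading itself on them the same way.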

2. The New Brain: "EvaPose"

They built a new AI system called EvaPose that uses this map to get smarter. Think of it as a three-step cooking process:

  • Step 1: The "Memory Bank" (VQ-VAE)
    Before the AI even looks at the video, it studies a huge library of human movement (like a dance archive). It learns the "rules of physics" for how human bodies move. It knows that if your left arm is up, your right arm is probably down, and your legs are likely standing straight. This acts as a safety net. If the camera can't see your leg, the AI uses this "memory" to guess a realistic position instead of hallucinating a weird, impossible pose.

  • Step 2: The "Detective" (Visibility-Aware Network)
    This is the main chef. When looking at the video, it doesn't just guess the pose; it first asks, "Can I see this elbow?"

    • If YES: It focuses heavily on getting the position perfect.
    • If NO: It says, "Okay, I can't see this, so I won't force a guess based on blurry pixels. I'll rely more on my 'Memory Bank' from Step 1."
    By treating visible and invisible parts differently, it stops the "invisible" confusion from messing up the "visible" accuracy.

  • Step 3: The "Time-Traveler" (Attention Module)
    Humans don't move in frozen pictures; we move in smooth movies. This step looks at the video over time. If the AI is unsure about a hand in frame 10, it looks at frame 9 and frame 11 to smooth out the guess. It connects the dots across time to make the movement look fluid and natural, not jittery.
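Step 1's "memory bank" is a vector-quantized codebook: a dictionary of plausible "pose words" that any guess gets snapped to. A minimal sketch of that lookup, with a random stand-in codebook (the sizes and values are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # 512 learned "pose words", 64-dim each

def quantize(latent):
    """Snap a pose latent to its nearest codebook entry (the VQ step)."""
    dists = np.linalg.norm(codebook - latent, axis=1)  # distance to every entry
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

idx, snapped = quantize(rng.normal(size=64))
```

Because every output is forced onto a real entry in the codebook, the system can't "hallucinate" a pose that was never in the library of human movement it studied.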
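Steps 2 and 3 together can be caricatured in a few lines: blend the image-based guess with the memory-bank prior according to visibility, then smooth each frame with its neighbours. The blending rule and the averaging window here are toy stand-ins for the paper's network and attention module, not its actual implementation:

```python
def fuse(image_guess, prior_guess, visibility):
    """Per-joint blend: trust the pixels when a joint is visible,
    fall back on the pose prior when it is hidden."""
    return [v * img + (1 - v) * pri
            for img, pri, v in zip(image_guess, prior_guess, visibility)]

def temporal_smooth(track):
    """Average each frame with its neighbours — a crude stand-in
    for attention over time, which removes jitter."""
    out = []
    for t in range(len(track)):
        window = track[max(t - 1, 0):t + 2]
        out.append(sum(window) / len(window))
    return out
```

For example, `fuse([1.0, 9.0], [1.2, 2.0], [1.0, 0.0])` keeps the camera's estimate for the first (visible) joint and the prior's estimate for the second (hidden) one.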

Why Does This Matter?

Imagine you are playing a VR game where you have to catch a virtual ball.

  • Old AI: Because it got confused by your hidden arm, it thought your hand was in the wrong place. You missed the ball.
  • EvaPose: It knows your arm is hidden, so it uses its "memory" of how arms move to guess correctly. It knows exactly where your hand is, even though the camera can't see it. You catch the ball.

The Bottom Line

The paper introduces a giant new dataset that teaches computers to recognize when body parts are hidden, and a new AI method that uses this knowledge to predict human poses with record-breaking accuracy. It's a huge leap forward for making Virtual Reality and Augmented Reality feel truly real and responsive.
