Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose

This paper introduces I2S, a lightweight, real-time framework that achieves state-of-the-art user identification (97.52% F1-score) in AR-based security systems by analyzing 3D hand poses and human-object interactions through a novel multi-stage feature extraction process.

Muhammad Hamza, Danish Hamid, Muhammad Tahir Akram

Published 2026-03-03

Imagine you are wearing a pair of smart glasses that act as a super-smart assistant. You're in a busy airplane cockpit or a high-tech operating room, and you need to pick up a tool, open a valve, or adjust a dial. Usually, these glasses would need to see your face to know who you are and what permissions you have. But what if your face is covered by a mask, or the lighting is too dark?

This paper introduces a clever new way to identify you: by watching how your hands dance with objects.

Here is the story of their invention, called I2S (which stands for Interact2Sign), broken down into simple concepts.

1. The Problem: The "Faceless" Identity Crisis

In high-stakes jobs like flying planes or performing surgery, security is everything. You need to prove who you are instantly. Traditional methods (like scanning your face or fingerprint) don't work well in these scenarios because:

  • You might be wearing a helmet or mask.
  • The camera is on your chest or head (looking out, not at you), so it can't see your face.
  • It needs to be fast and work on small, lightweight devices (like a headset), not a giant supercomputer.

2. The Solution: The "Handshake" with the World

The researchers realized that while your face might be hidden, your hands are always busy. When you pick up a wrench, open a jar, or type on a keyboard, you do it in a very specific, unique way. It's like a signature, but instead of ink on paper, it's movement in 3D space.

They built a system that doesn't just look at what you are holding, but how you hold it.

3. How I2S Works: The Three-Step Detective

Think of the I2S system as a three-step detective process that gets smarter at every stage:

  • Step 1: "What is that?" (Object Detection)
    First, the system looks at your hand's shape and position to guess what object you are touching. Is it a coffee cup? A screwdriver? A laptop?

    • Analogy: It's like a waiter seeing you reach for a menu and knowing you are about to order, even before you speak.
  • Step 2: "What are you doing?" (Interaction Recognition)
    Once it knows the object, it asks, "How are you using it?" Are you grasping it tightly? Are you twisting it? Are you holding it steady?

    • Analogy: The waiter now sees you screwing the cap off the bottle, not just holding it.
  • Step 3: "Who are you?" (User Identification)
    This is the magic trick. Now that the system knows you are "screwing a cap off a bottle," it looks at the style of your movement. Do you twist fast? Do you use two hands? Do you hold it at a weird angle?

    • Analogy: The waiter recognizes your unique "twisting style" and says, "Ah, that's definitely Captain Smith! He always twists counter-clockwise with a quick flick of the wrist."
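The three-step cascade above can be sketched in a few lines of code. This is a toy illustration of the idea only, not the paper's actual models: the function names, the simple rules inside each stage, and the 21-joint hand-pose format (as produced by common hand trackers) are all assumptions for the sake of the example.

```python
# Toy sketch of the I2S three-stage cascade: object -> interaction -> user.
# All rules and names here are illustrative assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def detect_object(hand_pose):
    """Stage 1: guess the object from the hand's shape and position."""
    # Toy rule: hand "spread" (thumb tip to pinky tip) hints at object size.
    spread = np.linalg.norm(hand_pose[4] - hand_pose[20])
    return "toolbox" if spread > 0.15 else "screw"

def recognize_interaction(hand_pose, obj):
    """Stage 2: classify how the object is being used."""
    # Toy rule: wrist height separates "lift" from "twist".
    return "lift" if hand_pose[0][2] > 0.5 else "twist"

def identify_user(hand_pose, obj, interaction):
    """Stage 3: match movement style, conditioned on object + interaction."""
    # A real system would compare style features against enrolled users;
    # here we just map the context to a placeholder label.
    return f"user_{(len(obj) + len(interaction)) % 3}"

# A fake 21-joint 3D hand pose (21 x 3 array of x, y, z positions).
pose = rng.uniform(0, 1, size=(21, 3))
obj = detect_object(pose)
act = recognize_interaction(pose, obj)
user = identify_user(pose, obj, act)
```

Note how each stage feeds the next: knowing the object and the interaction narrows down which movement "style" to compare, which is what makes the final identification step tractable.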

4. The Secret Sauce: The "Hand Envelope"

The researchers didn't just use standard math. They invented a new tool called the Inter-Hand Spatial Envelope (IHSE).

Imagine your two hands are holding a balloon. The "envelope" is the invisible shape and size of the space between your hands.

  • If you are holding a tiny screw, your hands are close together.
  • If you are holding a big toolbox, your hands are far apart.

The system measures this "invisible balloon" to understand the coordination between your left and right hands. That coordination is a huge clue for identifying who you are, because everyone coordinates their hands differently.
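One simple way to picture the envelope idea in code: take the bounding box spanned by both hands' 3D keypoints and the distance between the hands. To be clear, the paper's actual IHSE features may be computed differently; this sketch only demonstrates the "invisible balloon" intuition.

```python
# Illustrative "inter-hand spatial envelope": the 3D bounding box spanned by
# both hands' keypoints, plus the gap between the two hands' centroids.
# A simplified stand-in for the paper's IHSE features, not the real thing.
import numpy as np

def inter_hand_envelope(left, right):
    """Return envelope extents (x, y, z) and the inter-hand distance.

    left, right: (21, 3) arrays of 3D hand-joint positions in metres.
    """
    joints = np.vstack([left, right])                   # all 42 joints
    extents = joints.max(axis=0) - joints.min(axis=0)   # envelope size
    gap = np.linalg.norm(left.mean(axis=0) - right.mean(axis=0))
    return extents, gap

# Hands close together (tiny screw) vs far apart (big toolbox):
close_l = np.zeros((21, 3)); close_r = np.full((21, 3), 0.05)
far_l   = np.zeros((21, 3)); far_r   = np.full((21, 3), 0.60)

_, gap_close = inter_hand_envelope(close_l, close_r)
_, gap_far   = inter_hand_envelope(far_l, far_r)
```

The toolbox case yields a much larger envelope and centroid gap than the screw case, which is exactly the size-and-coordination signal the envelope is meant to capture.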

5. Why This is a Big Deal

Most modern AI systems are like elephants: they are huge, eat a lot of data (food), and move slowly. They need massive computers to run.

The I2S system is like a sprinter:

  • Lightweight: It's tiny (under 4 MB), so it fits easily on a smart glasses chip.
  • Fast: It makes a decision in 0.1 seconds (faster than a blink).
  • Accurate: It scored a 97.52% F1-score, meaning it almost always names the right person.

The Bottom Line

This paper proves that you don't need to see someone's face to know who they are. By watching how they interact with the world around them—how they grab, twist, and hold objects—you can identify them with incredible accuracy.

It's like recognizing a friend not by their face, but by the way they walk into a room and pick up their coffee cup. This technology could soon make our augmented reality glasses smarter, safer, and more personal, ensuring that the right person gets the right help at the right time, even in the most dangerous jobs.