Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose

This paper introduces I2S, a lightweight, real-time framework that achieves state-of-the-art user identification (97.52% F1-score) in AR-based security systems by analyzing 3D hand poses and human-object interactions through a novel multi-stage feature extraction process.

Muhammad Hamza, Danish Hamid, Muhammad Tahir Akram

Published 2026-03-03

Imagine you are wearing a pair of smart glasses that act as a super-smart assistant. You're in a busy airplane cockpit or a high-tech operating room, and you need to pick up a tool, open a valve, or adjust a dial. Usually, these glasses would need to see your face to know who you are and what permissions you have. But what if your face is covered by a mask, or the lighting is too dark?

This paper introduces a clever new way to identify you: by watching how your hands dance with objects.

Here is the story of their invention, called I2S (which stands for Interact2Sign), broken down into simple concepts.

1. The Problem: The "Faceless" Identity Crisis

In high-stakes jobs like flying planes or performing surgery, security is everything. You need to prove who you are instantly. Traditional methods (like scanning your face or fingerprint) don't work well in these scenarios because:

  • You might be wearing a helmet or mask.
  • The camera is on your chest or head (looking out, not at you), so it can't see your face.
  • It needs to be fast and work on small, lightweight devices (like a headset), not a giant supercomputer.

2. The Solution: The "Handshake" with the World

The researchers realized that while your face might be hidden, your hands are always busy. When you pick up a wrench, open a jar, or type on a keyboard, you do it in a very specific, unique way. It's like a signature, but instead of ink on paper, it's movement in 3D space.

They built a system that doesn't just look at what you are holding, but how you hold it.

3. How I2S Works: The Three-Step Detective

Think of the I2S system as a three-step detective process that gets smarter at every stage:

  • Step 1: "What is that?" (Object Detection)
    First, the system looks at your hand's shape and position to guess what object you are touching. Is it a coffee cup? A screwdriver? A laptop?

    • Analogy: It's like a waiter seeing you reach for a menu and knowing you are about to order, even before you speak.
  • Step 2: "What are you doing?" (Interaction Recognition)
    Once it knows the object, it asks, "How are you using it?" Are you grasping it tightly? Are you twisting it? Are you holding it steady?

    • Analogy: The waiter now sees you screwing the cap off the bottle, not just holding it.
  • Step 3: "Who are you?" (User Identification)
    This is the magic trick. Now that the system knows you are "screwing a cap off a bottle," it looks at the style of your movement. Do you twist fast? Do you use two hands? Do you hold it at a weird angle?

    • Analogy: The waiter recognizes your unique "twisting style" and says, "Ah, that's definitely Captain Smith! He always twists counter-clockwise with a quick flick of the wrist."
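The three-step cascade above can be sketched in a few lines of code. This is a toy illustration of the idea only, not the paper's actual models: the function names, the simple rules inside each stage, and the 21-joint hand-pose format (as produced by common hand trackers) are all assumptions for the sake of the example.

```python
# Toy sketch of the I2S three-stage cascade: object -> interaction -> user.
# All rules and names here are illustrative assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

def detect_object(hand_pose):
    """Stage 1: guess the object from the hand's shape and position."""
    # Toy rule: hand "spread" (thumb tip to pinky tip) hints at object size.
    spread = np.linalg.norm(hand_pose[4] - hand_pose[20])
    return "toolbox" if spread > 0.15 else "screw"

def recognize_interaction(hand_pose, obj):
    """Stage 2: classify how the object is being used."""
    # Toy rule: wrist height separates "lift" from "twist".
    return "lift" if hand_pose[0][2] > 0.5 else "twist"

def identify_user(hand_pose, obj, interaction):
    """Stage 3: match movement style, conditioned on object + interaction."""
    # A real system would compare style features against enrolled users;
    # here we just map the context to a placeholder label.
    return f"user_{(len(obj) + len(interaction)) % 3}"

# A fake 21-joint 3D hand pose (21 x 3 array of x, y, z positions).
pose = rng.uniform(0, 1, size=(21, 3))
obj = detect_object(pose)
act = recognize_interaction(pose, obj)
user = identify_user(pose, obj, act)
```

Note how each stage feeds the next: knowing the object and the interaction narrows down which movement "style" to compare, which is what makes the final identification step tractable.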

4. The Secret Sauce: The "Hand Envelope"

The researchers didn't just use standard math. They invented a new tool called the Inter-Hand Spatial Envelope (IHSE).

Imagine your two hands are holding a balloon. The "envelope" is the invisible shape and size of the space between your hands.

  • If you are holding a tiny screw, your hands are close together.
  • If you are holding a big toolbox, your hands are far apart.

The system measures this "invisible balloon" to understand the coordination between your left and right hands. That coordination is a huge clue for identifying who you are, because everyone coordinates their hands differently.
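One simple way to picture the envelope idea in code: take the bounding box spanned by both hands' 3D keypoints and the distance between the hands. To be clear, the paper's actual IHSE features may be computed differently; this sketch only demonstrates the "invisible balloon" intuition.

```python
# Illustrative "inter-hand spatial envelope": the 3D bounding box spanned by
# both hands' keypoints, plus the gap between the two hands' centroids.
# A simplified stand-in for the paper's IHSE features, not the real thing.
import numpy as np

def inter_hand_envelope(left, right):
    """Return envelope extents (x, y, z) and the inter-hand distance.

    left, right: (21, 3) arrays of 3D hand-joint positions in metres.
    """
    joints = np.vstack([left, right])                   # all 42 joints
    extents = joints.max(axis=0) - joints.min(axis=0)   # envelope size
    gap = np.linalg.norm(left.mean(axis=0) - right.mean(axis=0))
    return extents, gap

# Hands close together (tiny screw) vs far apart (big toolbox):
close_l = np.zeros((21, 3)); close_r = np.full((21, 3), 0.05)
far_l   = np.zeros((21, 3)); far_r   = np.full((21, 3), 0.60)

_, gap_close = inter_hand_envelope(close_l, close_r)
_, gap_far   = inter_hand_envelope(far_l, far_r)
```

The toolbox case yields a much larger envelope and centroid gap than the screw case, which is exactly the size-and-coordination signal the envelope is meant to capture.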

5. Why This is a Big Deal

Most modern AI systems are like elephants: they are huge, eat a lot of data (food), and move slowly. They need massive computers to run.

The I2S system is like a sprinter:

  • Lightweight: It's tiny (under 4 MB), so it fits easily on a smart glasses chip.
  • Fast: It makes a decision in 0.1 seconds (faster than a blink).
  • Accurate: It scored a 97.52% F1-score, meaning it almost always names the right person.

The Bottom Line

This paper proves that you don't need to see someone's face to know who they are. By watching how they interact with the world around them—how they grab, twist, and hold objects—you can identify them with incredible accuracy.

It's like recognizing a friend not by their face, but by the way they walk into a room and pick up their coffee cup. This technology could soon make our augmented reality glasses smarter, safer, and more personal, ensuring that the right person gets the right help at the right time, even in the most dangerous jobs.