Imagine you are trying to tell a robot which object to pick up, but you can't use your hands. You have to use your eyes. This is a lifeline for people with severe motor disabilities, but it's a tricky game.
Here is the problem: your eyes are jittery. Even when you try to stare at a cup, your eyes make tiny, involuntary jumps (called microsaccades). If the robot is too sensitive to these jumps, it thinks you're looking at the cup next to it. If it guards against them by demanding a long, steady stare, you have to fixate on the cup for five seconds just to make it move, which is exhausting and frustrating.
This paper introduces a new system called "Sticky-Glance" that solves this problem. Here is how it works, explained simply:
1. The "Magnet" Analogy (Sticky-Glance)
Think of the objects on the table not just as physical things, but as magnets.
- Old Way: The robot waits for you to stare at a magnet for a long time. If your eye jitters even a tiny bit, the "connection" breaks, and the robot forgets what you want.
- Sticky-Glance Way: The robot creates an invisible "sticky zone" around every object. When you glance at an object, even for a split second, the robot doesn't just look at where your eye is right now. It looks at where your eye is going.
- If your eye moves toward the cup, the magnet gets stronger.
- If your eye jitters away but then moves back toward the cup, the magnet holds tight.
- It's like the intent "sticks" to the object. You don't need to stare; a quick, confident glance is enough to "lock on" to the target.
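The magnet idea can be sketched in a few lines of code. This is purely an illustration, not the paper's algorithm: the object positions, gain, and decay rate below are all made up. Each object keeps an intent score that grows when your gaze moves toward it and only fades slowly when you jitter away, so a brief wobble doesn't break the "lock".

```python
import math

# A toy sketch of the "sticky" idea, with made-up numbers.
# Each object accumulates an intent score: gaze motion toward an object
# raises its score, and jitters away only let the score decay slowly.

OBJECTS = {"cup": (0.8, 0.5), "bowl": (0.2, 0.5)}  # hypothetical 2D positions
GAIN, DECAY = 0.3, 0.95

def update_scores(scores, prev_gaze, gaze):
    """One update step: reward motion toward each object, decay the rest."""
    dx, dy = gaze[0] - prev_gaze[0], gaze[1] - prev_gaze[1]
    step = math.hypot(dx, dy)
    for name, (ox, oy) in OBJECTS.items():
        tx, ty = ox - prev_gaze[0], oy - prev_gaze[1]
        dist = math.hypot(tx, ty)
        # cosine between the gaze motion and the direction to the object
        toward = (dx * tx + dy * ty) / (step * dist) if step > 0 and dist > 0 else 0.0
        scores[name] = scores[name] * DECAY + GAIN * max(toward, 0.0)
    return scores

scores = {name: 0.0 for name in OBJECTS}
gaze_path = [(0.1, 0.5), (0.3, 0.5), (0.28, 0.52),  # tiny jitter away...
             (0.5, 0.5), (0.7, 0.5)]                # ...then back toward the cup
for prev, cur in zip(gaze_path, gaze_path[1:]):
    scores = update_scores(scores, prev, cur)

locked = max(scores, key=scores.get)  # the cup wins despite the jitter
print(locked, round(scores[locked], 2))
```

Notice that the mid-path jitter only shaves a little off the cup's score instead of resetting it to zero; that is the whole trick.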
2. The "Dance Partner" Analogy (Continuous Control)
Most robots work like a stop-and-go traffic light. You look at an object, wait for the robot to confirm, say "yes," and then the robot starts moving. This is slow and feels disjointed.
The Sticky-Glance system works like a dance partner:
- As soon as you glance at a cup, the robot starts slowly drifting toward it, even before you give the final command.
- It's "listening" to your gaze confidence. If you look unsure, it moves slowly. If you look confident, it speeds up.
- When you finally say "Pick that up," the robot is already halfway there. This saves nearly 10% of the time because the robot isn't waiting around; it's already in motion, ready to dance.
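Here is a toy sketch of that dance, again with invented numbers (the paper's actual controller is surely more sophisticated): the arm takes a small step toward the glanced object on every update, and the step size scales with how confident the gaze evidence looks at that moment.

```python
# A minimal sketch of confidence-scaled approach (not the paper's controller):
# the arm starts drifting toward the glanced object immediately, at a speed
# proportional to the current gaze confidence.

MAX_SPEED = 0.10  # metres per step; a made-up value

def approach_step(arm_pos, target_pos, confidence):
    """Move the arm one step toward the target, scaled by gaze confidence in [0, 1]."""
    speed = MAX_SPEED * max(0.0, min(confidence, 1.0))
    dx = target_pos[0] - arm_pos[0]
    dy = target_pos[1] - arm_pos[1]
    dist = (dx * dx + dy * dy) ** 0.5
    if dist <= speed:  # close enough: snap to the target
        return target_pos
    return (arm_pos[0] + speed * dx / dist, arm_pos[1] + speed * dy / dist)

arm = (0.0, 0.0)
target = (1.0, 0.0)
for confidence in [0.2, 0.5, 0.9, 0.9, 0.9]:  # confidence rising as the gaze settles
    arm = approach_step(arm, target, confidence)
print(arm)  # the arm has already covered about a third of the distance
```

By the time the spoken command arrives, the arm has quietly closed much of the gap, which is where the time savings come from.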
3. The "Two-Finger" Analogy (Gaze + Speech)
To make sure the robot doesn't grab the wrong thing, the system uses a "two-finger" approach:
- Finger 1 (Eyes): You use your eyes to say, "I'm interested in that specific block." (This handles the "Where?").
- Finger 2 (Voice): You use your voice to say, "Pick it up." (This handles the "What to do?").
This is much easier than navigating a complex on-screen menu with your eyes, or dwelling on one object for five seconds to trigger a menu. It feels natural: "Look at the cup, say 'pick'."
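The division of labor can be shown in a few lines. This is a hypothetical sketch, not the paper's interface: gaze supplies the object, speech supplies the verb, and no command is issued until both channels have committed.

```python
# A minimal sketch of the gaze-plus-voice split (hypothetical API):
# gaze answers "where?", speech answers "what to do?", and a command
# is only issued when both are present.

def fuse(gaze_target, spoken_verb):
    """Combine the gaze-selected object with the spoken action, if both exist."""
    if gaze_target is None or spoken_verb is None:
        return None  # wait: one channel hasn't committed yet
    return {"action": spoken_verb, "object": gaze_target}

# You look at the cup, then say "pick":
print(fuse("cup", "pick"))   # both channels present -> a full command
print(fuse(None, "pick"))    # speech alone is ambiguous -> no command
```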
4. The "Translator" (Matching Perspectives)
There is one more tricky part: The robot sees the world from its own camera (low down), and you see it from your glasses (high up). They might see different things.
The system acts like a super-fast translator. It takes the robot's 3D view of the room and instantly matches it with your view. Even if you are standing far away or at a weird angle, the system knows, "Oh, the red block you are looking at is the same red block the robot sees over there." This ensures the robot never gets confused about which object you mean.
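Under the hood, that translation is a coordinate transform. The sketch below is illustrative only: the rotation, offset, and object positions are all made up, and a real system would calibrate this transform rather than hard-code it. The robot's 3D object positions are mapped into the user's frame, where the gazed point is matched to the nearest object.

```python
import math

# A toy sketch of the "translator": objects live in the robot's camera frame;
# a rigid transform (rotation + translation, assumed already calibrated) maps
# them into the user's glasses frame, where the gazed point is matched to the
# nearest object. All numbers here are made up.

def to_user_frame(p, yaw=math.pi / 2, t=(1.0, -1.0, 0.0)):
    """Rotate about the z-axis by `yaw`, then translate: robot frame -> user frame."""
    x, y, z = p
    xr = math.cos(yaw) * x - math.sin(yaw) * y + t[0]
    yr = math.sin(yaw) * x + math.cos(yaw) * y + t[1]
    return (xr, yr, z + t[2])

robot_view = {"red block": (0.5, 0.0, 0.1), "blue block": (0.0, 0.5, 0.1)}
user_view = {name: to_user_frame(p) for name, p in robot_view.items()}

def match_gaze(gaze_point):
    """Return the object whose user-frame position is closest to the gazed point."""
    return min(user_view, key=lambda n: math.dist(user_view[n], gaze_point))

# The user gazes near where the red block appears from their viewpoint:
print(match_gaze((1.0, -0.5, 0.1)))
```

The same object ends up with one identity in both views, which is why the robot "never gets confused about which object you mean".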
Why Does This Matter?
The researchers tested this with people who have limited arm movement.
- Speed: Tasks were completed faster because the robot didn't wait for long stares.
- Accuracy: It was incredibly accurate (98% success rate), even when objects were moving or jumbled together.
- Mental Load: It felt much less tiring for the users. They didn't have to concentrate hard to "hold" a gaze; they could just glance naturally.
In a nutshell: Sticky-Glance turns a jittery, difficult eye-control system into a smooth, natural conversation between human and machine. It understands that a quick glance is a valid command, and it helps the robot get ready before you even finish speaking.