DHECA-SuperGaze: Dual Head-Eye Cross-Attention and Super-Resolution for Unconstrained Gaze Estimation

This paper introduces DHECA-SuperGaze, a deep learning framework that enhances unconstrained gaze estimation by integrating super-resolution for low-quality images and a dual head-eye cross-attention module to model head-eye interactions, while also correcting annotation errors in the Gaze360 dataset to achieve state-of-the-art accuracy and robust generalization.

Franko Šikić, Donik Vršnak, Sven Lončarić

Published 2026-03-06

Imagine you are trying to guess where a person is looking just by looking at a photo of their face. This is called gaze estimation. It's useful for things like making sure a student is paying attention during an online exam, or helping a tired driver stay focused on the road.

However, doing this in the real world is tricky. Photos taken "in the wild" (outside a studio) are often blurry, low-resolution, or taken from weird angles. Plus, people move their heads and eyes independently, which confuses computers.

This paper introduces a new, super-smart system called DHECA-SuperGaze that solves these problems using three main tricks. Think of it as upgrading a detective's toolkit.

1. The "Magic Lens" (Super-Resolution)

The Problem: Imagine trying to read a tiny, blurry sign from a mile away. You can't make out the letters, so you can't guess what it says. Similarly, if a camera captures a face from far away, the eyes look like fuzzy smudges. The computer can't tell where the pupil is pointing.

The Solution: The authors added a Super-Resolution (SR) module. Think of this as a magical lens or a high-end photo editor that instantly takes a blurry, low-quality image and "hallucinates" the missing details to make it sharp and crisp.

  • How they used it: They didn't just sharpen everything indiscriminately. They found that super-resolving the head image (the whole face) gives the computer the best context, while leaving the eye crops at their original resolution worked best. It's like sharpening the map of the city so you can better understand where the specific house (the eye) is located.

2. The "Super-Team" (Dual Head-Eye Cross-Attention)

The Problem: In the past, computers looked at the head and the eyes separately, or they just mashed the two pieces of information together. But the head and eyes talk to each other! If your head is turned left, but your eyes are looking right, you are looking somewhere else entirely. Old systems missed this conversation.

The Solution: The paper introduces a module called DHECA (Dual Head-Eye Cross-Attention).

  • The Analogy: Imagine a detective (the Head) and a witness (the Eye) trying to solve a crime.
    • In old systems, the detective would write a report, the witness would write a report, and a boss would just add them together.
    • In this new system, the detective and witness sit in a room and interview each other. The detective says, "The head is turned to the left." The witness replies, "But the eyes are glancing to the right, so the person must be looking back over their shoulder."
    • This "Cross-Attention" allows the computer to constantly check the head's position against the eye's position, refining its guess until it gets it right. It's a two-way conversation that leads to a much smarter conclusion.
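The "two-way conversation" above is, mechanically, scaled dot-product cross-attention: head features act as queries over eye features, and vice versa. Here is a minimal, dependency-free sketch of that exchange; the toy 4-dimensional feature tokens are invented for illustration, and the real DHECA module uses learned projections and deep feature maps rather than raw lists:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys,
    producing a weighted average of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical 4-dim feature tokens from the head and eye branches.
head_tokens = [[0.9, 0.1, 0.0, 0.2], [0.3, 0.8, 0.1, 0.0]]
eye_tokens  = [[0.2, 0.7, 0.9, 0.1], [0.5, 0.1, 0.3, 0.8]]

# Dual (bidirectional) exchange: head attends to eyes, eyes attend to head.
head_refined = cross_attention(head_tokens, eye_tokens, eye_tokens)
eye_refined  = cross_attention(eye_tokens, head_tokens, head_tokens)
```

Each refined head token is a convex combination of the eye tokens (and vice versa), which is the "constant checking" described above: neither branch's features are final until they have been weighted against the other's.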

3. The "Clean-Up Crew" (Fixing the Dataset)

The Problem: To teach a computer, you need a huge textbook of examples (a dataset). The authors used a famous textbook called Gaze360. But, they discovered a huge typo in the book.

  • The Analogy: Imagine a teacher giving a student a test. The teacher points to a picture of a cat and says, "This is a dog." If the student studies this, they will fail the real test. The Gaze360 dataset had many pictures where the "face box" (the label saying where the face is) was actually pointing to a different person in the background, not the main subject.
  • The Solution: The authors acted as a proofreader. They scanned the entire dataset, found the "typos" (the wrong labels), and fixed them. They created a "Rectified" version of the dataset.
  • The Result: When they trained their new system on this clean, corrected textbook, every system got smarter, not just theirs. It proved that garbage in leads to garbage out, and cleaning the data is half the battle.
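As a rough illustration of how such "typos" can be flagged at scale, one common approach (a sketch, not necessarily the authors' exact procedure) is to compare each annotated face box against an independent face detector's output using intersection-over-union (IoU), and send low-overlap cases for manual review:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def is_suspect(annotated_box, detected_box, threshold=0.5):
    """Flag an annotation whose box barely overlaps the detected face."""
    return iou(annotated_box, detected_box) < threshold

# A box that drifted onto a bystander barely overlaps the true face:
print(is_suspect((0, 0, 10, 10), (40, 40, 55, 55)))  # True
# A box that merely jitters by a pixel is fine:
print(is_suspect((0, 0, 10, 10), (1, 1, 11, 11)))    # False
```

The 0.5 threshold here is an arbitrary illustrative choice; in practice it would be tuned against a manually verified sample.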

The Results: How Good Is It?

The authors tested their new "Super-Detective" against all the other top detectives (previous methods) in two ways:

  1. Same Room Test (within-dataset): Training and testing on the same dataset.
  2. New Room Test (cross-dataset): Training on one dataset and testing on a completely different one (to see if it can generalize).

The Verdict:

  • Accuracy: Their system was the most accurate, reducing the average angular error by nearly half a degree compared to the previous best. In gaze tracking, where state-of-the-art errors are only a few degrees to begin with, half a degree is a substantial jump in precision.
  • Robustness: Even when they tested it on data it had never seen before, it still outperformed everyone else.
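The "guessing error" above is mean angular error, the standard gaze metric: the angle between the predicted and ground-truth 3D gaze direction vectors. A minimal sketch of how it is computed:

```python
import math

def angular_error_deg(pred, truth):
    """Angle in degrees between predicted and true 3D gaze vectors."""
    dot = sum(p * t for p, t in zip(pred, truth))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in truth)))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

# Identical directions give zero error; perpendicular ones give 90 degrees.
print(angular_error_deg((0, 0, -1), (0, 0, -1)))  # zero error
print(angular_error_deg((1, 0, 0), (0, 1, 0)))    # 90-degree error
```

A dataset-level score is just this value averaged over all test images, which is the number the half-degree improvement refers to.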

Summary

DHECA-SuperGaze is like giving a computer:

  1. High-definition glasses (Super-Resolution) to see blurry faces clearly.
  2. A communication brain (Cross-Attention) that understands how the head and eyes work together.
  3. A corrected textbook (Rectified Data) so it learns from the truth, not mistakes.

By combining these three elements, the system can now guess where someone is looking with incredible accuracy, even in messy, real-world situations.