DHECA-SuperGaze: Dual Head-Eye Cross-Attention and Super-Resolution for Unconstrained Gaze Estimation

This paper introduces DHECA-SuperGaze, a deep learning framework that enhances unconstrained gaze estimation by integrating super-resolution for low-quality images and a dual head-eye cross-attention module to model head-eye interactions, while also correcting annotation errors in the Gaze360 dataset to achieve state-of-the-art accuracy and robust generalization.

Franko Šikić, Donik Vršnak, Sven Lončarić

Published 2026-03-06

Imagine you are trying to guess where a person is looking just by looking at a photo of their face. This is called gaze estimation. It's useful for things like making sure a student is paying attention during an online exam, or helping a tired driver stay focused on the road.

However, doing this in the real world is tricky. Photos taken "in the wild" (outside a studio) are often blurry, low-resolution, or taken from weird angles. Plus, people move their heads and eyes independently, which confuses computers.

This paper introduces a new, super-smart system called DHECA-SuperGaze that solves these problems using three main tricks. Think of it as upgrading a detective's toolkit.

1. The "Magic Lens" (Super-Resolution)

The Problem: Imagine trying to read a tiny, blurry sign from a mile away. You can't make out the letters, so you can't guess what it says. Similarly, if a camera captures a face from far away, the eyes look like fuzzy smudges. The computer can't tell where the pupil is pointing.

The Solution: The authors added a Super-Resolution (SR) module. Think of this as a magical lens or a high-end photo editor that instantly takes a blurry, low-quality image and "hallucinates" the missing details to make it sharp and crisp.

  • How they used it: They didn't just sharpen everything indiscriminately. They found that super-resolving the head image (the whole face) gives the computer the best context, while leaving the eye crops at their original resolution worked best. It's like sharpening the map of the city so you can better understand where the specific house (the eye) is located.

2. The "Super-Team" (Dual Head-Eye Cross-Attention)

The Problem: In the past, computers looked at the head and the eyes separately, or they just mashed the two pieces of information together. But the head and eyes talk to each other! If your head is turned left, but your eyes are looking right, you are looking somewhere else entirely. Old systems missed this conversation.

The Solution: The paper introduces a module called DHECA (Dual Head-Eye Cross-Attention).

  • The Analogy: Imagine a detective (the Head) and a witness (the Eye) trying to solve a crime.
    • In old systems, the detective would write a report, the witness would write a report, and a boss would just add them together.
    • In this new system, the detective and witness sit in a room and interview each other. The detective says, "The head is turned to the left." The witness replies, "But the eyes are glancing to the right, so the person must be looking back over their shoulder."
    • This "Cross-Attention" allows the computer to constantly check the head's position against the eye's position, refining its guess until it gets it right. It's a two-way conversation that leads to a much smarter conclusion.
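The "two-way conversation" above is, mechanically, scaled dot-product cross-attention: head features act as queries over eye features, and vice versa. Here is a minimal, dependency-free sketch of that exchange; the toy 4-dimensional feature tokens are invented for illustration, and the real DHECA module uses learned projections and deep feature maps rather than raw lists:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys,
    producing a weighted average of the values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# Hypothetical 4-dim feature tokens from the head and eye branches.
head_tokens = [[0.9, 0.1, 0.0, 0.2], [0.3, 0.8, 0.1, 0.0]]
eye_tokens  = [[0.2, 0.7, 0.9, 0.1], [0.5, 0.1, 0.3, 0.8]]

# Dual (bidirectional) exchange: head attends to eyes, eyes attend to head.
head_refined = cross_attention(head_tokens, eye_tokens, eye_tokens)
eye_refined  = cross_attention(eye_tokens, head_tokens, head_tokens)
```

Each refined head token is a convex combination of the eye tokens (and vice versa), which is the "constant checking" described above: neither branch's features are final until they have been weighted against the other's.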

3. The "Clean-Up Crew" (Fixing the Dataset)

The Problem: To teach a computer, you need a huge textbook of examples (a dataset). The authors used a famous textbook called Gaze360. But, they discovered a huge typo in the book.

  • The Analogy: Imagine a teacher giving a student a test. The teacher points to a picture of a cat and says, "This is a dog." If the student studies this, they will fail the real test. The Gaze360 dataset had many pictures where the "face box" (the label saying where the face is) was actually pointing to a different person in the background, not the main subject.
  • The Solution: The authors acted as a proofreader. They scanned the entire dataset, found the "typos" (the wrong labels), and fixed them. They created a "Rectified" version of the dataset.
  • The Result: When they trained their new system on this clean, corrected textbook, every system got smarter, not just theirs. It proved that garbage in leads to garbage out, and cleaning the data is half the battle.
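As a rough illustration of how such "typos" can be flagged at scale, one common approach (a sketch, not necessarily the authors' exact procedure) is to compare each annotated face box against an independent face detector's output using intersection-over-union (IoU), and send low-overlap cases for manual review:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def is_suspect(annotated_box, detected_box, threshold=0.5):
    """Flag an annotation whose box barely overlaps the detected face."""
    return iou(annotated_box, detected_box) < threshold

# A box that drifted onto a bystander barely overlaps the true face:
print(is_suspect((0, 0, 10, 10), (40, 40, 55, 55)))  # True
# A box that merely jitters by a pixel is fine:
print(is_suspect((0, 0, 10, 10), (1, 1, 11, 11)))    # False
```

The 0.5 threshold here is an arbitrary illustrative choice; in practice it would be tuned against a manually verified sample.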

The Results: How Good Is It?

The authors tested their new "Super-Detective" against all the other top detectives (previous methods) in two ways:

  1. Same Room Test (within-dataset): Training and testing on the same dataset.
  2. New Room Test (cross-dataset): Training on one dataset and testing on a completely different one (to see if it can generalize).

The Verdict:

  • Accuracy: Their system was the most accurate, reducing the average angular error by nearly half a degree compared to the previous best. In gaze tracking, where state-of-the-art errors are only a few degrees to begin with, half a degree is a substantial jump in precision.
  • Robustness: Even when they tested it on data it had never seen before, it still outperformed everyone else.
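The "guessing error" above is mean angular error, the standard gaze metric: the angle between the predicted and ground-truth 3D gaze direction vectors. A minimal sketch of how it is computed:

```python
import math

def angular_error_deg(pred, truth):
    """Angle in degrees between predicted and true 3D gaze vectors."""
    dot = sum(p * t for p, t in zip(pred, truth))
    norm = (math.sqrt(sum(p * p for p in pred))
            * math.sqrt(sum(t * t for t in truth)))
    # Clamp to [-1, 1] to guard against floating-point drift.
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))

# Identical directions give zero error; perpendicular ones give 90 degrees.
print(angular_error_deg((0, 0, -1), (0, 0, -1)))  # zero error
print(angular_error_deg((1, 0, 0), (0, 1, 0)))    # 90-degree error
```

A dataset-level score is just this value averaged over all test images, which is the number the half-degree improvement refers to.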

Summary

DHECA-SuperGaze is like giving a computer:

  1. High-definition glasses (Super-Resolution) to see blurry faces clearly.
  2. A communication brain (Cross-Attention) that understands how the head and eyes work together.
  3. A corrected textbook (Rectified Data) so it learns from the truth, not mistakes.

By combining these three elements, the system can now guess where someone is looking with incredible accuracy, even in messy, real-world situations.