GazeMoE: Perception of Gaze Target with Mixture-of-Experts

GazeMoE is a novel end-to-end framework that uses Mixture-of-Experts modules to adaptively fuse multi-modal cues from a frozen vision foundation model. By addressing class imbalance with specialized loss functions and hardening training with data augmentation, it achieves state-of-the-art performance in human gaze target estimation.

Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero, Chen Li

Published 2026-03-09

Imagine you are a robot trying to understand what a human is looking at. You can see their eyes, their head, and their hands, but figuring out exactly what they are staring at is like trying to solve a puzzle where some pieces are missing, some are blurry, and sometimes the person is looking at something completely outside the picture frame.

This paper introduces GazeMoE, a new "super-smart" system designed to solve this puzzle better than any previous robot brain. Here is how it works, explained simply:

1. The Problem: The "One-Size-Fits-All" Brain Fails

Imagine a robot with a single, rigid brain. If a person is looking at a cat, the robot uses its "cat-detecting" neurons. If the person is looking at a car, it uses its "car-detecting" neurons.
But in the real world, things get messy:

  • Sometimes the person's eyes are hidden (occluded).
  • Sometimes the camera is a fisheye lens, making everything look warped.
  • Sometimes the person is a child, whose gaze is harder to predict than an adult's.
  • Sometimes they are looking outside the camera's view entirely.

Old systems tried to use one giant brain to handle all these situations. It was like trying to use a single Swiss Army knife to fix a watch, cut a steak, and hammer a nail. It worked okay, but it wasn't perfect.

2. The Solution: The "Specialist Team" (Mixture-of-Experts)

The authors realized that instead of one giant brain, the robot needs a team of specialists. This is the core idea of Mixture-of-Experts (MoE).

Think of GazeMoE as a high-end restaurant kitchen:

  • The Frozen Chef (DINOv2): First, the system uses a pre-trained, frozen "base chef" (a massive AI model called DINOv2) that has already seen millions of images. This chef is great at seeing general things like "that's a face" or "that's a tree," but it doesn't know how to cook the specific "gaze dish" yet.
  • The Specialist Chefs (The Experts): GazeMoE adds a special team of four "routed experts" to the kitchen.
    • Expert 1: Specializes in Eyes.
    • Expert 2: Specializes in Head Position.
    • Expert 3: Specializes in Hand Gestures.
    • Expert 4: Specializes in Context (what's happening in the background).
  • The Head Chef (The Gating Mechanism): This is the smart manager. When a new image comes in, the Head Chef looks at the scene and asks: "Do we have eyes visible? Is the head tilted? Is the background chaotic?"
    • If the eyes are hidden, the Head Chef tells the "Eye Expert" to take a break and asks the "Head Expert" and "Context Expert" to work harder.
    • If the scene is a child playing, it might call on a different mix of experts.

By only waking up the top two specialists needed for that specific moment, the system stays fast and efficient while remaining highly accurate. It adapts to the situation, just as a human would. A minimal sketch of this top-2 routing idea appears below.
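To make the routing idea concrete, here is a minimal PyTorch sketch of a top-2 gated Mixture-of-Experts layer. Everything here (the `TopKMoE` name, the layer sizes, treating each expert as a small MLP over pooled backbone features) is an illustrative assumption, not the paper's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer (illustrative sketch,
    not the GazeMoE authors' implementation)."""

    def __init__(self, dim: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        # The "specialist chefs": small MLPs standing in for the routed experts.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        ])
        # The "head chef": scores how relevant each expert is to this input.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) feature vectors, e.g. pooled frozen-backbone features.
        logits = self.gate(x)                        # (batch, num_experts)
        scores, idx = logits.topk(self.k, dim=-1)    # wake only the top-k experts
        weights = F.softmax(scores, dim=-1)          # renormalise their scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # samples routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: fuse 768-d features (a stand-in for frozen DINOv2 tokens) with
# four experts, of which only two fire per sample.
moe = TopKMoE(dim=768, num_experts=4, k=2)
fused = moe(torch.randn(8, 768))   # -> shape (8, 768)
```

The efficiency win comes from the routing itself: only the selected experts run a forward pass for each sample, so adding more specialists barely increases the per-image compute.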

3. The Training: Learning from Mistakes

To make this team work perfectly, the authors had to teach them two tricky lessons:

  • The "Out-of-Frame" Problem: In many datasets, most people are looking at things inside the photo. Very few are looking outside it. This is like a teacher who only asks questions about apples, but then suddenly asks about oranges. The robot gets confused.
    • The Fix: They used a special "Focal Loss" penalty. Imagine a teacher who gives extra credit for getting the rare "orange" questions right, forcing the robot to pay extra attention to the difficult, rare cases (a sketch of this loss follows the list).
  • The "Augmentation" Gym: To make the robot tough, they didn't just show it perfect photos. They threw it into a "gym" where they:
    • Cropped the images randomly.
    • Flipped them upside down.
    • Changed the colors and contrast.
    • Made them look like old, grainy photos.
      This is like training a marathon runner on muddy, rainy, and hilly tracks so they can run perfectly on a sunny day.
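Here is a minimal PyTorch sketch of the focal-loss idea for the in-frame vs. out-of-frame decision. This is the standard binary focal loss (down-weighting easy examples by (1 − p_t)^γ), not the authors' exact formulation; the gamma and alpha values below are common defaults assumed for illustration:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits: torch.Tensor,
                      targets: torch.Tensor,
                      gamma: float = 2.0,
                      alpha: float = 0.25) -> torch.Tensor:
    """Focal loss for the in-frame vs. out-of-frame decision.

    Down-weights easy, well-classified examples by (1 - p_t)^gamma so the
    rare "looking outside the frame" cases dominate the gradient.
    gamma/alpha are common defaults, not necessarily the paper's values.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Usage: 1 = gaze target outside the frame (rare), 0 = inside (common).
logits = torch.randn(16)
labels = torch.bernoulli(torch.full((16,), 0.1))   # imbalanced labels
loss = binary_focal_loss(logits, labels)
```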
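And here is what the "augmentation gym" might look like as a torchvision pipeline. The exact recipe (crop scale, flip direction, noise model) is an assumption for illustration, not taken from the paper, and a real gaze pipeline must also transform the gaze point and head box together with the image, which plain image transforms do not handle:

```python
import torchvision.transforms as T

# Representative "augmentation gym" (illustrative; the paper's exact recipe
# may differ). Gaze annotations would need matching geometric transforms.
train_augment = T.Compose([
    T.RandomResizedCrop(448, scale=(0.7, 1.0)),   # random crops
    T.RandomHorizontalFlip(p=0.5),                # mirror left-to-right
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.RandomGrayscale(p=0.1),                     # "old, grainy photo" feel
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    T.ToTensor(),
])
```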

4. The Results: The New Champion

The team tested GazeMoE in four different "arenas" (benchmark settings):

  1. Standard TV/Movie scenes.
  2. Children playing (who are notoriously hard to predict).
  3. 360-degree Fisheye lenses (where the world looks like a bubble).
  4. Zero-shot testing (scenarios the robot had never seen before).

The verdict? GazeMoE beat every other robot brain in the competition.

  • It was more accurate at guessing where people were looking.
  • It was better at realizing when someone was looking away from the camera.
  • It handled the weird, distorted fisheye images better than anyone else.
  • It did all this while running at about 13 frames per second, fast enough for a robot to interact with a human in real time.

Summary

GazeMoE is like upgrading a robot's brain from a single, stubborn generalist to a flexible team of specialists. By letting the right expert handle the right part of the problem, and by training them on messy, difficult scenarios, the robot can finally understand human attention with human-like reliability. This is a huge step forward for robots that need to work alongside us, whether in factories, homes, or hospitals.