GazeMoE: Perception of Gaze Target with Mixture-of-Experts

GazeMoE is a novel end-to-end framework that uses Mixture-of-Experts modules to adaptively fuse multi-modal cues from a frozen vision foundation model. By addressing class imbalance with specialized loss functions and hardening training with data augmentation, it achieves state-of-the-art performance in human gaze target estimation.

Zhuangzhuang Dai, Zhongxi Lu, Vincent G. Zakka, Luis J. Manso, Jose M Alcaraz Calero, Chen Li

Published 2026-03-09

Imagine you are a robot trying to understand what a human is looking at. You can see their eyes, their head, and their hands, but figuring out exactly what they are staring at is like trying to solve a puzzle where some pieces are missing, some are blurry, and sometimes the person is looking at something completely outside the picture frame.

This paper introduces GazeMoE, a new "super-smart" system designed to solve this puzzle better than any previous robot brain. Here is how it works, explained simply:

1. The Problem: The "One-Size-Fits-All" Brain Fails

Imagine a robot with a single, rigid brain. If a person is looking at a cat, the robot uses its "cat-detecting" neurons. If the person is looking at a car, it uses its "car-detecting" neurons.
But in the real world, things get messy:

  • Sometimes the person's eyes are hidden (occluded).
  • Sometimes the camera is a fisheye lens, making everything look warped.
  • Sometimes the person is a child, whose gaze is harder to predict than an adult's.
  • Sometimes they are looking outside the camera's view entirely.

Old systems tried to use one giant brain to handle all these situations. It was like trying to use a single Swiss Army knife to fix a watch, cut a steak, and hammer a nail. It worked okay, but it wasn't perfect.

2. The Solution: The "Specialist Team" (Mixture-of-Experts)

The authors realized that instead of one giant brain, the robot needs a team of specialists. This is the core idea of Mixture-of-Experts (MoE).

Think of GazeMoE as a high-end restaurant kitchen:

  • The Frozen Chef (DINOv2): First, the system uses a pre-trained, frozen "base chef" (a massive AI model called DINOv2) that has already seen millions of images. This chef is great at seeing general things like "that's a face" or "that's a tree," but it doesn't know how to cook the specific "gaze dish" yet.
  • The Specialist Chefs (The Experts): GazeMoE adds a special team of four "routed experts" to the kitchen.
    • Expert 1: Specializes in Eyes.
    • Expert 2: Specializes in Head Position.
    • Expert 3: Specializes in Hand Gestures.
    • Expert 4: Specializes in Context (what's happening in the background).
  • The Head Chef (The Gating Mechanism): This is the smart manager. When a new image comes in, the Head Chef looks at the scene and asks: "Do we have eyes visible? Is the head tilted? Is the background chaotic?"
    • If the eyes are hidden, the Head Chef tells the "Eye Expert" to take a break and asks the "Head Expert" and "Context Expert" to work harder.
    • If the scene is a child playing, it might call on a different mix of experts.

By only waking up the top two specialists needed for that specific moment, the system stays fast and efficient while remaining highly accurate. It adapts to the situation, just as a human would. A minimal sketch of this top-2 routing idea appears below.
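To make the routing idea concrete, here is a minimal PyTorch sketch of a top-2 gated Mixture-of-Experts layer. Everything here (the `TopKMoE` name, the layer sizes, treating each expert as a small MLP over pooled backbone features) is an illustrative assumption, not the paper's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer (illustrative sketch,
    not the GazeMoE authors' implementation)."""

    def __init__(self, dim: int, num_experts: int = 4, k: int = 2):
        super().__init__()
        self.k = k
        # The "specialist chefs": small MLPs standing in for the routed experts.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        ])
        # The "head chef": scores how relevant each expert is to this input.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim) feature vectors, e.g. pooled frozen-backbone features.
        logits = self.gate(x)                        # (batch, num_experts)
        scores, idx = logits.topk(self.k, dim=-1)    # wake only the top-k experts
        weights = F.softmax(scores, dim=-1)          # renormalise their scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # samples routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: fuse 768-d features (a stand-in for frozen DINOv2 tokens) with
# four experts, of which only two fire per sample.
moe = TopKMoE(dim=768, num_experts=4, k=2)
fused = moe(torch.randn(8, 768))   # -> shape (8, 768)
```

The efficiency win comes from the routing itself: only the selected experts run a forward pass for each sample, so adding more specialists barely increases the per-image compute.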

3. The Training: Learning from Mistakes

To make this team work perfectly, the authors had to teach them two tricky lessons:

  • The "Out-of-Frame" Problem: In many datasets, most people are looking at things inside the photo. Very few are looking outside it. This is like a teacher who only asks questions about apples, but then suddenly asks about oranges. The robot gets confused.
    • The Fix: They used a special "Focal Loss" penalty. Imagine a teacher who gives extra credit for getting the rare "orange" questions right, forcing the robot to pay extra attention to the difficult, rare cases (a sketch of this loss follows the list).
  • The "Augmentation" Gym: To make the robot tough, they didn't just show it perfect photos. They threw it into a "gym" where they:
    • Cropped the images randomly.
    • Flipped them upside down.
    • Changed the colors and contrast.
    • Made them look like old, grainy photos.
      This is like training a marathon runner on muddy, rainy, and hilly tracks so they can run perfectly on a sunny day.
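Here is a minimal PyTorch sketch of the focal-loss idea for the in-frame vs. out-of-frame decision. This is the standard binary focal loss (down-weighting easy examples by (1 − p_t)^γ), not the authors' exact formulation; the gamma and alpha values below are common defaults assumed for illustration:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits: torch.Tensor,
                      targets: torch.Tensor,
                      gamma: float = 2.0,
                      alpha: float = 0.25) -> torch.Tensor:
    """Focal loss for the in-frame vs. out-of-frame decision.

    Down-weights easy, well-classified examples by (1 - p_t)^gamma so the
    rare "looking outside the frame" cases dominate the gradient.
    gamma/alpha are common defaults, not necessarily the paper's values.
    """
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

# Usage: 1 = gaze target outside the frame (rare), 0 = inside (common).
logits = torch.randn(16)
labels = torch.bernoulli(torch.full((16,), 0.1))   # imbalanced labels
loss = binary_focal_loss(logits, labels)
```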
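And here is what the "augmentation gym" might look like as a torchvision pipeline. The exact recipe (crop scale, flip direction, noise model) is an assumption for illustration, not taken from the paper, and a real gaze pipeline must also transform the gaze point and head box together with the image, which plain image transforms do not handle:

```python
import torchvision.transforms as T

# Representative "augmentation gym" (illustrative; the paper's exact recipe
# may differ). Gaze annotations would need matching geometric transforms.
train_augment = T.Compose([
    T.RandomResizedCrop(448, scale=(0.7, 1.0)),   # random crops
    T.RandomHorizontalFlip(p=0.5),                # mirror left-to-right
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.RandomGrayscale(p=0.1),                     # "old, grainy photo" feel
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    T.ToTensor(),
])
```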

4. The Results: The New Champion

The team tested GazeMoE in four different "arenas" (benchmark settings):

  1. Standard TV/Movie scenes.
  2. Children playing (who are notoriously hard to predict).
  3. 360-degree Fisheye lenses (where the world looks like a bubble).
  4. Zero-shot testing (scenarios the robot had never seen before).

The verdict? GazeMoE beat every other robot brain in the competition.

  • It was more accurate at guessing where people were looking.
  • It was better at realizing when someone was looking away from the camera.
  • It handled the weird, distorted fisheye images better than anyone else.
  • It did all this while running at about 13 frames per second, fast enough for a robot to interact with a human in real time.

Summary

GazeMoE is like upgrading a robot's brain from a single, stubborn generalist to a flexible team of specialists. By letting the right expert handle the right part of the problem, and by training them on messy, difficult scenarios, the robot can finally understand human attention with human-like reliability. This is a huge step forward for robots that need to work alongside us, whether in factories, homes, or hospitals.