DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance

This paper introduces DriverGaze360, a large-scale 360-degree driver attention dataset and a corresponding panoramic prediction network (DriverGaze360-Net) that leverages object-level guidance to overcome the limitations of existing frontal-view methods and achieve state-of-the-art performance in modeling omnidirectional driver gaze behavior.

Shreedhar Govil, Didier Stricker, Jason Rambach

Published 2026-03-06

Imagine you are teaching a robot to drive a car. To do this safely, the robot needs to know not just where the car is going, but what the human driver is looking at. If the human glances at a child running near the curb, the robot should know that's important. If the human checks the rearview mirror before changing lanes, the robot needs to catch that too.

For a long time, scientists trying to teach robots this skill had a major blind spot: they were only looking through a narrow window.

Here is a simple breakdown of the new paper, "DriverGaze360," which changes the game.

1. The Problem: The "Tunnel Vision" Trap

Imagine trying to learn how to drive a car while wearing a blindfold that only lets you see a tiny rectangle directly in front of your nose. You wouldn't see the car merging from the left, the pedestrian stepping off the curb behind you, or the cyclist in your blind spot.

That's what previous research was like: models were trained on cameras that only looked straight ahead, so they missed the most critical moments of driving:

  • Checking the side mirrors before turning.
  • Looking back to merge onto a highway.
  • Watching a cyclist approach from the rear.

Because they couldn't see the whole picture, their AI models were "tunnel-visioned" and couldn't predict human behavior accurately in complex situations.

2. The Solution: A 360-Degree "Fishbowl"

The researchers built something new called DriverGaze360. Think of it as putting the driver inside a giant, transparent fishbowl where they can see everything around them—front, back, left, right, and up.

  • The Setup: They put 19 real human drivers in a high-tech driving simulator (like a video game, but it feels real).
  • The Gear: The drivers wore special glasses that tracked exactly where their eyes moved, 120 times per second.
  • The View: Instead of one camera, they used five cameras to stitch together a full 360-degree view.
  • The Data: They collected about 1 million snapshots of where drivers looked. This includes boring highway driving, scary near-miss accidents, and tricky city turns.

The Result: For the first time, we have a massive library of data showing exactly how humans look at the entire world while driving, not just the road ahead.
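To get a feel for what "360-degree gaze data" looks like in practice, here is a small sketch of a standard equirectangular projection: a 3D gaze direction (from the eye-tracking glasses) mapped onto pixel coordinates in a stitched 360-degree panorama. The paper's exact coordinate convention and image resolution are not given in this summary, so the axes and dimensions below are assumptions for illustration.

```python
import math

def gaze_to_equirect(direction, width=3840, height=1920):
    """Map a 3D gaze direction (x, y, z) to pixel coordinates in an
    equirectangular 360-degree panorama.

    Assumed convention for this sketch: +z is straight ahead,
    +x is the driver's right, +y is up, and the image centre is
    the forward view.
    """
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    lon = math.atan2(x, z)   # -pi..pi, 0 = straight ahead
    lat = math.asin(y)       # -pi/2..pi/2, 0 = horizon
    u = (lon / (2 * math.pi) + 0.5) * width   # horizontal pixel
    v = (0.5 - lat / math.pi) * height        # vertical pixel
    return u, v

# Looking straight ahead lands in the centre of the panorama;
# looking 90 degrees to the right lands three quarters of the way across.
print(gaze_to_equirect((0.0, 0.0, 1.0)))  # centre of the image
print(gaze_to_equirect((1.0, 0.0, 0.0)))  # right-hand side
```

The point of this representation is that "looking over the shoulder" is just another pixel location, not a gap in the data as it would be with a single front-facing camera.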

3. The Brain: The "Detective" AI

Just having the data isn't enough; you need a smart brain to understand it. The researchers created a new AI model called DriverGaze360-Net.

Here is the clever trick they used:

  • Old Way: The AI tried to guess where the driver was looking with no hints at all, like a detective searching a dark room without any clues.
  • New Way: They gave the AI a second job. While it guesses where the driver is looking, it also has to identify what objects the driver is looking at (like a red car, a stop sign, or a pedestrian).

The Analogy: Imagine you are teaching a child to find a hidden toy.

  • Old Method: You just say, "Look here!" (The child guesses randomly).
  • New Method: You say, "Look at the red ball!" (The child knows exactly what to focus on).

By forcing the AI to identify the specific objects (the "red balls"), it becomes much better at predicting where the human's eyes will go. It learns that "drivers look at cars, not at the sky," or "drivers look at pedestrians, not at the clouds."
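The "second job" described above is a classic auxiliary-task setup: the network is trained on its main goal (predicting the gaze map) plus an extra supervised signal (naming the object under the gaze). The paper's actual loss functions and weighting are not spelled out in this summary, so the following is a toy sketch of the idea, with `alpha` as a hypothetical weighting hyperparameter.

```python
import math

def gaze_loss(pred_map, true_map):
    """Main task: mean squared error between the predicted and
    ground-truth attention maps (flattened to lists here)."""
    n = len(pred_map)
    return sum((p - t) ** 2 for p, t in zip(pred_map, true_map)) / n

def object_loss(pred_probs, true_class):
    """Auxiliary task: cross-entropy for the 'what object is the
    driver looking at?' head (car, pedestrian, sign, ...)."""
    return -math.log(pred_probs[true_class] + 1e-12)

def combined_loss(pred_map, true_map, pred_probs, true_class, alpha=0.5):
    """Total training signal: gaze prediction plus object-level
    guidance, weighted by the (assumed) hyperparameter alpha."""
    return gaze_loss(pred_map, true_map) + alpha * object_loss(pred_probs, true_class)

# A network that nails the gaze map but hedges on the object class
# still pays a small penalty on the auxiliary term.
total = combined_loss([1.0, 0.0], [1.0, 0.0], [0.1, 0.9], true_class=1)
print(total)
```

Because the auxiliary term only rewards probability mass on the correct object, the shared features are pushed toward semantically meaningful regions, which is exactly the "look at the red ball" effect described above.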

4. Why This Matters

This isn't just about better video games. This is about safer self-driving cars.

  • Explainable AI: When a self-driving car makes a decision, we want to know why. If the car brakes suddenly, we want it to say, "I stopped because I saw the human driver looking at a child crossing the street." This new system helps the car understand that logic.
  • Safety: It helps cars anticipate human mistakes. If the car knows the human is distracted or looking the wrong way, it can step in to prevent an accident.
  • Realism: Because this data covers the whole view (including the rear and sides), the AI won't get surprised when a car pulls out from a blind spot.

In a Nutshell

The researchers realized that to teach a robot to drive like a human, they had to stop looking through a keyhole and start looking through a fisheye lens. They built a massive dataset of 360-degree eye-tracking and created a smart AI that learns by identifying the specific things humans care about. This makes future autonomous vehicles more aware, safer, and easier to trust.