Imagine you are learning to drive a car, but instead of a human instructor, you have a super-smart computer trying to figure out how humans think. The big question is: How does a driver know when they are in danger?
Most self-driving cars today are like robots that only look for things that might physically hit them (like a wall or another car). But this paper argues that real drivers are more like detectives. They don't just wait for a crash to happen; they read the room. They look at a pedestrian's eyes to see if they are paying attention, or they glance at a truck blocking the road and decide to swerve before anything bad happens.
Here is a simple breakdown of what this research team (from Honda Research Institute) did to teach computers this "detective" skill.
1. The Problem: The "Blind" Robot
Current self-driving systems are great at math, but they struggle with human intuition.
- The Old Way: "If a ball rolls into the street, a child might follow. Stop!" (This is just reacting to objects).
- The Real Way: "That cyclist is looking at me. They know I'm here. I can slow down gently. But that other cyclist is staring at their phone and drifting into the street without checking! I need to slam on the brakes!"
The problem is, we didn't have enough "training data" to teach computers this subtle stuff. Existing datasets were like a library with only one book on driving; they missed the messy, real-world details like "is that pedestrian looking at me?"
2. The Solution: The "RAID" Library
The team created a massive new dataset called RAID (Risk Assessment In Driving scenes). Think of this as a giant, high-definition movie library specifically designed to teach AI how to spot danger.
- What's inside? Over 4,600 video clips of real driving in San Francisco.
- The Special Sauce: Unlike other libraries, RAID includes labels for:
- The Driver's Reaction: Did they swerve? Did they stop?
- The "Risk" Object: What caused the reaction? (A jaywalker? A parked car with an open door?)
- The Pedestrian's Eyes: This is the game-changer. They labeled whether each pedestrian was looking at the car or not.
It's like having a driving simulator where every single video tells you not just what happened, but why the driver reacted the way they did.
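To make the label structure concrete, here is a minimal sketch of what one annotated clip might carry. The field names and categories are illustrative guesses, not the actual RAID schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of one annotated driving clip. Field names are
# invented for illustration; the real RAID schema may differ.

@dataclass
class ObjectLabel:
    track_id: int
    category: str                 # e.g. "pedestrian", "parked_car"
    is_risk_object: bool          # did this object cause the driver's reaction?
    looking_at_ego: bool = False  # pedestrian attention label

@dataclass
class DrivingClip:
    clip_id: str
    driver_reaction: str          # e.g. "stop", "swerve", "none"
    objects: List[ObjectLabel] = field(default_factory=list)

    def risk_objects(self) -> List[ObjectLabel]:
        """Return only the objects labeled as causing the reaction."""
        return [o for o in self.objects if o.is_risk_object]

clip = DrivingClip(
    clip_id="sf_0001",
    driver_reaction="stop",
    objects=[
        ObjectLabel(1, "pedestrian", is_risk_object=True, looking_at_ego=False),
        ObjectLabel(2, "parked_car", is_risk_object=False),
    ],
)
print([o.track_id for o in clip.risk_objects()])  # → [1]
```

The point of the structure is that each clip ties the *reaction* to the *object* that caused it, which is exactly the link older datasets were missing.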
3. The Method: The "What If?" Game
To teach the computer how to spot danger, the researchers used a clever trick called Weakly Supervised Learning.
Imagine you are watching a movie and you want to know which character is the villain, but the movie doesn't tell you. You only see the hero jump out of the way.
- The AI's Strategy: The computer watches the video and asks, "If I remove this person from the scene, would the driver still have jumped?"
- If the driver still jumps without the person, that person isn't the danger.
- If the driver stops jumping when the person is removed, Bingo! That person is the risk.
The AI plays this "What If?" game thousands of times, learning to identify the specific object that caused the driver to react. It's like a detective eliminating suspects until only the culprit remains.
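The "What If?" game above can be sketched in a few lines. This is a toy version: the reaction predictor here is a hand-set stand-in, whereas the paper trains a neural model on real video; the object names and threat values are invented for illustration.

```python
# Toy sketch of the counterfactual "What If?" game: remove each object
# from the scene and measure how much the predicted driver reaction
# drops. The object whose removal causes the biggest drop is the
# likely risk object.

def predict_reaction(objects):
    # Stand-in for a learned model: probability the driver intervenes,
    # driven here by a hand-set "threat" value per object category.
    threat = {"distracted_ped": 0.9, "parked_car": 0.1, "cyclist": 0.2}
    return max((threat.get(o, 0.0) for o in objects), default=0.0)

def find_risk_object(objects):
    baseline = predict_reaction(objects)
    drops = {}
    for o in objects:
        counterfactual = [x for x in objects if x != o]  # scene without o
        drops[o] = baseline - predict_reaction(counterfactual)
    # Biggest drop in predicted reaction = the culprit.
    return max(drops, key=drops.get)

scene = ["parked_car", "distracted_ped", "cyclist"]
print(find_risk_object(scene))  # → distracted_ped
```

Removing the parked car or the cyclist changes nothing (the driver would still react), but removing the distracted pedestrian makes the predicted reaction collapse, so the pedestrian is flagged as the risk object.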
4. The "Eye Contact" Factor
The paper also focuses heavily on pedestrian attention.
- The Metaphor: Think of a pedestrian as a light switch.
- Looking at the car: The switch is ON. The pedestrian knows you are there. The risk is lower because they are communicating with you.
- Looking at their phone: The switch is OFF. They are oblivious. The risk is high because they might step out without warning.
The researchers built a system that can spot a face in a crowd and tell if the eyes are looking at the car. They found that when a pedestrian is looking, the "Risk Score" drops. When they aren't, the score goes up. This helps the car decide how hard to brake.
5. The Results: Smarter Than Before
When they tested their new system (the "Detective AI") against older methods:
- It was 20% to 23% better at spotting the real danger.
- It understood that a car blocking the lane is different from a car just parked.
- It realized that a pedestrian looking at the car is safer than one who isn't.
Why Does This Matter?
This research is a giant leap toward making self-driving cars feel more like experienced human drivers and less like nervous robots.
By teaching cars to understand intent (what people are thinking) and attention (who is looking where), we can build vehicles that don't just avoid crashes, but actually understand the flow of traffic. It's the difference between a car that stops because it sees a red light, and a car that slows down because it sees a distracted dog owner walking their dog near the curb.
In short: They built a massive library of "scary driving moments," taught a computer to play "What If?" to find the culprit, and proved that paying attention to where people are looking makes the car much safer.