ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB

This paper introduces the ALERT dataset containing 10,220 real-world IR-UWB radar samples and proposes the input-size-agnostic Vision Transformer (ISA-ViT) framework, which effectively addresses data scarcity and input dimensionality challenges to achieve a 22.68% accuracy improvement in driver activity recognition.

Jeongjun Park, Sunwook Hwang, Hyeonho Noh, Jin Mo Yang, Hyun Jong Yang, Saewoong Bahk

Published 2026-02-17

Imagine you are driving your car, and a tiny, invisible "super-sense" is watching you. It doesn't use a camera (which feels like a violation of privacy) and it doesn't use a microphone (which might pick up your private conversations). Instead, it uses Ultra-Wideband (UWB) radar—think of it as a bat's sonar that sends out invisible radio pulses to "see" your movements inside the car.

The goal? To catch you when you are being distracted (like texting, smoking, or nodding off) and alert you before an accident happens.

However, building a system to do this has been like trying to fit a square peg into a round hole. This paper, titled ALERT, solves two massive problems holding this technology back.

Problem 1: The "Empty Library"

The Analogy: Imagine you want to teach a robot how to recognize a cat. But you only have one blurry photo of a cat in a library. The robot will never learn well.
The Reality: Scientists had very few real-world examples of drivers getting distracted. Most data was fake (simulated in a computer), which is like teaching a pilot to fly using only a video game. Real roads have bumps, vibrations, and weird angles that fake data misses.

The Solution: The ALERT Dataset
The authors built a massive new library called ALERT.

  • What it is: They drove a real car around city streets and campuses with 9 volunteers.
  • The Collection: They recorded over 10,000 samples of 7 different behaviors: normal driving, relaxing (hands off the wheel), nodding off, smoking, drinking, fiddling with the radio, and using a phone.
  • Why it matters: This is the first time such a huge, realistic dataset exists. It's like giving the robot a library full of high-definition, real-life videos instead of one blurry photo.
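To make the seven behaviors concrete, here is how the label space of such a dataset might look in code. This is a hedged sketch only: the class names and the `label_to_id` helper are illustrative, not the authors' actual identifiers or file layout.

```python
# Illustrative only: the real ALERT dataset layout may differ.
# The seven driver behaviors described in the paper, as a label map.
ALERT_CLASSES = [
    "normal_driving",
    "relaxing_hands_off_wheel",
    "nodding_off",
    "smoking",
    "drinking",
    "using_radio",
    "using_phone",
]

def label_to_id(name: str) -> int:
    """Map a behavior name to an integer class id for training."""
    return ALERT_CLASSES.index(name)
```

With a mapping like this, each of the 10,000+ radar samples can be paired with a single integer class id for supervised training.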

Problem 2: The "Wrong Puzzle Piece"

The Analogy: Imagine you have a giant, complex puzzle (the radar data) that is a weird, long rectangle. You want to use a famous puzzle-solver robot (called a Vision Transformer or ViT) that was trained to solve perfect square puzzles (like photos of cats or dogs).
If you just squish or stretch your weird rectangle to fit the square robot, you ruin the picture. You stretch the cat's face until it looks like a pancake. The robot gets confused and fails.

The Reality:

  • The Robot: Vision Transformers are the current "champions" of AI for recognizing images. They are incredibly smart but picky; they expect data to be a specific size (like a 224x224 pixel square).
  • The Mismatch: Radar data is messy. It comes in different shapes and sizes depending on how far the driver is or how fast they are moving.
  • The Old Way: Scientists used to just "squish" the radar data to fit the robot. This destroyed important details, like the speed of a hand movement (Doppler shift) or the exact distance of a body part.

The Solution: ISA-ViT (The "Shape-Shifting" Adapter)
The authors invented a new tool called ISA-ViT (Input-Size-Agnostic Vision Transformer).

  • How it works: Instead of squishing the data, ISA-ViT acts like a smart tailor. It takes the weirdly shaped radar data and cuts it into perfect "patches" (like slicing a pizza) that fit the robot's brain without stretching or losing any ingredients.
  • The Secret Sauce: It uses a special trick to keep the robot's "memory" of where things are (positional embeddings) intact, even when the data shape changes. It's like telling the robot, "Even though this pizza slice is bigger, it's still the top-left slice."
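The two ideas above can be sketched in a few lines: cut the input into fixed-size patches by padding (never stretching), and interpolate the positional-embedding grid so each patch keeps a consistent sense of "where it is". This is a minimal illustration under standard ViT conventions, not the authors' implementation; function names and the bilinear interpolation choice are assumptions.

```python
import numpy as np

def patchify(x, patch=16):
    """Cut a 2-D radar map into non-overlapping patch vectors.

    x is zero-padded (not resized) so both sides divide by `patch`;
    nothing is stretched, so detail inside each patch survives.
    """
    h, w = x.shape
    ph, pw = -h % patch, -w % patch            # padding needed per side
    x = np.pad(x, ((0, ph), (0, pw)))
    gh, gw = x.shape[0] // patch, x.shape[1] // patch
    patches = (x.reshape(gh, patch, gw, patch)
                 .transpose(0, 2, 1, 3)        # group patch pixels together
                 .reshape(gh * gw, patch * patch))
    return patches, (gh, gw)

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly interpolate a ViT positional-embedding grid.

    pos_embed: (old_h * old_w, dim) learned position vectors.
    Returns (new_h * new_w, dim) so a differently shaped radar input
    still tells the model "this patch is top-left", etc.
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = pos_embed.shape[1]
    grid = pos_embed.reshape(old_h, old_w, dim)
    ys = np.linspace(0, old_h - 1, new_h)
    xs = np.linspace(0, old_w - 1, new_w)
    out = np.empty((new_h, new_w, dim))
    for i, y in enumerate(ys):
        y0 = int(np.floor(y)); y1 = min(y0 + 1, old_h - 1); wy = y - y0
        for j, x in enumerate(xs):
            x0 = int(np.floor(x)); x1 = min(x0 + 1, old_w - 1); wx = x - x0
            top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x1]
            bot = (1 - wx) * grid[y1, x0] + wx * grid[y1, x1]
            out[i, j] = (1 - wy) * top + wy * bot
    return out.reshape(new_h * new_w, dim)
```

A non-square radar map then yields its own patch grid, and the same pre-trained embeddings are reshaped to match it instead of forcing the input into a 224x224 square.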

The "Double-Check" Strategy (Domain Fusion)

The authors also realized that looking at the data in just one way isn't enough.

  • Distance View (Range): Tells you where the hand is.
  • Speed View (Doppler frequency): Tells you how fast the hand is moving.
  • The Magic: They combined both views. It's like having a security guard who checks both your ID (distance) and your gait (speed). If one looks suspicious, the other confirms it. This "fusion" made the system much smarter at telling the difference between, say, drinking water and smoking a cigarette.
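The two views come from the same raw radar frames. A minimal sketch, assuming frames are stored as a (slow-time x fast-time) matrix where each row is one pulse and each column a range bin; the paper's actual preprocessing pipeline and fusion method may differ, and `fuse` here is simple concatenation for illustration.

```python
import numpy as np

def range_doppler_views(frames):
    """Split raw IR-UWB frames into the two complementary views.

    frames: (slow_time, fast_time) array -- each row is one radar
    pulse, each column a range bin (distance from the sensor).
    """
    # Range view: signal energy per distance bin over time.
    range_view = np.abs(frames)
    # Doppler view: an FFT along slow time turns the pulse sequence at
    # each range bin into a velocity (frequency) spectrum.
    doppler_view = np.abs(np.fft.fftshift(np.fft.fft(frames, axis=0), axes=0))
    return range_view, doppler_view

def fuse(range_feat, doppler_feat):
    """Late fusion by concatenation: the classifier sees both views."""
    return np.concatenate([range_feat.ravel(), doppler_feat.ravel()])
```

Two gestures at the same distance (drinking vs. smoking) can have different Doppler signatures, which is why the fused input separates them better than either view alone.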

The Results: A Super-Smart Co-Pilot

When they tested this new system:

  1. It's a big leap: It got 22.68% better at recognizing distractions than previous methods.
  2. It's super accurate: It correctly identified distracted driving 97.35% of the time.
  3. It's safe: It can tell the difference between a driver who is just relaxing and one who is texting, which is crucial for not annoying drivers with false alarms.

The Bottom Line

This paper is a huge step forward for car safety.

  • They built the best map (ALERT dataset) of distracted driving ever.
  • They built the best compass (ISA-ViT) to navigate that map, even when the terrain changes shape.
  • They made it all open source, meaning other scientists can use these tools to build even better safety systems for the future.

In short, they taught an AI to "see" drivers using invisible radar waves, without invading their privacy, and they did it so well that it could soon become a standard feature in your next car, keeping you safe from the dangers of distraction.
