ALERT Open Dataset and Input-Size-Agnostic Vision Transformer for Driver Activity Recognition using IR-UWB

This paper introduces the ALERT dataset containing 10,220 real-world IR-UWB radar samples and proposes the input-size-agnostic Vision Transformer (ISA-ViT) framework, which effectively addresses data scarcity and input dimensionality challenges to achieve a 22.68% accuracy improvement in driver activity recognition.

Jeongjun Park, Sunwook Hwang, Hyeonho Noh, Jin Mo Yang, Hyun Jong Yang, Saewoong Bahk

Published 2026-02-17

Imagine you are driving your car, and a tiny, invisible "super-sense" is watching you. It doesn't use a camera (which feels like a violation of privacy) and it doesn't use a microphone (which might pick up your private conversations). Instead, it uses Ultra-Wideband (UWB) radar—think of it as a bat's sonar that sends out invisible radio pulses to "see" your movements inside the car.

The goal? To catch you when you are being distracted (like texting, smoking, or nodding off) and alert you before an accident happens.

However, building a system to do this has been like trying to fit a square peg into a round hole. This paper, titled ALERT, solves two massive problems holding this technology back.

Problem 1: The "Empty Library"

The Analogy: Imagine you want to teach a robot how to recognize a cat. But you only have one blurry photo of a cat in a library. The robot will never learn well.
The Reality: Scientists had very few real-world examples of drivers getting distracted. Most data was fake (simulated in a computer), which is like teaching a pilot to fly using only a video game. Real roads have bumps, vibrations, and weird angles that fake data misses.

The Solution: The ALERT Dataset
The authors built a massive new library called ALERT.

  • What it is: They drove a real car around city streets and campuses with 9 volunteers.
  • The Collection: They recorded over 10,000 samples of 7 different behaviors: normal driving, relaxing (hands off the wheel), nodding off, smoking, drinking, fiddling with the radio, and using a phone.
  • Why it matters: This is the first time such a huge, realistic dataset exists. It's like giving the robot a library full of high-definition, real-life videos instead of one blurry photo.
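To make the seven behaviors concrete, here is how the label space of such a dataset might look in code. This is a hedged sketch only: the class names and the `label_to_id` helper are illustrative, not the authors' actual identifiers or file layout.

```python
# Illustrative only: the real ALERT dataset layout may differ.
# The seven driver behaviors described in the paper, as a label map.
ALERT_CLASSES = [
    "normal_driving",
    "relaxing_hands_off_wheel",
    "nodding_off",
    "smoking",
    "drinking",
    "using_radio",
    "using_phone",
]

def label_to_id(name: str) -> int:
    """Map a behavior name to an integer class id for training."""
    return ALERT_CLASSES.index(name)
```

With a mapping like this, each of the 10,000+ radar samples can be paired with a single integer class id for supervised training.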

Problem 2: The "Wrong Puzzle Piece"

The Analogy: Imagine you have a giant, complex puzzle (the radar data) that is a weird, long rectangle. You want to use a famous puzzle-solver robot (called a Vision Transformer or ViT) that was trained to solve perfect square puzzles (like photos of cats or dogs).
If you just squish or stretch your weird rectangle to fit the square robot, you ruin the picture. You stretch the cat's face until it looks like a pancake. The robot gets confused and fails.

The Reality:

  • The Robot: Vision Transformers are the current "champions" of AI for recognizing images. They are incredibly smart but picky; they expect data to be a specific size (like a 224x224 pixel square).
  • The Mismatch: Radar data is messy. It comes in different shapes and sizes depending on how far the driver is or how fast they are moving.
  • The Old Way: Scientists used to just "squish" the radar data to fit the robot. This destroyed important details, like the speed of a hand movement (Doppler shift) or the exact distance of a body part.

The Solution: ISA-ViT (The "Shape-Shifting" Adapter)
The authors invented a new tool called ISA-ViT (Input-Size-Agnostic Vision Transformer).

  • How it works: Instead of squishing the data, ISA-ViT acts like a smart tailor. It takes the weirdly shaped radar data and cuts it into perfect "patches" (like slicing a pizza) that fit the robot's brain without stretching or losing any ingredients.
  • The Secret Sauce: It uses a special trick to keep the robot's "memory" of where things are (positional embeddings) intact, even when the data shape changes. It's like telling the robot, "Even though this pizza slice is bigger, it's still the top-left slice."
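The two ideas above can be sketched in a few lines: cut the input into fixed-size patches by padding (never stretching), and interpolate the positional-embedding grid so each patch keeps a consistent sense of "where it is". This is a minimal illustration under standard ViT conventions, not the authors' implementation; function names and the bilinear interpolation choice are assumptions.

```python
import numpy as np

def patchify(x, patch=16):
    """Cut a 2-D radar map into non-overlapping patch vectors.

    x is zero-padded (not resized) so both sides divide by `patch`;
    nothing is stretched, so detail inside each patch survives.
    """
    h, w = x.shape
    ph, pw = -h % patch, -w % patch            # padding needed per side
    x = np.pad(x, ((0, ph), (0, pw)))
    gh, gw = x.shape[0] // patch, x.shape[1] // patch
    patches = (x.reshape(gh, patch, gw, patch)
                 .transpose(0, 2, 1, 3)        # group patch pixels together
                 .reshape(gh * gw, patch * patch))
    return patches, (gh, gw)

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly interpolate a ViT positional-embedding grid.

    pos_embed: (old_h * old_w, dim) learned position vectors.
    Returns (new_h * new_w, dim) so a differently shaped radar input
    still tells the model "this patch is top-left", etc.
    """
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    dim = pos_embed.shape[1]
    grid = pos_embed.reshape(old_h, old_w, dim)
    ys = np.linspace(0, old_h - 1, new_h)
    xs = np.linspace(0, old_w - 1, new_w)
    out = np.empty((new_h, new_w, dim))
    for i, y in enumerate(ys):
        y0 = int(np.floor(y)); y1 = min(y0 + 1, old_h - 1); wy = y - y0
        for j, x in enumerate(xs):
            x0 = int(np.floor(x)); x1 = min(x0 + 1, old_w - 1); wx = x - x0
            top = (1 - wx) * grid[y0, x0] + wx * grid[y0, x1]
            bot = (1 - wx) * grid[y1, x0] + wx * grid[y1, x1]
            out[i, j] = (1 - wy) * top + wy * bot
    return out.reshape(new_h * new_w, dim)
```

A non-square radar map then yields its own patch grid, and the same pre-trained embeddings are reshaped to match it instead of forcing the input into a 224x224 square.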

The "Double-Check" Strategy (Domain Fusion)

The authors also realized that looking at the data in just one way isn't enough.

  • Distance View (Range): Tells you where the hand is.
  • Speed View (Doppler frequency): Tells you how fast the hand is moving.
  • The Magic: They combined both views. It's like having a security guard who checks both your ID (distance) and your gait (speed). If one looks suspicious, the other confirms it. This "fusion" made the system much smarter at telling the difference between, say, drinking water and smoking a cigarette.
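The two views come from the same raw radar frames. A minimal sketch, assuming frames are stored as a (slow-time x fast-time) matrix where each row is one pulse and each column a range bin; the paper's actual preprocessing pipeline and fusion method may differ, and `fuse` here is simple concatenation for illustration.

```python
import numpy as np

def range_doppler_views(frames):
    """Split raw IR-UWB frames into the two complementary views.

    frames: (slow_time, fast_time) array -- each row is one radar
    pulse, each column a range bin (distance from the sensor).
    """
    # Range view: signal energy per distance bin over time.
    range_view = np.abs(frames)
    # Doppler view: an FFT along slow time turns the pulse sequence at
    # each range bin into a velocity (frequency) spectrum.
    doppler_view = np.abs(np.fft.fftshift(np.fft.fft(frames, axis=0), axes=0))
    return range_view, doppler_view

def fuse(range_feat, doppler_feat):
    """Late fusion by concatenation: the classifier sees both views."""
    return np.concatenate([range_feat.ravel(), doppler_feat.ravel()])
```

Two gestures at the same distance (drinking vs. smoking) can have different Doppler signatures, which is why the fused input separates them better than either view alone.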

The Results: A Super-Smart Co-Pilot

When they tested this new system:

  1. It's a big leap: It got 22.68% better at recognizing distractions than previous methods.
  2. It's super accurate: It correctly identified distracted driving 97.35% of the time.
  3. It's safe: It can tell the difference between a driver who is just relaxing and one who is texting, which is crucial for not annoying drivers with false alarms.

The Bottom Line

This paper is a huge step forward for car safety.

  • They built the best map (ALERT dataset) of distracted driving ever.
  • They built the best compass (ISA-ViT) to navigate that map, even when the terrain changes shape.
  • They made it all open source, meaning other scientists can use these tools to build even better safety systems for the future.

In short, they taught an AI to "see" drivers using invisible radar waves, without invading their privacy, and they did it so well that it could soon become a standard feature in your next car, keeping you safe from the dangers of distraction.
