PO-GUISE+: Pose and object guided transformer token selection for efficient driver action recognition

Imagine you are driving a car, and a very smart, but slightly overwhelmed, assistant is sitting in the passenger seat. This assistant's job is to watch you and tell you if you are distracted (like texting, eating, or looking at a map) or if you are focused on the road.

The problem is that this assistant is too smart and too slow.

In the world of AI, the "smartest" assistants are called Transformers. They are like brilliant detectives who examine every small patch of a video, frame by frame, to understand what's happening. But because they look at everything so carefully, they get exhausted. They require so much computer power that they can't run on the small, cheap computers inside a real car. Running one there is like trying to run a supercomputer in a toaster.

This paper introduces a new assistant named PO-GUISE+. Here is how it works, using some simple analogies:

1. The Problem: The "Over-Reading" Assistant

Imagine you are reading a 100-page book to find one specific sentence. A standard AI (the old way) reads every single word on every page, even the words about the weather or the furniture, just in case they are important. This takes forever and uses up all your brainpower.

In video terms, the AI looks at every part of every frame: the driver's face, the steering wheel, the window, and the backseat, even if the driver is just sitting still. Worse, it compares every chunk of the video with every other chunk, so the work grows with the square of the input: double the video and the work roughly quadruples. This is called quadratic complexity, and it quickly becomes too heavy for a car computer.

2. The Solution: The "Smart Filter" (Token Selection)

To fix this, the researchers taught the AI to be a smart filter. Instead of reading the whole book, the AI learns to skip the boring parts and only read the exciting chapters.

In AI language, they call these "tokens" (little chunks of the video). The new method, PO-GUISE+, decides which chunks of the video to keep and which to throw away.

3. The Secret Sauce: "Who, What, and Where"

Previous smart filters were okay, but they had a blind spot. They knew who was in the video (the driver's body pose) but they didn't know what the driver was touching.

PO-GUISE+ adds a third sense: Object Interaction.

Think of it like this:

Old AI: "I see a hand moving. Is it waving? Is it scratching? I'm not sure, so I'll keep watching the whole hand."
PO-GUISE+: "I see a hand moving, AND I see it holding a phone. Therefore, I only need to watch the hand and the phone. I can ignore the rest of the car!"

By knowing where the driver's body is (Pose) and what object they are touching (Object), the AI can discard a large share of the video chunks early on. It throws away the empty seats, the windows, and the dashboard, focusing only on the driver and the thing they are holding.

4. The Result: Fast, Cheap, and Accurate

Because PO-GUISE+ throws away so much unnecessary data, it becomes incredibly fast and efficient.

The Car Computer Test: The researchers tested this on a Jetson, which is a tiny, low-power computer board used in robots and cars.
The Speed: The old, heavy AI was too slow to run smoothly. PO-GUISE+ runs at 33 to 57 frames per second, depending on the model size. That is fast enough to watch the driver in real time without lagging.
The Accuracy: Even though it ignores much of the video, the full-size version is actually better at spotting distractions than the heavy models it is built on. It's like a sniper who only needs one clear shot to hit the target, rather than a machine gunner spraying bullets everywhere.

5. Why This Matters

Driver distraction is a leading cause of car accidents. We need systems that can watch drivers 24/7 to warn them or take control if they fall asleep or get distracted.

Before: We had smart systems that were too expensive and power-hungry to put in every car.
Now: PO-GUISE+ is like a lightweight, super-efficient detective that fits in a small box, runs on a car battery, and is smarter than the old giants.

Summary Analogy

Imagine you are trying to find a specific person in a crowded stadium.

The Old Way: You scan every single seat in the stadium, looking at every person's face, even if they are sitting in the wrong section. It takes hours.
PO-GUISE+: You are told, "The person is wearing a red hat and holding a blue umbrella." You immediately ignore everyone else and only look at the people with red hats and blue umbrellas. You find them instantly, with zero wasted effort.

This paper proves that by teaching AI to look at the right things (the driver and the object) and ignore the wrong things (the empty car interior), we can make life-saving safety technology affordable and fast enough for every car on the road.

1. Problem Statement

Driver distraction detection is critical for road safety, yet current state-of-the-art solutions face a significant bottleneck: computational efficiency.

The Challenge: While Vision Transformers (ViTs) have achieved superior accuracy in human action recognition, their self-attention has quadratic computational complexity ($O(N^2)$) in the number of spatio-temporal tokens $N$, which makes them too resource-intensive for real-time deployment in onboard vehicle systems (edge devices).
Limitations of Existing Solutions:
- Standard token pruning methods (e.g., Top-K attention) often discard tokens that look unimportant under a generic attention score but are crucial for the specific task.
- Previous pose-guided methods (like the authors' earlier PO-GUISE) improved efficiency by using human pose but ignored object interactions. In driver distraction, the interaction with objects (e.g., holding a phone, eating, adjusting a radio) is the primary indicator of distraction. Ignoring this leads to suboptimal performance, especially under strict computational budgets.
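To make the $O(N^2)$ bottleneck concrete, the sketch below counts the tokens in one clip and the token pairs a single attention head must score. The clip shape (16 frames at 224x224, cut into 2x16x16 cubes) is an assumption for illustration, a common video-ViT configuration rather than a setting stated in this summary, and the function names are hypothetical.

```python
def num_tokens(frames: int, height: int, width: int,
               tubelet: int = 2, patch: int = 16) -> int:
    """Tokens produced by cutting a clip into tubelet x patch x patch cubes."""
    return (frames // tubelet) * (height // patch) * (width // patch)


def attention_pairs(n_tokens: int) -> int:
    """Token pairs scored by one self-attention head in one layer (N^2)."""
    return n_tokens * n_tokens


# Assumed clip shape, for illustration only.
full = num_tokens(frames=16, height=224, width=224)  # 8 * 14 * 14 tokens
half = full // 2                                     # after dropping half the tokens

print(full, attention_pairs(full))                   # 1568 2458624
print(half, attention_pairs(half))                   # 784 614656
print(attention_pairs(full) / attention_pairs(half)) # 4.0
```

Halving the token count cuts the attention work to a quarter, which is why token selection pays off. The feed-forward layers of a transformer scale only linearly with $N$, so total GFLOPs shrink less sharply than the attention term alone.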

2. Methodology: PO-GUISE+

The authors propose PO-GUISE+, a multi-task video transformer framework designed to reduce computational costs while maintaining or improving accuracy by integrating driver pose and interacting object location into the token selection process.

A. Multi-Task Architecture

The model builds on a pre-trained ViT backbone (VideoMAEv2 or InternVideo2) and performs three tasks simultaneously from a single video clip:

Distraction Classification: Identifying the specific distracted action.
Pose Estimation: Locating driver body landmarks.
Object Localization: Identifying the location of the object the driver is interacting with.

B. Heatmap Token Representation

Instead of relying on external detectors during inference, the model learns to generate motion heatmaps internally:

Input: A video clip is tokenized into visual tokens ($X_{vis}$), a class token ($X_{cls}$), and learnable heatmap tokens ($X_{hm}$).
Motion Heatmaps: The model generates a single motion heatmap per clip that averages the movement of body joints and the interacting object across all frames. This captures temporal dynamics without needing frame-by-frame detection.
Training: The model is trained using a joint loss function combining Cross-Entropy (for classification) and Mean Squared Error (for heatmap regression), balanced dynamically using Nash-MTL (Nash Multi-Task Learning).
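The heatmap target and the joint loss described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-frame target is assumed to be a Gaussian bump around each keypoint, the clip-level map is their plain average, and the two loss terms are combined with fixed weights as a stand-in for Nash-MTL, which adjusts the balance during training. All function names and parameters are hypothetical.

```python
import numpy as np


def gaussian_heatmap(center_xy, size, sigma=1.5):
    """One 2D Gaussian bump centred on a keypoint, on a size x size grid."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def motion_heatmap(track_xy, size):
    """Average the per-frame heatmaps of one keypoint track into one clip-level map.

    track_xy: array of shape (T, 2) holding (x, y) grid coordinates per frame.
    """
    frames = [gaussian_heatmap(xy, size) for xy in track_xy]
    return np.mean(frames, axis=0)


def joint_loss(logits, label, pred_hm, target_hm, w_cls=1.0, w_hm=1.0):
    """Cross-entropy on class logits plus MSE on heatmaps, with FIXED weights.

    The fixed weights are an assumption; the source balances tasks with Nash-MTL.
    """
    shifted = logits - logits.max()                      # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    ce = -log_probs[label]
    mse = np.mean((pred_hm - target_hm) ** 2)
    return w_cls * ce + w_hm * mse


# A hand (or held object) drifting right and down over four frames.
track = np.array([[2.0, 2.0], [4.0, 3.0], [6.0, 4.0], [8.0, 5.0]])
target = motion_heatmap(track, size=14)
loss = joint_loss(np.array([2.0, 0.5, -1.0]), label=0,
                  pred_hm=np.zeros_like(target), target_hm=target)
print(target.shape, float(loss) > 0.0)
```

Because the frames are averaged, a moving keypoint leaves a smeared trail in the target while a static one leaves a single sharp peak, so one map per clip still encodes where motion happened.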

C. Pose-and-Object-Guided Token Selection

The core innovation is a two-step token selection module integrated into the transformer layers:

Token Pruning: Visual tokens ($X_{vis}$) are ranked by their attention scores with respect to the class token, the pose heatmap tokens, and the object heatmap tokens, and the lowest-ranked ones are discarded. By explicitly attending to object locations, the model retains tokens critical for distinguishing actions like "drinking" vs. "talking on the phone."
Token Merging: The discarded tokens are analyzed for similarity. Similar tokens are merged (averaged) to recover information, minimizing data loss.

Control: The process is governed by two rates, $\rho$ for pruning and $\lambda$ for merging, which together give fine-grained control over the computational budget.
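The prune-then-merge step can be illustrated with a short NumPy sketch. Several details here are assumptions rather than facts from the paper: the per-token guidance score is taken as already combined across the class, pose, and object tokens, $\rho$ is read as the fraction of tokens kept outright, $\lambda$ as the fraction of pruned tokens that survive as merged tokens, and the grouping rule (cosine similarity to the best-scoring pruned tokens) is a simple stand-in for whatever similarity matching the authors use.

```python
import numpy as np


def _unit(a):
    """L2-normalise rows so that dot products become cosine similarities."""
    return a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)


def select_tokens(x_vis, scores, rho, lam):
    """Keep the top-scoring tokens, then merge the pruned ones into a few averages.

    x_vis:  (N, D) visual tokens.
    scores: (N,) guidance score per token (assumed already combined).
    rho:    fraction of tokens kept outright (hypothetical reading).
    lam:    fraction of pruned tokens surviving as merged tokens (hypothetical).
    """
    n = x_vis.shape[0]
    n_keep = max(1, int(np.ceil(rho * n)))
    order = np.argsort(-scores)                  # highest score first
    keep_idx, drop_idx = order[:n_keep], order[n_keep:]
    kept = x_vis[keep_idx]
    if drop_idx.size == 0 or lam <= 0:
        return kept

    n_merge = max(1, int(np.ceil(lam * drop_idx.size)))
    dropped = x_vis[drop_idx]
    seeds = dropped[:n_merge]                    # best-scoring pruned tokens seed the groups
    sim = _unit(dropped) @ _unit(seeds).T        # (n_drop, n_merge) cosine similarities
    assign = sim.argmax(axis=1)
    assign[:n_merge] = np.arange(n_merge)        # each seed stays in its own group
    merged = np.stack([dropped[assign == g].mean(axis=0) for g in range(n_merge)])
    return np.concatenate([kept, merged], axis=0)


rng = np.random.default_rng(0)
x = rng.normal(size=(10, 4))
s = np.arange(10, dtype=float)                   # token 9 scores highest
out = select_tokens(x, s, rho=0.5, lam=0.4)
print(out.shape)                                 # (7, 4): 5 kept + 2 merged
```

Merging means the pruned background is summarised rather than erased, so a later layer still sees a coarse trace of it at the cost of only a few extra tokens.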

3. Key Contributions

Novel Token Selection Strategy: First method to integrate object interaction and human pose simultaneously to guide token pruning in a multi-task transformer, specifically tailored for driver distraction.
Detector-Free Inference: Unlike methods requiring external pose or object detectors (e.g., ViTPose, YOLO) at runtime, PO-GUISE+ is self-contained. It uses external tools only for generating training pseudo-labels; the inference model generates its own guidance features.
Efficiency-Accuracy Trade-off: The method significantly reduces computational demands (GFLOPs) while outperforming baselines, particularly at low token keep rates where information loss is usually catastrophic.
Real-World Benchmarking: Extensive evaluation on NVIDIA Jetson Orin NX hardware, demonstrating feasibility for real-time onboard deployment.

4. Experimental Results

The model was evaluated on three major datasets: Drive&Act, 100-Driver, and 3MDAD.

Accuracy Improvements:
- Drive&Act: PO-GUISE+ achieved 70.35% macro accuracy (vs. 69.47% for PO-GUISE and 68.27% for the baseline VideoMAEv2) while reducing GFLOPs by 30% (from 360 to 251).
- 100-Driver: Achieved 93.54% accuracy (vs. 91.30% baseline) with 251 GFLOPs.
- 3MDAD: Surpassed the previous state-of-the-art (MIFI) by 9.52% in accuracy while reducing computational cost by 28 GFLOPs.
Efficiency on Edge Devices (Jetson Orin NX):
- The optimized model achieved 72.62% accuracy with an inference speed of 33 FPS using only 3.8 GB of memory.
- A lightweight configuration (ViT-S backbone) reached 57 FPS with 57.42% accuracy, outperforming lightweight CNNs (like I3D) by more than 11 accuracy points at similar memory footprints.
Ablation Studies:
- Including object heatmaps (PO-GUISE+) provided a 3.54% accuracy boost over the pose-only version (PO-GUISE) at the same computational cost.
- The method acts as a regularizer; at very high token keep rates, performance slightly dipped, suggesting the pruning mechanism helps prevent overfitting.
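The headline figures above can be cross-checked with a few lines of arithmetic. The sketch uses only numbers quoted in this section; the helper names are hypothetical, and the per-frame time budget is simply the reciprocal of the reported FPS.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a cost drops when going from `before` to `after`."""
    return (before - after) / before


def frame_budget_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds, at a given FPS."""
    return 1000.0 / fps


# Drive&Act: 360 -> 251 GFLOPs.
print(f"{relative_reduction(360, 251):.1%}")        # 30.3%

# Accuracy gains in percentage points on Drive&Act and 100-Driver.
print(round(70.35 - 69.47, 2), round(70.35 - 68.27, 2), round(93.54 - 91.30, 2))

# Jetson Orin NX throughput expressed as time per frame.
print(f"{frame_budget_ms(33):.1f} ms")              # 30.3 ms
print(f"{frame_budget_ms(57):.1f} ms")              # 17.5 ms
```

The GFLOPs figures do match the stated 30% reduction, and the Drive&Act gains work out to 0.88 points over PO-GUISE and 2.08 points over the VideoMAEv2 baseline.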

5. Significance and Conclusion

PO-GUISE+ addresses the critical gap between high-accuracy AI models and the resource constraints of automotive edge computing.

Safety Impact: By enabling high-accuracy distraction detection on low-power hardware, it facilitates the deployment of robust Driver Monitoring Systems (DMS) in mass-market vehicles.
Technical Advancement: It demonstrates that semantic guidance (using pose and object location) is superior to generic token pruning for specialized tasks.
Future Work: The authors note that while the model is robust to daytime lighting variations, future iterations will focus on multi-modal strategies (combining RGB and NIR) for all-weather and nighttime operation, and on longer temporal contexts to resolve ambiguous actions.

In summary, PO-GUISE+ sets a new state of the art for efficient driver action recognition, showing that specialized token selection guided by task-relevant semantics can substantially reduce computational costs without sacrificing the accuracy required for safety-critical applications.