PO-GUISE+: Pose and object guided transformer token selection for efficient driver action recognition

The paper introduces PO-GUISE+, a multi-task video transformer that leverages driver pose and object interaction information to efficiently select tokens, achieving state-of-the-art distracted driving recognition accuracy on multiple datasets while significantly reducing computational costs for real-world onboard deployment.

Ricardo Pizarro, Roberto Valle, Rafael Barea, Jose M. Buenaposada, Luis Baumela, Luis Miguel Bergasa

Published 2026-03-03

Imagine you are driving a car, and a very smart, but slightly overwhelmed, assistant is sitting in the passenger seat. This assistant's job is to watch you and tell you if you are distracted (like texting, eating, or looking at a map) or if you are focused on the road.

The problem is that this assistant is too smart and too slow.

In the world of AI, the "smartest" assistants are called Transformers. They are like brilliant detectives who examine every small patch of every video frame to understand what's happening. But because they look at everything so carefully, they need far more computing power than the small, cheap computers inside a real car can provide. Running one there is like trying to run a supercomputer in a toaster.

This paper introduces a new assistant named PO-GUISE+. Here is how it works, using some simple analogies:

1. The Problem: The "Over-Reading" Assistant

Imagine you are reading a 100-page book to find one specific sentence. A standard AI (the old way) reads every single word on every page, even the words about the weather or the furniture, just in case they are important. This takes forever and uses up all your brainpower.

In video terms, the AI processes every patch of the driver's face, the steering wheel, the window, and the backseat, even if the driver is just sitting still. The underlying problem is called quadratic complexity: the model compares every chunk of the video with every other chunk, so doubling the amount of video roughly quadruples the work, and it quickly becomes too heavy for a car computer.
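To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. The clip size, patch size, and tubelet length are assumptions picked for illustration (they are common video-transformer settings, not values taken from the paper), and the function names are hypothetical.

```python
def tokens_per_clip(frames, height, width, patch=16, tubelet=2):
    """Count the chunks ("tokens") a clip is cut into.

    patch and tubelet are illustrative defaults, not values from the paper.
    """
    return (frames // tubelet) * (height // patch) * (width // patch)


def attention_pairs(num_tokens):
    """Full self-attention compares every token with every token."""
    return num_tokens * num_tokens


short_clip = tokens_per_clip(frames=16, height=224, width=224)
long_clip = tokens_per_clip(frames=32, height=224, width=224)

print(short_clip, attention_pairs(short_clip))  # 1568 2458624
print(long_clip, attention_pairs(long_clip))    # 3136 9834496
print(attention_pairs(long_clip) / attention_pairs(short_clip))  # 4.0
```

Twice as many frames means twice as many tokens but four times as many comparisons, which is why cutting the token count pays off so quickly.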

2. The Solution: The "Smart Filter" (Token Selection)

To fix this, the researchers taught the AI to be a smart filter. Instead of reading the whole book, the AI learns to skip the boring parts and only read the exciting chapters.

In AI language, they call these "tokens" (little chunks of the video). The new method, PO-GUISE+, decides which chunks of the video to keep and which to throw away.
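As a rough illustration of what "keeping some tokens and throwing the rest away" can look like, here is a minimal top-k sketch. It assumes each token already has an importance score; how PO-GUISE+ actually produces and uses such scores is not described here, so the function name, the labels, and the numbers below are all hypothetical.

```python
def select_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, in their original order.

    Generic top-k selection for illustration only, not the paper's exact rule.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, round(len(tokens) * keep_ratio))
    # Indices of the k best-scoring tokens.
    best = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(best)  # restore original order
    return [tokens[i] for i in kept], kept


# Made-up labels and scores standing in for video chunks.
tokens = ["window", "hand", "phone", "seat", "face", "dashboard"]
scores = [0.05, 0.90, 0.95, 0.10, 0.60, 0.20]

kept_tokens, kept_indices = select_tokens(tokens, scores, keep_ratio=0.5)
print(kept_tokens)   # ['hand', 'phone', 'face']
print(kept_indices)  # [1, 2, 4]
```

Only the kept tokens are passed on to the later, more expensive layers, so the rest of the model has less to compare.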

3. The Secret Sauce: "Who, What, and Where"

Previous smart filters were okay, but they had a blind spot. They knew who was in the video (the driver's body pose) but they didn't know what the driver was touching.

PO-GUISE+ adds a third sense: Object Interaction.

Think of it like this:

  • Old AI: "I see a hand moving. Is it waving? Is it scratching? I'm not sure, so I'll keep watching the whole hand."
  • PO-GUISE+: "I see a hand moving, AND I see it holding a phone. Therefore, I only need to watch the hand and the phone. I can ignore the rest of the car!"

By knowing where the driver's body is (Pose) and what object they are touching (Object), the AI can instantly ignore 70% of the video. It throws away the empty seats, the windows, and the dashboard, focusing only on the driver and the thing they are holding.
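The idea of using pose and object locations to decide which parts of a frame matter can be sketched with a simple geometric rule: keep a patch if a body keypoint falls inside it or an object box overlaps it. This is a hand-written stand-in for illustration, not the selection mechanism from the paper. The frame size, patch size, keypoints, and phone box below are all made-up assumptions.

```python
def guided_keep_mask(grid_h, grid_w, patch, keypoints, boxes):
    """Mark patches touched by a pose keypoint or an object box.

    keypoints: (x, y) pixel positions; boxes: (x0, y0, x1, y1) in pixels,
    with inclusive corners. Illustrative rule only.
    """
    keep = [[False] * grid_w for _ in range(grid_h)]
    for x, y in keypoints:
        r, c = int(y // patch), int(x // patch)
        if 0 <= r < grid_h and 0 <= c < grid_w:
            keep[r][c] = True
    for x0, y0, x1, y1 in boxes:
        # Clamp the box to the grid; boxes outside the frame mark nothing.
        r0, r1 = max(0, int(y0 // patch)), min(grid_h - 1, int(y1 // patch))
        c0, c1 = max(0, int(x0 // patch)), min(grid_w - 1, int(x1 // patch))
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                keep[r][c] = True
    return keep


# Hypothetical 224x224 frame cut into 16x16 patches (a 14x14 grid).
keypoints = [(112, 40), (90, 100), (150, 100), (100, 150)]  # head, shoulders, hand
phone_box = [(96, 140, 130, 170)]

mask = guided_keep_mask(14, 14, 16, keypoints, phone_box)
kept = sum(cell for row in mask for cell in row)
print(kept, "of", 14 * 14, "patches kept")  # 12 of 196 patches kept
```

Everything outside the mask (seats, windows, dashboard) would be dropped before the expensive layers run. A real system would work from pose and object predictions rather than hand-typed coordinates.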

4. The Result: Fast, Cheap, and Accurate

Because PO-GUISE+ throws away so much unnecessary data, it becomes incredibly fast and efficient.

  • The Car Computer Test: The researchers tested this on a Jetson, which is a tiny, low-power computer board used in robots and cars.
  • The Speed: The old, heavy AI was too slow to run smoothly. PO-GUISE+ runs at 33 to 57 frames per second, which is fast enough to watch the driver in real time without lagging.
  • The Accuracy: Even though it ignores most of the video, it is actually better at spotting distractions than the heavy models. It's like a sniper who only needs one clear shot to hit the target, rather than a machine gunner spraying bullets everywhere.
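To put the frame rates above in perspective, this short sketch converts frames per second into the time available for each frame. The function name is hypothetical; only the two frame rates come from the text.

```python
def frame_budget_ms(fps):
    """Milliseconds available to process one frame at a given frame rate."""
    return 1000.0 / fps


for fps in (33, 57):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")
# 33 fps -> 30.3 ms per frame
# 57 fps -> 17.5 ms per frame
```

In other words, the whole pipeline has to finish each frame in roughly 18 to 30 milliseconds on the small onboard computer.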

5. Why This Matters

Driver distraction is a leading cause of car accidents. We need systems that can watch drivers 24/7 to warn them or take control if they fall asleep or get distracted.

  • Before: We had smart systems that were too expensive and power-hungry to put in every car.
  • Now: PO-GUISE+ is like a lightweight, super-efficient detective that fits in a small box, runs on a car battery, and is smarter than the old giants.

Summary Analogy

Imagine you are trying to find a specific person in a crowded stadium.

  • The Old Way: You scan every single seat in the stadium, looking at every person's face, even if they are sitting in the wrong section. It takes hours.
  • PO-GUISE+: You are told, "The person is wearing a red hat and holding a blue umbrella." You immediately ignore everyone else and only look at the people with red hats and blue umbrellas. You find them instantly, with zero wasted effort.

This paper shows that by teaching AI to look at the right things (the driver and the object) and ignore the wrong things (the empty car interior), we can make life-saving safety technology affordable and fast enough for every car on the road.