Imagine you are trying to guess if someone is hesitating or feeling conflicted about a decision just by watching a video of them. Maybe they are talking about whether to quit their job or change their diet.
This is the challenge Team LEYA tackled in the 10th ABAW competition. Their goal was to build a computer program that can spot these subtle "I'm not sure" moments in videos.
Here is how they did it, explained simply:
The Problem: The "Mixed Signal" Puzzle
Hesitation is tricky. It's not like a big smile (happiness) or a loud scream (anger). It's subtle.
- The Analogy: Imagine a person saying, "I love this new job!" with a huge smile, but their voice is shaking, and they are nervously tapping their foot.
- The Conflict: Their words say "Yes," but their body and voice say "Maybe not." This mismatch is called ambivalence. To catch it, a computer can't just listen to the words; it has to watch the face, take in the surroundings, and hear the tone of voice all at once.
The Solution: The "Four-Sense Detective Team"
Team LEYA didn't rely on just one sense. They built a team of four specialized AI detectives, each looking at a different part of the video:
The Scene Detective (The Background):
- What it does: It looks at the whole room, not just the person. Is the room chaotic? Is the lighting dim?
- The Metaphor: Think of this as the "vibe check." Sometimes the environment tells you if someone is stressed or unsure, even if they are trying to hide it. The team used a video model called VideoMAE to understand the flow of the scene.
The Face Detective (The Expression):
- What it does: It zooms in on the person's face to catch micro-expressions.
- The Metaphor: This is like a forensic artist looking for a tiny twitch of the eyebrow or a fleeting frown that lasts only a split second. They took thousands of snapshots of the face and averaged them out to find the "emotional fingerprint."
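The "average the snapshots" step can be sketched in a few lines. This is only an illustration of mean pooling, assuming each face snapshot has already been turned into a list of numbers by some face model; the function name `mean_pool` and the tiny made-up vectors are hypothetical, not taken from the paper.

```python
from typing import List


def mean_pool(frame_embeddings: List[List[float]]) -> List[float]:
    """Average per-frame feature vectors into one clip-level vector."""
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    dim = len(frame_embeddings[0])
    if any(len(vec) != dim for vec in frame_embeddings):
        raise ValueError("all frame embeddings must have the same length")
    n = len(frame_embeddings)
    # Average each dimension across all frames.
    return [sum(vec[i] for vec in frame_embeddings) / n for i in range(dim)]


# Three hypothetical 4-dimensional frame embeddings (made-up numbers).
frames = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
clip_vector = mean_pool(frames)
print(clip_vector)  # [2.0, 1.0, 2.0, 2.0]
```

Averaging throws away the order of the frames, so it keeps the overall "fingerprint" of the face but not the timing of any single twitch.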
The Voice Detective (The Tone):
- What it does: It analyzes the audio, looking for pauses, shaky tones, or changes in pitch.
- The Metaphor: This is the "lie detector" for sound. It listens for the hesitation in a voice that says "I'm fine" but sounds close to tears. The team used a model called Mamba to understand the rhythm of the speech.
The Text Detective (The Words):
- What it does: It reads the transcript of what the person is saying.
- The Metaphor: This is the "logic checker." It looks for words like "maybe," "I guess," or "on the one hand." In this specific challenge, the words turned out to be the strongest single clue.
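To make the "logic checker" idea concrete, here is a toy counter for hedging phrases. It is not how the team's text model works (this summary does not describe that model); it only shows the kind of word-level cue mentioned above. The phrase list `HEDGE_PHRASES` and the function `count_hedges` are hypothetical names chosen for this example.

```python
import re

# Hypothetical phrase list, taken from the examples in the text above.
HEDGE_PHRASES = ("maybe", "i guess", "on the one hand")


def count_hedges(transcript: str) -> int:
    """Count whole-phrase, case-insensitive matches of the hedging phrases."""
    text = transcript.lower()
    total = 0
    for phrase in HEDGE_PHRASES:
        # \b keeps "maybe" from matching inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        total += len(re.findall(pattern, text))
    return total


sample = "Maybe I should quit. On the one hand the pay is good, but I guess I'm bored."
print(count_hedges(sample))  # 3
```

A learned text model can pick up far subtler patterns than a fixed list, but the intuition is the same: certain wording is a strong hint of uncertainty.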
The "Brain": Fusing the Clues
Having four detectives is great, but they need a Chief Investigator to put the clues together.
- The Fusion Model: The team built a special AI "brain" (a Transformer) that takes the reports from all four detectives.
- The "Prototype" Trick: To make the Chief Investigator even smarter, they added a special tool called Prototype-Augmentation.
- The Analogy: Imagine the Chief Investigator has a mental "file cabinet" with perfect examples of "Total Certainty" and "Total Hesitation." When a new video comes in, the system compares the clues against these perfect examples to see which one it matches better. This helps the AI make a more confident guess.
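The "file cabinet" comparison can be sketched as a nearest-prototype check. This is a simplified illustration, not the team's Prototype-Augmentation: it assumes each prototype is just the average of a few example vectors and that "matching better" means higher cosine similarity. The summary does not say how the real prototypes are built or used inside the Transformer, and all names and numbers below are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def class_prototype(examples: List[List[float]]) -> List[float]:
    """Assumption: a prototype is the mean of its example vectors."""
    n = len(examples)
    return [sum(col) / n for col in zip(*examples)]


def nearest_prototype(
    embedding: List[float], prototypes: Dict[str, List[float]]
) -> Tuple[str, Dict[str, float]]:
    """Return the best-matching label and the similarity to every prototype."""
    scores = {label: cosine(embedding, proto) for label, proto in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical 2-dimensional fused embeddings for each class.
prototypes = {
    "certain": class_prototype([[1.0, 0.0], [0.8, 0.2]]),
    "hesitant": class_prototype([[0.0, 1.0], [0.2, 0.8]]),
}
label, scores = nearest_prototype([0.3, 0.7], prototypes)
print(label)  # hesitant
```

The similarity scores can also be passed along as extra evidence instead of being used as the final answer, which is closer in spirit to "helping the Chief Investigator" than replacing it.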
The Results: Teamwork Wins
The team tested their system on a dataset of real people answering questions.
- Solo Acts: When they let just the "Text Detective" work alone, it got about 70% of the answers right. The "Face Detective" alone was much worse (around 62%).
- The Power of the Team: When they combined all four detectives, the score jumped to 83% on the practice tests.
- The Grand Finale: For the final competition, they didn't just use one Chief Investigator. They created an Ensemble, a council of five different investigators who all voted on the answer. This "wisdom of the crowd" approach helped them reach 71.43% accuracy on the final, unseen test data.
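The council's vote can be sketched as a simple majority vote. This assumes each of the five investigators outputs a hard label per video (1 for hesitant, 0 for not); the summary only says they "voted," so the exact scheme, such as averaging probabilities instead, is not known. The function names and the predictions below are made up for illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return the label chosen by the most voters."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(model_outputs: List[List[int]]) -> List[int]:
    """model_outputs[m][i] is model m's label for clip i."""
    # zip(*...) regroups the labels so each tuple holds all votes for one clip.
    return [majority_vote(votes) for votes in zip(*model_outputs)]


# Five hypothetical models, three clips (1 = hesitant, 0 = not hesitant).
outputs = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]
final_labels = ensemble_predict(outputs)
print(final_labels)  # [1, 0, 1]
```

With an odd number of voters and two possible labels, a tie cannot happen, which is one practical reason to use five models rather than four.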
The Big Takeaway
The paper's results suggest that to understand human hesitation, you can't just look at one thing. You need to listen to the words, watch the face, hear the tone, and check the surroundings.
The Lesson: Just like a human detective needs to look at all the evidence to solve a mystery, an AI needs to combine all the senses to understand the complex, messy reality of human emotion. By combining these different "senses" and using a smart voting system, Team LEYA built a stronger "hesitation detector" than any single detective working alone.