HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification

The HSEmotion team presents a fast, hybrid approach for the ABAW-10 competition. It combines pre-trained EfficientNet-based models with a simple MLP and temporal smoothing for the facial expression, valence-arousal, and action unit tasks, and uses pre-trained architectures for fine-grained violence detection, achieving significant improvements over the existing baselines across all four challenges.

Andrey V. Savchenko, Kseniia Tsypliakova

Published 2026-03-16

Imagine you are trying to understand the mood of a crowd at a chaotic street festival. Some people are shouting, some are crying, some are just standing there, and the lighting is terrible. Now, imagine you have to do this for thousands of people, frame by frame, in a video, and you have to be fast and accurate.

That is essentially what the HSEmotion Team did for the ABAW-10 Competition. This competition is like the "Olympics" for computers trying to understand human emotions and behaviors in real-world videos.

Here is a breakdown of their strategy, explained simply with some creative analogies.

The Big Picture: Four Different Challenges

The team tackled four different tasks, which are like four different games in the same tournament:

  1. Facial Expression Recognition (FER): Guessing if a person is happy, sad, angry, etc., just by looking at their face.
  2. Valence-Arousal (VA) Estimation: Measuring two things: Valence (how positive or negative the feeling is) and Arousal (how calm or excited the person is). Think of it as a 2D map of emotions.
  3. Action Unit (AU) Detection: Spotting tiny muscle movements (like a twitch of the eyebrow or a lip curl) that combine to form a full facial expression.
  4. Fine-Grained Violence Detection: Watching a whole video scene to decide if a fight or violent act is happening, not just looking at faces.

The Secret Sauce: The "Smart Filter" Pipeline

For the first three tasks (the face-related ones), the team didn't try to build a giant, complex robot brain from scratch. Instead, they built a Smart Filter System.

1. The "Expert Eye" (Pre-trained Models)

Imagine you have a super-expert art critic who has seen millions of photos of faces. This critic is so good that they can instantly tell if a face looks "happy" or "sad" with 99% confidence.

  • The Team's Move: They used a pre-trained AI model (called EfficientNet) that acts like this expert. It looks at a video frame and says, "I'm 95% sure this person is smiling."
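The "95% sure" number comes from turning the model's raw scores (logits) into probabilities with a softmax. Here is a minimal sketch of that step; the logit values are made up for illustration, and the real pipeline gets them from the EfficientNet model:

```python
import numpy as np

def softmax_confidence(logits):
    """Turn a model's raw scores (logits) into probabilities and report
    the top class together with its confidence - the expert's '95% sure'."""
    z = np.asarray(logits, dtype=float)
    probs = np.exp(z - z.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return int(probs.argmax()), float(probs.max())

# Made-up logits for three emotion classes: the model strongly favors class 1.
print(softmax_confidence([1.0, 4.0, 0.5]))  # -> class 1, with ~0.93 confidence
```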

2. The "Safety Net" (The MLP Classifier)

But what if the expert is unsure? Maybe the lighting is bad, or the person is wearing sunglasses. The expert might hesitate.

  • The Team's Move: If the expert's confidence is low, the system passes the image to a "junior assistant" (a simple neural network called an MLP). This assistant was trained specifically on the competition's dataset (AffWild2) to learn the specific quirks of the people in the videos.

3. The "Confidence Check" (The Threshold)

Here is the clever part. The system has a rule:

  • If the Expert is super confident (>90%): Trust the Expert immediately. No need to ask the junior assistant.
  • If the Expert is unsure: Ask the junior assistant to take a look.

This saves time and reduces errors. It's like a doctor who handles common colds instantly but refers complex, rare cases to a specialist.
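The gating rule above can be sketched in a few lines. The 0.9 threshold matches the ">90%" in the text, but in practice such a cut-off would be tuned on validation data; the probability vectors here are illustrative:

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.9  # matches the ">90%" rule; tuned on validation data in practice

def predict_expression(expert_probs, mlp_probs, threshold=CONFIDENCE_THRESHOLD):
    """Trust the pre-trained 'expert' model when it is confident,
    otherwise fall back to the dataset-specific MLP 'assistant'."""
    expert_probs = np.asarray(expert_probs, dtype=float)
    if expert_probs.max() >= threshold:
        return int(expert_probs.argmax())       # expert is sure: use its answer
    return int(np.asarray(mlp_probs).argmax())  # expert hesitates: defer to the MLP

# Confident expert: its top class (index 1) wins outright.
print(predict_expression([0.03, 0.95, 0.02], [0.6, 0.2, 0.2]))  # -> 1
# Unsure expert: the MLP's top class (index 0) is used instead.
print(predict_expression([0.40, 0.35, 0.25], [0.7, 0.2, 0.1]))  # -> 0
```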

4. Smoothing the "Jitter" (Sliding Window)

Videos are made of thousands of frames. Sometimes, a computer might think a person is "angry" for one frame, then "neutral" for the next, then "angry" again, just because of a tiny glitch. The predictions flicker like a shaky camera feed.

  • The Team's Move: They used a Sliding Window. Imagine looking at the last 5 seconds of a video at once. If the computer thinks the person is angry for 4 out of 5 seconds, it smooths out the result and says, "Okay, they are definitely angry." This removes the jitter and makes the prediction stable.
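A minimal sketch of this smoothing: average the per-class probabilities over a trailing window of frames before picking the winner. The window size of 5 and the probability values are illustrative, not taken from the paper:

```python
import numpy as np

def smooth_predictions(frame_probs, window=5):
    """Average class probabilities over a trailing window of frames,
    then pick the most likely class - a simple temporal smoother."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    smoothed = []
    for t in range(len(frame_probs)):
        start = max(0, t - window + 1)              # trailing window, no future frames
        smoothed.append(int(frame_probs[start:t + 1].mean(axis=0).argmax()))
    return smoothed

# Two classes ("angry", "neutral"): frame 2 is a one-frame glitch toward class 1,
# but averaging over the window keeps the prediction stable at class 0.
jittery = [[0.8, 0.2], [0.8, 0.2], [0.3, 0.7], [0.8, 0.2], [0.8, 0.2]]
print(smooth_predictions(jittery))  # -> [0, 0, 0, 0, 0]
```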

5. Listening to the Voice (Audio Fusion)

Sometimes a face is hidden, but you can hear the tone of voice.

  • The Team's Move: They added a microphone to the mix. They analyzed the audio (using a tool called wav2vec) and blended it with the visual results. It's like solving a mystery by looking at the suspect and listening to their voice simultaneously.
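One common way to blend two modalities is late fusion: average the per-class probabilities from the face model and the audio model. This sketch assumes a fixed 0.7/0.3 weighting, which is a hypothetical choice for illustration, not the paper's actual fusion scheme:

```python
import numpy as np

def fuse_modalities(visual_probs, audio_probs, visual_weight=0.7):
    """Late fusion: blend per-class probabilities from the face model and
    the audio model with a weighted average (weights are illustrative)."""
    v = np.asarray(visual_probs, dtype=float)
    a = np.asarray(audio_probs, dtype=float)
    return visual_weight * v + (1.0 - visual_weight) * a

# The face is ambiguous (50/50), but the voice clearly favors class 1,
# so the fused prediction tips toward class 1.
fused = fuse_modalities([0.5, 0.5], [0.2, 0.8])
print(fused.argmax())  # -> 1
```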

The Violence Detection: A Different Game

For the Violence Detection task, looking at faces wasn't enough. You need to see the whole body and the scene.

  • The Strategy: Instead of just looking at faces, they looked at the entire frame.
  • The Tool: They used a powerful visual encoder called ConvNeXt (trained on millions of general images) to understand the scene.
  • The Time Machine: They added a TCN (Temporal Convolutional Network). Think of this as a time-traveling lens that looks at how people move over time. It doesn't just see a punch; it sees the wind-up and the follow-through.
  • The Result: This combination was so effective that it beat the competition's baseline by a huge margin. It was like upgrading from a security camera that just takes snapshots to a system that understands the story of the fight.
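The core building block of a TCN is a causal, dilated 1-D convolution over per-frame features: each output at time t looks only at the current frame and a few earlier ones (the "wind-up"), never the future. This toy sketch runs one such convolution over a single scalar feature per frame; a real model would use a multi-layer PyTorch TCN on ConvNeXt feature vectors:

```python
import numpy as np

def causal_dilated_conv(features, kernel, dilation=1):
    """One causal, dilated 1-D convolution - the basic TCN operation.
    Output at time t depends only on frames t, t-dilation, t-2*dilation, ..."""
    features = np.asarray(features, dtype=float)
    out = np.zeros_like(features)
    for t in range(len(features)):
        for i, w in enumerate(kernel):
            idx = t - i * dilation
            if idx >= 0:              # causal: never peek at future frames
                out[t] += w * features[idx]
    return out

# Kernel [1, -1] computes the frame-to-frame change: a sudden jump in the
# "motion" feature (frames 2-3) shows up as a burst, like a punch's wind-up.
print(causal_dilated_conv([0, 0, 1, 3, 3], [1, -1]))  # -> [0. 0. 1. 2. 0.]
```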

Why This Matters

The HSEmotion team's approach is special because it is simple but smart.

  • Old way: Build a massive, heavy, complex AI that eats up all your computer power and is hard to fix.
  • Their way: Use a pre-trained "expert," add a simple "assistant," and use logic to decide who to listen to.

The Analogy:
If the competition was a cooking contest:

  • Other teams tried to build a giant, automated robot chef that could cook anything but took hours to set up.
  • The HSEmotion team used a master chef (the pre-trained model) who knows the basics perfectly. If the master chef is unsure about a specific local ingredient, they ask a local guide (the MLP). They taste the dish, adjust the seasoning (smoothing), and serve it.

The Outcome:
They achieved top-tier results, proving that you don't always need the biggest, heaviest AI to win. Sometimes, a well-organized team of a "smart expert" and a "quick assistant" working together is the most powerful tool of all.
