HSEmotion Team at ABAW-10 Competition: Facial Expression Recognition, Valence-Arousal Estimation, Action Unit Detection and Fine-Grained Violence Classification

The HSEmotion team presents a fast, hybrid approach for the ABAW-10 competition. It combines pre-trained EfficientNet-based models with a simple MLP and temporal smoothing for the facial expression, valence-arousal, and action unit tasks, and uses pre-trained architectures for fine-grained violence detection, achieving significant improvements over the existing baselines across all four challenges.

Andrey V. Savchenko, Kseniia Tsypliakova

Published 2026-03-16

Imagine you are trying to understand the mood of a crowd at a chaotic street festival. Some people are shouting, some are crying, some are just standing there, and the lighting is terrible. Now, imagine you have to do this for thousands of people, frame by frame, in a video, and you have to be fast and accurate.

That is essentially what the HSEmotion Team did for the ABAW-10 Competition. This competition is like the "Olympics" for computers trying to understand human emotions and behaviors in real-world videos.

Here is a breakdown of their strategy, explained simply with some creative analogies.

The Big Picture: Four Different Challenges

The team tackled four different tasks, which are like four different games in the same tournament:

  1. Facial Expression Recognition (FER): Guessing if a person is happy, sad, angry, etc., just by looking at their face.
  2. Valence-Arousal (VA) Estimation: Measuring two things: Valence (how positive or negative the feeling is) and Arousal (how calm or excited the person is). Think of it as a 2D map of emotions.
  3. Action Unit (AU) Detection: Spotting tiny muscle movements (like a twitch of the eyebrow or a lip curl) that combine to form a full facial expression.
  4. Fine-Grained Violence Detection: Watching a whole video scene to decide if a fight or violent act is happening, not just looking at faces.

The Secret Sauce: The "Smart Filter" Pipeline

For the first three tasks (the face-related ones), the team didn't try to build a giant, complex robot brain from scratch. Instead, they built a Smart Filter System.

1. The "Expert Eye" (Pre-trained Models)

Imagine you have a super-expert art critic who has seen millions of photos of faces. This critic is so good that they can instantly tell if a face looks "happy" or "sad" with 99% confidence.

  • The Team's Move: They used a pre-trained AI model (called EfficientNet) that acts like this expert. It looks at a video frame and says, "I'm 95% sure this person is smiling."
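The "95% sure" number comes from turning the model's raw scores (logits) into probabilities with a softmax. Here is a minimal sketch of that step; the logit values are made up for illustration, and the real pipeline gets them from the EfficientNet model:

```python
import numpy as np

def softmax_confidence(logits):
    """Turn a model's raw scores (logits) into probabilities and report
    the top class together with its confidence - the expert's '95% sure'."""
    z = np.asarray(logits, dtype=float)
    probs = np.exp(z - z.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return int(probs.argmax()), float(probs.max())

# Made-up logits for three emotion classes: the model strongly favors class 1.
print(softmax_confidence([1.0, 4.0, 0.5]))  # -> class 1, with ~0.93 confidence
```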

2. The "Safety Net" (The MLP Classifier)

But what if the expert is unsure? Maybe the lighting is bad, or the person is wearing sunglasses. The expert might hesitate.

  • The Team's Move: If the expert's confidence is low, the system passes the image to a "junior assistant" (a simple neural network called an MLP). This assistant was trained specifically on the competition's dataset (AffWild2) to learn the specific quirks of the people in the videos.

3. The "Confidence Check" (The Threshold)

Here is the clever part. The system has a rule:

  • If the Expert is super confident (>90%): Trust the Expert immediately. No need to ask the junior assistant.
  • If the Expert is unsure: Ask the junior assistant to take a look.

This saves time and reduces errors. It's like a doctor who handles common colds instantly but refers complex, rare cases to a specialist.
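The gating rule above can be sketched in a few lines. The 0.9 threshold matches the ">90%" in the text, but in practice such a cut-off would be tuned on validation data; the probability vectors here are illustrative:

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.9  # matches the ">90%" rule; tuned on validation data in practice

def predict_expression(expert_probs, mlp_probs, threshold=CONFIDENCE_THRESHOLD):
    """Trust the pre-trained 'expert' model when it is confident,
    otherwise fall back to the dataset-specific MLP 'assistant'."""
    expert_probs = np.asarray(expert_probs, dtype=float)
    if expert_probs.max() >= threshold:
        return int(expert_probs.argmax())       # expert is sure: use its answer
    return int(np.asarray(mlp_probs).argmax())  # expert hesitates: defer to the MLP

# Confident expert: its top class (index 1) wins outright.
print(predict_expression([0.03, 0.95, 0.02], [0.6, 0.2, 0.2]))  # -> 1
# Unsure expert: the MLP's top class (index 0) is used instead.
print(predict_expression([0.40, 0.35, 0.25], [0.7, 0.2, 0.1]))  # -> 0
```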

4. Smoothing the "Jitter" (Sliding Window)

Videos are made of thousands of frames. Sometimes, a computer might think a person is "angry" for one frame, then "neutral" for the next, then "angry" again, just because of a tiny glitch. The predictions flicker like a shaky camera feed.

  • The Team's Move: They used a Sliding Window. Imagine looking at the last 5 seconds of a video at once. If the computer thinks the person is angry for 4 out of 5 seconds, it smooths out the result and says, "Okay, they are definitely angry." This removes the jitter and makes the prediction stable.
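A minimal sketch of this smoothing: average the per-class probabilities over a trailing window of frames before picking the winner. The window size of 5 and the probability values are illustrative, not taken from the paper:

```python
import numpy as np

def smooth_predictions(frame_probs, window=5):
    """Average class probabilities over a trailing window of frames,
    then pick the most likely class - a simple temporal smoother."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    smoothed = []
    for t in range(len(frame_probs)):
        start = max(0, t - window + 1)              # trailing window, no future frames
        smoothed.append(int(frame_probs[start:t + 1].mean(axis=0).argmax()))
    return smoothed

# Two classes ("angry", "neutral"): frame 2 is a one-frame glitch toward class 1,
# but averaging over the window keeps the prediction stable at class 0.
jittery = [[0.8, 0.2], [0.8, 0.2], [0.3, 0.7], [0.8, 0.2], [0.8, 0.2]]
print(smooth_predictions(jittery))  # -> [0, 0, 0, 0, 0]
```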

5. Listening to the Voice (Audio Fusion)

Sometimes a face is hidden, but you can hear the tone of voice.

  • The Team's Move: They added a microphone to the mix. They analyzed the audio (using a tool called wav2vec) and blended it with the visual results. It's like solving a mystery by looking at the suspect and listening to their voice simultaneously.
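One common way to blend two modalities is late fusion: average the per-class probabilities from the face model and the audio model. This sketch assumes a fixed 0.7/0.3 weighting, which is a hypothetical choice for illustration, not the paper's actual fusion scheme:

```python
import numpy as np

def fuse_modalities(visual_probs, audio_probs, visual_weight=0.7):
    """Late fusion: blend per-class probabilities from the face model and
    the audio model with a weighted average (weights are illustrative)."""
    v = np.asarray(visual_probs, dtype=float)
    a = np.asarray(audio_probs, dtype=float)
    return visual_weight * v + (1.0 - visual_weight) * a

# The face is ambiguous (50/50), but the voice clearly favors class 1,
# so the fused prediction tips toward class 1.
fused = fuse_modalities([0.5, 0.5], [0.2, 0.8])
print(fused.argmax())  # -> 1
```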

The Violence Detection: A Different Game

For the Violence Detection task, looking at faces wasn't enough. You need to see the whole body and the scene.

  • The Strategy: Instead of just looking at faces, they looked at the entire frame.
  • The Tool: They used a powerful visual encoder called ConvNeXt (trained on millions of general images) to understand the scene.
  • The Time Machine: They added a TCN (Temporal Convolutional Network). Think of this as a time-traveling lens that looks at how people move over time. It doesn't just see a punch; it sees the wind-up and the follow-through.
  • The Result: This combination was so effective that it beat the competition's baseline by a huge margin. It was like upgrading from a security camera that just takes snapshots to a system that understands the story of the fight.
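The core building block of a TCN is a causal, dilated 1-D convolution over per-frame features: each output at time t looks only at the current frame and a few earlier ones (the "wind-up"), never the future. This toy sketch runs one such convolution over a single scalar feature per frame; a real model would use a multi-layer PyTorch TCN on ConvNeXt feature vectors:

```python
import numpy as np

def causal_dilated_conv(features, kernel, dilation=1):
    """One causal, dilated 1-D convolution - the basic TCN operation.
    Output at time t depends only on frames t, t-dilation, t-2*dilation, ..."""
    features = np.asarray(features, dtype=float)
    out = np.zeros_like(features)
    for t in range(len(features)):
        for i, w in enumerate(kernel):
            idx = t - i * dilation
            if idx >= 0:              # causal: never peek at future frames
                out[t] += w * features[idx]
    return out

# Kernel [1, -1] computes the frame-to-frame change: a sudden jump in the
# "motion" feature (frames 2-3) shows up as a burst, like a punch's wind-up.
print(causal_dilated_conv([0, 0, 1, 3, 3], [1, -1]))  # -> [0. 0. 1. 2. 0.]
```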

Why This Matters

The HSEmotion team's approach is special because it is simple but smart.

  • Old way: Build a massive, heavy, complex AI that eats up all your computer power and is hard to fix.
  • Their way: Use a pre-trained "expert," add a simple "assistant," and use logic to decide who to listen to.

The Analogy:
If the competition was a cooking contest:

  • Other teams tried to build a giant, automated robot chef that could cook anything but took hours to set up.
  • The HSEmotion team used a master chef (the pre-trained model) who knows the basics perfectly. If the master chef is unsure about a specific local ingredient, they ask a local guide (the MLP). They taste the dish, adjust the seasoning (smoothing), and serve it.

The Outcome:
They achieved top-tier results, proving that you don't always need the biggest, heaviest AI to win. Sometimes, a well-organized team of a "smart expert" and a "quick assistant" working together is the most powerful tool of all.
