Imagine you are trying to guess how a person is feeling just by watching a video of them. It's tricky! They might be smiling but actually sad (a "fake" smile), or they might be shouting because they are angry, or maybe they are just shouting because they are excited. The lighting could be bad, someone might walk in front of the camera, or the person might turn their head away.
This paper is about a team of researchers (Team RAS) who built a super-smart computer program to solve this puzzle. They entered a high-stakes competition called the 10th ABAW Challenge, where the goal is to guess two specific feelings:
- Valence: Is the person happy (positive) or sad (negative)?
- Arousal: Is the person calm (low energy) or excited/agitated (high energy)?
Here is how their "detective team" works, explained with some fun analogies:
The Three Detectives
Instead of relying on just one way to guess the emotion, the team hired three different "detectives" to look at the video. They believe that if you combine their opinions, you get a much better answer.
1. The Face Detective (The Visual Expert)
- What it does: This detective only looks at the person's face. It zooms in on every single frame of the video to see micro-expressions (tiny twitches of the mouth or eyebrows).
- The Tool: It uses a specialized brain called GRADA. Think of this as a veteran actor who has studied thousands of movies to know exactly what a "sad eye" or a "nervous smile" looks like.
- The Job: It translates facial movements into numbers that represent "how happy" or "how intense" the face looks at that exact second.
2. The Behavior Detective (The Storyteller)
- What it does: This is the team's secret weapon. Instead of just looking at pixels, this detective uses a Visual Language Model (Qwen3)—basically, a super-intelligent AI that can "watch" a video and write a description of what's happening.
- The Trick: The researchers asked the AI: "Describe this person's mood based on their posture, gestures, and the scene around them."
- The Analogy: Imagine a human observer sitting next to you watching the video. They might say, "He's leaning back, crossing his arms, and looking at the clock; he seems impatient." The AI does this automatically, turning those observations into a "behavior report."
- The Timekeeper: Since behavior changes over time, they use a Mamba model. Think of Mamba as a very efficient librarian who reads these "behavior reports" in order to understand the story of the emotion, not just isolated snapshots.
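The "librarian reading reports in order" idea can be sketched as a linear state-space recurrence, which is the core mechanism Mamba-style models build on. This is a toy version with made-up dimensions; real Mamba adds input-dependent (selective) parameters and a fast parallel scan.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence: a running hidden state
    summarizes everything seen so far, one step per frame.

    x: (T, d_in) sequence of per-frame behavior embeddings
    A: (d_state, d_state), B: (d_state, d_in), C: (d_out, d_state)
    """
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:                # one "behavior report" per frame
        h = A @ h + B @ x_t      # fold the new frame into the memory
        outputs.append(C @ h)    # read an emotion estimate off the state
    return np.stack(outputs)

# Toy usage: 5 frames of 3-dim behavior embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
A = 0.9 * np.eye(4)              # slowly decaying memory
B = rng.normal(size=(4, 3))
C = rng.normal(size=(2, 4))      # 2 outputs: valence, arousal
y = ssm_scan(x, A, B, C)
print(y.shape)                   # one (valence, arousal) estimate per frame
```

The point of the recurrence is that frame 5's estimate depends on frames 1 through 4, not just the current snapshot.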
3. The Audio Detective (The Voice Analyst)
- What it does: This detective listens to the voice. But there's a catch: in real life, videos are often noisy, or the person might be silent.
- The Filter: Before analyzing the voice, the team uses a clever trick. They check if the person's mouth is actually moving (using a tool called MediaPipe). If the mouth isn't moving, the audio is likely just background noise (like a dog barking or wind), so the detective ignores it.
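The mouth-movement filter might look something like this in spirit. The real system gets lip positions from MediaPipe's face landmarks; here the per-frame measurements are passed in directly, and the thresholds are illustrative, not the paper's.

```python
def is_probably_speaking(mouth_openness, face_heights,
                         ratio_thresh=0.05, min_active=0.3):
    """Decide whether to trust the audio for a clip.

    mouth_openness: per-frame distance between upper and lower lip (pixels)
    face_heights:   per-frame face bounding-box height (pixels)
    Both would come from a landmark tracker such as MediaPipe; the
    threshold values here are made up for illustration.
    """
    # Normalize mouth opening by face size, flag frames with an open mouth
    active = [o / h > ratio_thresh for o, h in zip(mouth_openness, face_heights)]
    # Keep the audio only if enough frames show an open mouth
    return sum(active) / len(active) >= min_active

# Toy clip: the mouth opens in the second half
openness = [1, 1, 2, 9, 10, 8]
heights = [100] * 6
print(is_probably_speaking(openness, heights))  # True

# Closed mouth the whole time: the audio is treated as background noise
print(is_probably_speaking([1] * 6, heights))   # False
```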
- The Tool: It uses WavLM, a model trained to understand the "tone" of speech. It listens for the energy in the voice (is it a whisper or a scream?) and the mood (is it a cheerful tone or a grumpy one?).
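The "whisper or a scream" cue can be illustrated with short-time RMS energy, a hand-crafted stand-in for the learned features WavLM actually produces (the real model extracts far richer representations than raw loudness).

```python
import numpy as np

def frame_rms(signal, frame_len=400, hop=160):
    """Short-time RMS energy per frame: a crude loudness measure,
    shown only to make the 'energy in the voice' idea concrete."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sqrt(np.mean(f ** 2)) for f in frames])

# Toy signal at 16 kHz: a quiet tone followed by a loud one
t = np.linspace(0, 1, 16000)
sig = np.concatenate([0.05 * np.sin(2 * np.pi * 220 * t),
                      0.8 * np.sin(2 * np.pi * 220 * t)])
rms = frame_rms(sig)
print(rms[:3].mean() < rms[-3:].mean())  # True: the loud half has more energy
```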
The Boss: The Fusion Strategy
Now, the team has three different opinions. How do they decide on the final answer? They tried two different "Boss" strategies to combine the detectives' reports:
Strategy A: The "Expert Panel" (Directed Cross-Modal MoE)
Imagine a roundtable meeting where the Face, Behavior, and Audio detectives argue with each other.
- The "Boss" (a gating mechanism) listens to them.
- If the video is dark and the face is hard to see, the Boss says, "Ignore the Face Detective; listen to the Audio and Behavior detectives!"
- If the person is silent, the Boss says, "Ignore the Audio; focus on the Face!"
- It dynamically weights who gets to speak the loudest based on who has the best information at that moment.
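The roundtable above can be sketched as a softmax gate over the three experts. In the paper the gate is learned from the modality features themselves; here a hand-supplied `quality` score stands in for that, so this is a shape sketch, not the actual method.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

def gated_fusion(face_pred, behavior_pred, audio_pred, quality):
    """Weight each expert's (valence, arousal) vote by a gating score.
    'quality' is a stand-in for whatever the learned gate computes."""
    preds = np.stack([face_pred, behavior_pred, audio_pred])  # (3, 2)
    weights = softmax(quality)    # who gets to speak the loudest
    return weights @ preds        # fused (valence, arousal)

# Dark video: the face's quality score is very low, so its vote barely counts
fused = gated_fusion(face_pred=np.array([0.9, 0.1]),
                     behavior_pred=np.array([-0.4, 0.3]),
                     audio_pred=np.array([-0.5, 0.4]),
                     quality=[-4.0, 1.0, 1.0])
print(fused)  # close to the average of the behavior and audio experts
```

Even though the face expert votes strongly positive, the fused valence comes out negative because the gate has effectively muted it.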
Strategy B: The "Reliable Frame" (Reliability-Aware Audio-Visual)
This strategy is a bit more structured.
- It trusts the Face and Behavior detectives to make the frame-by-frame prediction of the emotion.
- It uses the Audio detective as "background context": the audio doesn't overwrite the frame-by-frame decision, it nudges it, with the size of the nudge scaled by how reliable the audio is.
- Analogy: It's like watching a movie with subtitles. The visuals tell you what is happening, but the subtitles (audio context) help you understand the nuance, even if you can't hear the sound clearly.
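One hedged reading of the "reliable frame" idea: a clip-level audio context is added to every visual frame prediction, scaled by an audio-reliability score (for example, the output of the mouth-movement check). The names and the simple additive form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def reliability_fusion(visual_preds, audio_context, audio_reliability):
    """Visual streams set the frame-by-frame (valence, arousal) prediction;
    a clip-level audio context nudges every frame, scaled by how much the
    audio can be trusted."""
    visual_preds = np.asarray(visual_preds, dtype=float)   # (T, 2)
    return visual_preds + audio_reliability * np.asarray(audio_context)

frames = [[0.2, 0.1], [0.3, 0.2], [0.1, 0.4]]

# Silent clip: reliability 0 means the audio leaves the visuals untouched
untouched = reliability_fusion(frames, audio_context=[0.5, 0.5],
                               audio_reliability=0.0)
print(np.allclose(untouched, frames))  # True

# Clear speech: the audio shifts every frame by the same context vector
nudged = reliability_fusion(frames, audio_context=[0.5, 0.5],
                            audio_reliability=1.0)
```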
The Result
The team tested their system on a massive dataset of real-world videos (people in parks, offices, cars, etc.).
- The Winner: The Reliable Frame (Strategy B) approach worked best.
- The Score: They achieved a score of 0.658 (on a scale where 1.0 is perfect). This is a very strong result, beating many previous attempts.
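The ABAW valence-arousal track is conventionally scored with the Concordance Correlation Coefficient (CCC), which is 1.0 only when predictions match the labels in both correlation and scale; the 0.658 here is presumably an average CCC over valence and arousal. A minimal implementation:

```python
import numpy as np

def ccc(pred, true):
    """Concordance Correlation Coefficient:
    2*cov(pred, true) / (var(pred) + var(true) + (mean gap)^2)."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    mp, mt = pred.mean(), true.mean()
    cov = ((pred - mp) * (true - mt)).mean()
    return 2 * cov / (pred.var() + true.var() + (mp - mt) ** 2)

print(ccc([0.1, 0.5, 0.9], [0.1, 0.5, 0.9]))   # 1.0: perfect agreement
print(ccc([0.6, 1.0, 1.4], [0.1, 0.5, 0.9]))   # < 1.0: right shape, wrong offset
```

Unlike plain correlation, CCC punishes a model that tracks the ups and downs but is systematically too high or too low.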
Why This Matters
The big takeaway is that combining different types of information is better than just looking at the face.
- Sometimes a person's face is blank, but their voice is shaking (Audio wins).
- Sometimes the audio is noisy, but the person is slumping in their chair (Behavior wins).
- By letting a "Storyteller AI" (Qwen) describe the behavior and mixing it with face and voice data, the computer becomes much better at understanding human emotions in the messy, real world.
In short, Team RAS built a digital emotion detective squad that doesn't just look at a face; it listens, observes body language, and reads the room to figure out how someone truly feels.