Imagine you are trying to guess how a friend is feeling just by watching them. If you only look at their face, you might get it wrong because they could be smiling while crying (a "smile" that isn't happy). If you only listen to their voice, you might miss the fact that they are shaking with anger.
This paper is about building a computer program that acts like a careful observer. It doesn't just look at the face or listen to the voice; it does both at the same time, and it does it in a way that mimics how humans actually understand emotions.
Here is the breakdown of their "recipe" for understanding emotions, explained simply:
1. The Two Super-Experts (The Backbones)
Instead of teaching the computer from scratch, the authors hired two "experts" who have already read millions of books and watched millions of videos.
- The Visual Expert (CLIP): This is like a librarian who has seen every picture in the world and knows exactly what a "sad face" or a "surprised face" looks like.
- The Audio Expert (Wav2Vec 2.0): This is like a music critic who has heard every sound and knows the difference between a happy laugh and a nervous giggle.
The computer uses these two experts as its eyes and ears, but it doesn't let them change their minds (they are "frozen") because they are already so good at their jobs.
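In code, "freezing" the experts just means switching off gradient updates for their weights while a small new head on top stays trainable. Here is a minimal sketch of that idea in PyTorch; the two `nn.Linear` stand-ins and the layer sizes are placeholders for illustration, not the paper's actual CLIP and Wav2Vec 2.0 models.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the pretrained backbones. In the paper these would be
# CLIP's image encoder and Wav2Vec 2.0's audio encoder; the dimensions
# here are made up for the sketch.
visual_expert = nn.Linear(512, 256)   # pretend: CLIP image features -> embedding
audio_expert = nn.Linear(768, 256)    # pretend: Wav2Vec 2.0 features -> embedding

# "Freeze" both experts so training never changes their minds.
for expert in (visual_expert, audio_expert):
    for param in expert.parameters():
        param.requires_grad = False

# Only the small head stacked on top remains trainable.
fusion_head = nn.Linear(256 + 256, 8)  # e.g. 8 emotion classes

trainable = [p for p in fusion_head.parameters() if p.requires_grad]
frozen = [p for p in visual_expert.parameters() if not p.requires_grad]
print(len(trainable), len(frozen))  # the head learns; the experts stay fixed
```

The payoff is practical: the optimizer only ever touches the small head, so training is cheap and the experts' hard-won knowledge can't be accidentally overwritten.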
2. The Time Machine (Temporal Modeling)
Emotions aren't just a single photo; they are a movie. A smile might start small and grow big, or a frown might appear suddenly.
- The Problem: If you just look at one frame, you miss the story.
- The Solution: The authors added a Temporal Convolutional Network (TCN). Think of this as a movie editor. Instead of looking at one still photo, the editor watches a short clip (30 to 60 seconds) to see how the expression evolves. It helps the computer understand that a "frown" followed by a "tear" is different from a "frown" followed by a "laugh."
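A temporal convolutional network is usually built from stacked 1D convolutions with growing dilation, so each layer "watches" a wider stretch of frames than the last. The sketch below shows that structure in miniature; the channel count, dilations, and clip length are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TinyTCN(nn.Module):
    """A minimal temporal convolutional stack (a sketch, not the paper's model).

    Each layer is a dilated 1D convolution, so deeper layers see a wider
    window of frames -- the "movie editor" watching how features evolve.
    """
    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        layers = []
        for dilation in (1, 2, 4):  # receptive field grows with each layer
            layers += [
                nn.Conv1d(channels, channels, kernel_size,
                          padding=dilation, dilation=dilation),
                nn.ReLU(),
            ]
        self.net = nn.Sequential(*layers)

    def forward(self, x):            # x: (batch, channels, frames)
        return self.net(x)

# 1 clip, a 256-dim feature per frame, 90 frames (e.g. ~3 s at 30 fps)
frames = torch.randn(1, 256, 90)
out = TinyTCN()(frames)
print(out.shape)  # torch.Size([1, 256, 90]): one temporally-smoothed feature per frame
```

Because padding matches the dilation, every frame keeps its slot in the output, but each output feature now summarizes a neighborhood of frames instead of a single snapshot.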
3. The Great Conversation (Bi-directional Cross-Attention)
This is the most clever part. Usually, computers just glue the "face data" and the "voice data" together like two separate puzzle pieces. But the authors wanted them to talk to each other.
- The Analogy: Imagine a detective (the Visual Expert) and a witness (the Audio Expert) trying to solve a crime.
- One-way: The detective asks the witness, "What did you hear?"
- Bi-directional (The Paper's Method): They have a two-way conversation. The detective asks, "What did you hear?" AND the witness asks, "What did you see?"
- Why it matters: If the face is blurry (bad lighting), the witness (voice) can say, "Hey, I heard a scream, so they must be scared!" If the voice is quiet, the detective (face) can say, "I see wide eyes, so they must be shocked!" They fill in each other's gaps.
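The two-way conversation maps directly onto two cross-attention modules running in opposite directions. Here is a compact sketch using PyTorch's built-in attention; the dimensions, head count, and sequence lengths are assumptions for illustration, not the authors' actual settings.

```python
import torch
import torch.nn as nn

# Direction 1: visual tokens query the audio tokens ("What did you hear?").
# Direction 2: audio tokens query the visual tokens ("What did you see?").
dim = 256
v_asks_a = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
a_asks_v = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

visual = torch.randn(1, 90, dim)    # 90 per-frame face features
audio = torch.randn(1, 120, dim)    # 120 per-frame voice features

# Each stream is enriched with what it learns from the other.
visual_enriched, _ = v_asks_a(query=visual, key=audio, value=audio)
audio_enriched, _ = a_asks_v(query=audio, key=visual, value=visual)

print(visual_enriched.shape, audio_enriched.shape)
# Each stream keeps its own length, but its content now mixes in the other modality.
```

This is why the "gap-filling" works: when the face features are weak (blurry frames), the visual stream's queries can still pull strong evidence out of the audio, and vice versa.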
4. The Translator (Text-Guided Contrastive Learning)
To make sure the Visual Expert and the Audio Expert are on the same page, the authors added a Translator.
- They use text (like the word "Angry" or "Happy") as a bridge.
- The computer is trained to make sure the picture of an angry face and the sound of an angry voice both point to the same text label. It's like forcing the picture and the sound to hold hands with the word "Angry" so they all agree on what is happening.
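"Holding hands with the word" is typically implemented as a contrastive loss: each clip's face and voice embeddings are pulled toward the text embedding of their own label and pushed away from the labels of other clips in the batch. The sketch below shows that pattern in simplified form; the random embeddings, dimensions, and temperature are placeholder assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

# Suppose we have L2-normalized embeddings for each clip's face, voice, and
# the text of its label (e.g. "angry"), all projected into a shared space.
# Random tensors stand in for real model outputs in this sketch.
face = F.normalize(torch.randn(4, 256), dim=-1)    # batch of 4 clips
voice = F.normalize(torch.randn(4, 256), dim=-1)
text = F.normalize(torch.randn(4, 256), dim=-1)    # each clip's label word

def contrastive_to_text(modality, text, temperature=0.07):
    """Pull each clip's embedding toward ITS label text, push away from the rest."""
    logits = modality @ text.t() / temperature      # (4, 4) similarity matrix
    targets = torch.arange(modality.size(0))        # the diagonal is the match
    return F.cross_entropy(logits, targets)

# Both experts are trained against the same text anchors, so an angry face
# and an angry voice end up pointing at the same place in the space.
loss = contrastive_to_text(face, text) + contrastive_to_text(voice, text)
print(loss.item())
```

Because the text anchors are shared, the loss never has to compare face and voice directly; the word does the translating between them.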
The Results: Did it work?
The team tested this system in the 10th ABAW Challenge, which is basically the "Olympics" of emotion recognition. The test videos were messy, real-world scenarios (bad lighting, noisy crowds, people moving around).
- The Old Way: The official baseline (a simple reference model provided by the challenge organizers) got a score of 0.25.
- The New Way: Their system got a score of 0.33.
While 0.33 might not sound like a perfect score, in the world of messy, real-world emotion recognition, that is a huge jump. It proves that when you combine a movie editor (time), a two-way conversation (fusion), and a translator (text), the computer gets much better at guessing how people really feel.
In a Nutshell
This paper teaches a computer to be a better detective by:
- Using two experts who already know faces and voices.
- Watching the whole movie instead of just one frame.
- Making the face and voice experts talk to each other to solve the mystery.
- Using words to make sure they all agree on the answer.
It's a step forward in making computers that can truly understand human feelings, even in a noisy, chaotic world.