Imagine you are trying to guess how a friend is feeling just by watching a video of them talking. Sometimes, they are smiling brightly, but the microphone is broken, so you can't hear their voice. Other times, their face is hidden by a hand or a shadow, but their voice is crystal clear.
In the real world, the "clues" we get from video (sight) and audio (sound) are often messy, unreliable, or missing pieces. This is the big problem the paper "SAGE" tries to solve.
Here is a simple breakdown of what the researchers did, using some everyday analogies.
The Problem: The "Bad Signal" Dilemma
Most computer programs that try to read emotions work like a person who blindly trusts everything they see and hear.
- The Flaw: If the video is blurry (maybe the person turned their head), the computer still tries to guess the emotion based on that blurry face. If the audio is full of static, the computer still tries to guess based on the noise.
- The Result: The computer gets confused and makes wild guesses because it's listening to "bad signals" just as loudly as "good signals."
The Solution: Meet SAGE (The Smart Editor)
The researchers created a new system called SAGE (Stage-Adaptive Reliability Modeling). Think of SAGE not as a robot that sees and hears, but as a smart editor or a conductor managing a band.
Here is how SAGE works, step-by-step:
1. The Two Musicians (Audio and Video)
Imagine two musicians playing a duet to tell a story about emotions:
- The Visual Musician: Plays the face (smiles, frowns).
- The Audio Musician: Plays the voice (tone, pitch, speed).
In a normal video, sometimes the Visual Musician is having a bad day (the camera is shaky), and sometimes the Audio Musician is having a bad day (there is loud construction noise outside).
2. The "Stage-Adaptive" Conductor
Old systems treated both musicians equally all the time. If the Visual Musician was playing a terrible, out-of-tune note, the system still amplified it.
SAGE is different. It acts like a smart conductor who is watching the performance in real-time.
- The "Stage" Concept: The researchers realized that reliability changes over time. In the first 5 seconds, the face might be clear. In the next 5 seconds, the person might cover their mouth.
- The Adjustment: SAGE constantly asks, "Who is playing better right now?"
- If the face is blurry, SAGE turns the volume down on the Visual Musician and turns the volume up on the Audio Musician.
- If the audio is full of static, SAGE does the opposite.
3. The "Confidence Score"
SAGE doesn't just guess; it calculates a confidence score for every single second of the video.
- Analogy: Imagine you are driving in the rain. You have two tools: your eyes and your GPS.
- If it's pouring rain and you can't see the road, your eyes have a low confidence score. You rely more on the GPS.
- If your GPS signal is lost, the GPS has a low confidence score. You rely more on your eyes.
- SAGE makes this same judgment continuously, moment by moment, deciding which "sense" to trust more at each point in the video.
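The core idea above can be sketched numerically. This is a toy illustration, not SAGE's actual architecture (which is a trained neural network): every name and number below is made up for the example. Each modality produces a per-segment emotion prediction plus a quality score, and a softmax over the quality scores becomes the conductor's "volume knob":

```python
import numpy as np

def fuse(visual_preds, audio_preds, visual_quality, audio_quality):
    """Blend per-segment predictions by per-segment confidence.

    Each *_preds array has shape (segments, emotions); each *_quality
    array has one quality score per segment. A softmax across the two
    modalities turns quality into weights that sum to 1.
    """
    quality = np.stack([visual_quality, audio_quality], axis=-1)  # (T, 2)
    weights = np.exp(quality) / np.exp(quality).sum(axis=-1, keepdims=True)
    w_visual, w_audio = weights[:, 0:1], weights[:, 1:2]
    return w_visual * visual_preds + w_audio * audio_preds

# Segment 1: clear face, noisy audio. Segment 2: hidden face, clear audio.
visual_preds = np.array([[0.9, 0.1],   # confident "happy"
                         [0.5, 0.5]])  # blurry face -> uninformative
audio_preds  = np.array([[0.5, 0.5],   # static -> uninformative
                         [0.2, 0.8]])  # confident "sad"
visual_quality = np.array([2.0, -2.0])
audio_quality  = np.array([-2.0, 2.0])

fused = fuse(visual_preds, audio_preds, visual_quality, audio_quality)
# Segment 1 follows the visual guess; segment 2 follows the audio guess.
```

Because the quality gap is large in both segments, the fused prediction in segment 1 tracks the confident visual "happy" guess, and in segment 2 it tracks the confident audio "sad" guess, exactly the conductor behavior described above.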
Why This Matters
The researchers tested SAGE on a huge dataset of real-world videos (people talking in cafes, on the street, in offices) called Aff-Wild2.
- The Result: SAGE recognized emotions more accurately than earlier methods that weighted both signals equally.
- The Takeaway: The secret wasn't making the computer "smarter" or giving it a bigger brain. The secret was teaching it when to trust what it sees and when to trust what it hears.
Summary in One Sentence
SAGE is a smart emotion detector that knows when to ignore a blurry face or noisy audio, acting like a wise editor who only lets the clearest, most reliable clues decide how a person is feeling.