Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

This paper proposes a relation-aware denoising and diffusion attention fusion model for multimodal conversation emotion recognition that employs a differential Transformer to suppress noise in audio and video, utilizes relation subgraphs to capture speaker-dependent emotional dependencies, and leverages a text-guided diffusion mechanism to achieve robust, semantically aligned multimodal fusion.

Ying Liu, Yuntao Shou, Wei Ai, Tao Meng, Keqin Li

Published 2026-03-30

Imagine you are trying to understand the mood of a group of friends having a lively conversation at a noisy coffee shop. You have three ways to gather information:

  1. What they say (Text).
  2. How they sound (Audio).
  3. What their faces look like (Visual).

In a perfect world, all three would be crystal clear. But in the real world, the coffee shop is loud (audio is noisy), the lighting is bad (faces are blurry), and sometimes people mumble. If you just blindly combine all this messy information, you might get the wrong idea about how they are feeling.

This paper proposes a new "super-smart" system to solve exactly this problem. Let's break down how it works using simple analogies.

1. The Problem: The "Noisy Coffee Shop"

Current AI systems often treat the audio and video signals as if they are just as reliable as the text. But in reality, audio and video are full of "static" (background noise, bad lighting, camera shakes).

  • The Issue: If the AI tries to listen to a mumbled voice over a loud blender, it might think the person is angry when they are actually just confused.
  • The Imbalance: Text is usually the most reliable source of emotion (words carry the clearest meaning), but current AI often gives equal weight to the messy audio and video, which dilutes the good information.

2. The Solution: The "ReDiFu" System

The authors call their new model ReDiFu (Relational Graph-driven Differential Denoising and Diffusion Attention Fusion). That's a mouthful, so let's translate it into three simple steps:

Step A: The "Difference Detective" (Differential Denoising)

The Analogy: Imagine you are watching a security camera. If the camera is shaking, the whole room looks like it's vibrating. But if a person actually moves, only they move.

  • How it works: Instead of looking at the audio or video frame-by-frame, this system looks at the difference between the current moment and the moment just before it.
  • The Magic: Background noise (like a constant hum or a flickering light) stays roughly the same from one moment to the next, so when you subtract the "before" from the "now," the noise cancels out. Only the changes (like a sudden laugh or a frown) remain. It's like noise-canceling headphones, which cancel the steady hum but let new sounds through.
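The core subtraction idea can be sketched in a few lines. This is a minimal toy illustration of frame-to-frame differencing, not the paper's actual differential Transformer; the array shapes and the name `temporal_difference` are assumptions made for the example.

```python
import numpy as np

def temporal_difference(frame_feats: np.ndarray) -> np.ndarray:
    """Return frame-to-frame differences; constant background cancels out.

    frame_feats: array of shape (T, D) -- T time steps, D feature dims.
    The difference at t=0 is defined as zero (nothing precedes frame 0).
    """
    diffs = np.zeros_like(frame_feats)
    diffs[1:] = frame_feats[1:] - frame_feats[:-1]
    return diffs

# A constant "hum" plus one sudden event at t=2:
hum = np.ones((5, 3)) * 0.7      # steady background noise, same every frame
event = np.zeros((5, 3))
event[2] = 1.0                   # a sudden laugh/frown at t=2
signal = hum + event

d = temporal_difference(signal)
# The steady hum subtracts away; only the onset of the event at t=2
# (and its end at t=3) survives in the differences.
```

Note that differencing removes anything constant, which is exactly why a steady hum or fixed lighting disappears while a sudden expression change stands out.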

Step B: The "Social Map" (Relation Subgraphs)

The Analogy: Imagine a conversation is a dance. Sometimes you are dancing with yourself (thinking about your own past words), and sometimes you are dancing with a partner (reacting to what they just said).

  • How it works: The system builds two separate maps (graphs):
    1. Intra-speaker: How does this person's current mood relate to what they said 5 minutes ago?
    2. Inter-speaker: How does this person react to what the other person just said?
  • The Benefit: By separating these two types of relationships, the AI understands the flow of the conversation much better than if it just looked at a giant, messy pile of data.
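The two-map idea above can be sketched as a simple edge-splitting routine. This is an illustrative toy, not the paper's exact graph construction: the `window` size and the function name are assumptions, and real models would attach features and edge weights to these graphs.

```python
def build_subgraphs(speakers, window=2):
    """Connect each utterance to the previous `window` utterances,
    routing each edge into the intra- or inter-speaker subgraph.

    speakers: list of speaker IDs, one per utterance, in order.
    Returns (intra_edges, inter_edges) as lists of (earlier, later) pairs.
    """
    intra, inter = [], []
    for i, s in enumerate(speakers):
        for j in range(max(0, i - window), i):
            if speakers[j] == s:
                intra.append((j, i))   # same speaker: "dancing with yourself"
            else:
                inter.append((j, i))   # different speaker: reacting to a partner
    return intra, inter

# Two people alternating turns: A, B, A, B
speakers = ["A", "B", "A", "B"]
intra, inter = build_subgraphs(speakers)
# intra: [(0, 2), (1, 3)]  -- each person's link to their own earlier turn
# inter: [(0, 1), (1, 2), (2, 3)]  -- each reply to the other person
```

Keeping the two edge lists separate is the whole point: a graph network can then learn different "reaction" rules for self-continuity versus partner responses, instead of averaging them together.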

Step C: The "Text-Led Diffusion" (Diffusion Attention Fusion)

The Analogy: Imagine the Text is the Captain of a ship, and Audio/Visual are the Crew.

  • Old Way: The Captain and the Crew all shout their opinions at once, and the AI tries to guess who is right.
  • New Way: The Captain (Text) is the boss. The system lets the Audio and Visual data "diffuse" (flow) into the Captain's understanding.
  • How it works: The AI uses the clear text as a guide. It asks, "Based on what was said, does this blurry face or mumbled voice make sense?" It then carefully blends the audio and video into the text stream, but only the parts that match the text's story. This ensures the messy data doesn't override the clear data.
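The "Captain asks the Crew" step can be sketched as text-queried cross-attention. This is a deliberately simplified stand-in for the paper's diffusion attention fusion: here the text features act as queries that decide how much of each audio and visual moment to absorb. All names and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_fusion(text, audio, visual):
    """text: (T, D) queries; audio, visual: (S, D) keys/values.

    For each text step, attention scores ask "how well does this
    audio/visual moment match what was said?" and only the matching
    parts are blended into the text stream.
    """
    d = text.shape[-1]
    fused = text.copy()
    for modality in (audio, visual):
        attn = softmax(text @ modality.T / np.sqrt(d))  # (T, S) match weights
        fused = fused + attn @ modality                 # absorb matching parts
    return fused

rng = np.random.default_rng(0)
text, audio, visual = (rng.normal(size=(4, 8)) for _ in range(3))
out = text_guided_fusion(text, audio, visual)
# out keeps the text stream's shape: (4, 8)
```

The design choice to make text the query (rather than concatenating all three streams equally) is what prevents a mumbled voice or blurry face from overriding the clear words.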

3. The Result: A Clearer Picture

When the researchers tested this system on real-world datasets (like movie scripts and TV show dialogues), it worked significantly better than previous methods.

  • It got rid of the static: By using the "Difference Detective," it ignored the background noise.
  • It understood the context: By using the "Social Map," it knew who was talking to whom.
  • It trusted the right source: By letting Text lead the "Diffusion," it ensured the final emotion guess was based on the most reliable information.

Summary

Think of this paper as teaching an AI to be a better listener in a noisy room. Instead of getting confused by the background noise or trying to guess the mood from a blurry face, it:

  1. Filters out the static by looking for changes.
  2. Maps out who is talking to whom.
  3. Uses the spoken words as the "truth anchor" to make sense of the messy sounds and images.

The result is an AI that can accurately tell if someone is happy, sad, or angry, even when the recording isn't perfect.