Relational graph-driven differential denoising and diffusion attention fusion for multimodal conversation emotion recognition

This paper proposes a relation-aware denoising and diffusion attention fusion model for multimodal conversation emotion recognition that employs a differential Transformer to suppress noise in audio and video, utilizes relation subgraphs to capture speaker-dependent emotional dependencies, and leverages a text-guided diffusion mechanism to achieve robust, semantically aligned multimodal fusion.

Ying Liu, Yuntao Shou, Wei Ai, Tao Meng, Keqin Li

Published 2026-03-30

Imagine you are trying to understand the mood of a group of friends having a lively conversation at a noisy coffee shop. You have three ways to gather information:

  1. What they say (Text).
  2. How they sound (Audio).
  3. What their faces look like (Visual).

In a perfect world, all three would be crystal clear. But in the real world, the coffee shop is loud (audio is noisy), the lighting is bad (faces are blurry), and sometimes people mumble. If you just blindly combine all this messy information, you might get the wrong idea about how they are feeling.

This paper proposes a new "super-smart" system to solve exactly this problem. Let's break down how it works using simple analogies.

1. The Problem: The "Noisy Coffee Shop"

Current AI systems often treat the audio and video signals as if they are just as reliable as the text. But in reality, audio and video are full of "static" (background noise, bad lighting, camera shakes).

  • The Issue: If the AI tries to listen to a mumbled voice over a loud blender, it might think the person is angry when they are actually just confused.
  • The Imbalance: Text is usually the most reliable source of emotion (words carry the clearest meaning), but current AI often gives equal weight to the messy audio and video, which dilutes the good information.

2. The Solution: The "ReDiFu" System

The authors call their new model ReDiFu (Relational Graph-driven Differential Denoising and Diffusion Attention Fusion). That's a mouthful, so let's translate it into three simple steps:

Step A: The "Difference Detective" (Differential Denoising)

The Analogy: Imagine you are watching a security camera. If the camera is shaking, the whole room looks like it's vibrating. But if a person actually moves, only they move.

  • How it works: Instead of looking at the audio or video frame-by-frame, this system looks at the difference between the current moment and the moment just before it.
  • The Magic: Background noise (like a constant hum or a flickering light) stays roughly the same from one moment to the next, so when you subtract the "before" from the "now," the noise cancels out. Only the changes (like a sudden laugh or a frown) remain. It's like noise-canceling headphones, which cancel the steady hum but let new sounds through.
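The core subtraction idea can be sketched in a few lines. This is a minimal toy illustration of frame-to-frame differencing, not the paper's actual differential Transformer; the array shapes and the name `temporal_difference` are assumptions made for the example.

```python
import numpy as np

def temporal_difference(frame_feats: np.ndarray) -> np.ndarray:
    """Return frame-to-frame differences; constant background cancels out.

    frame_feats: array of shape (T, D) -- T time steps, D feature dims.
    The difference at t=0 is defined as zero (nothing precedes frame 0).
    """
    diffs = np.zeros_like(frame_feats)
    diffs[1:] = frame_feats[1:] - frame_feats[:-1]
    return diffs

# A constant "hum" plus one sudden event at t=2:
hum = np.ones((5, 3)) * 0.7      # steady background noise, same every frame
event = np.zeros((5, 3))
event[2] = 1.0                   # a sudden laugh/frown at t=2
signal = hum + event

d = temporal_difference(signal)
# The steady hum subtracts away; only the onset of the event at t=2
# (and its end at t=3) survives in the differences.
```

Note that differencing removes anything constant, which is exactly why a steady hum or fixed lighting disappears while a sudden expression change stands out.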

Step B: The "Social Map" (Relation Subgraphs)

The Analogy: Imagine a conversation is a dance. Sometimes you are dancing with yourself (thinking about your own past words), and sometimes you are dancing with a partner (reacting to what they just said).

  • How it works: The system builds two separate maps (graphs):
    1. Intra-speaker: How does this person's current mood relate to what they said 5 minutes ago?
    2. Inter-speaker: How does this person react to what the other person just said?
  • The Benefit: By separating these two types of relationships, the AI understands the flow of the conversation much better than if it just looked at a giant, messy pile of data.
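The two-map idea above can be sketched as a simple edge-splitting routine. This is an illustrative toy, not the paper's exact graph construction: the `window` size and the function name are assumptions, and real models would attach features and edge weights to these graphs.

```python
def build_subgraphs(speakers, window=2):
    """Connect each utterance to the previous `window` utterances,
    routing each edge into the intra- or inter-speaker subgraph.

    speakers: list of speaker IDs, one per utterance, in order.
    Returns (intra_edges, inter_edges) as lists of (earlier, later) pairs.
    """
    intra, inter = [], []
    for i, s in enumerate(speakers):
        for j in range(max(0, i - window), i):
            if speakers[j] == s:
                intra.append((j, i))   # same speaker: "dancing with yourself"
            else:
                inter.append((j, i))   # different speaker: reacting to a partner
    return intra, inter

# Two people alternating turns: A, B, A, B
speakers = ["A", "B", "A", "B"]
intra, inter = build_subgraphs(speakers)
# intra: [(0, 2), (1, 3)]  -- each person's link to their own earlier turn
# inter: [(0, 1), (1, 2), (2, 3)]  -- each reply to the other person
```

Keeping the two edge lists separate is the whole point: a graph network can then learn different "reaction" rules for self-continuity versus partner responses, instead of averaging them together.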

Step C: The "Text-Led Diffusion" (Diffusion Attention Fusion)

The Analogy: Imagine the Text is the Captain of a ship, and Audio/Visual are the Crew.

  • Old Way: The Captain and the Crew all shout their opinions at once, and the AI tries to guess who is right.
  • New Way: The Captain (Text) is the boss. The system lets the Audio and Visual data "diffuse" (flow) into the Captain's understanding.
  • How it works: The AI uses the clear text as a guide. It asks, "Based on what was said, does this blurry face or mumbled voice make sense?" It then carefully blends the audio and video into the text stream, but only the parts that match the text's story. This ensures the messy data doesn't override the clear data.
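The "Captain asks the Crew" step can be sketched as text-queried cross-attention. This is a deliberately simplified stand-in for the paper's diffusion attention fusion: here the text features act as queries that decide how much of each audio and visual moment to absorb. All names and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_fusion(text, audio, visual):
    """text: (T, D) queries; audio, visual: (S, D) keys/values.

    For each text step, attention scores ask "how well does this
    audio/visual moment match what was said?" and only the matching
    parts are blended into the text stream.
    """
    d = text.shape[-1]
    fused = text.copy()
    for modality in (audio, visual):
        attn = softmax(text @ modality.T / np.sqrt(d))  # (T, S) match weights
        fused = fused + attn @ modality                 # absorb matching parts
    return fused

rng = np.random.default_rng(0)
text, audio, visual = (rng.normal(size=(4, 8)) for _ in range(3))
out = text_guided_fusion(text, audio, visual)
# out keeps the text stream's shape: (4, 8)
```

The design choice to make text the query (rather than concatenating all three streams equally) is what prevents a mumbled voice or blurry face from overriding the clear words.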

3. The Result: A Clearer Picture

When the researchers tested this system on real-world datasets (like movie scripts and TV show dialogues), it worked significantly better than previous methods.

  • It got rid of the static: By using the "Difference Detective," it ignored the background noise.
  • It understood the context: By using the "Social Map," it knew who was talking to whom.
  • It trusted the right source: By letting Text lead the "Diffusion," it ensured the final emotion guess was based on the most reliable information.

Summary

Think of this paper as teaching an AI to be a better listener in a noisy room. Instead of getting confused by the background noise or trying to guess the mood from a blurry face, it:

  1. Filters out the static by looking for changes.
  2. Maps out who is talking to whom.
  3. Uses the spoken words as the "truth anchor" to make sense of the messy sounds and images.

The result is an AI that can accurately tell if someone is happy, sad, or angry, even when the recording isn't perfect.