Robust Audio-Visual Target Speaker Extraction with Emotion-Aware Multiple Enrollment Fusion

This paper proposes a robust Audio-Visual Target Speaker Extraction framework that leverages emotion-aware multiple enrollment fusion. It demonstrates that training with high modality missing rates significantly improves stability against real-world signal loss, and that fusing a single-frame facial image with frame-level lip features yields the best results.

Zhan Jin, Bang Zeng, Peijun Yang, Jiarong Du, Wei Ju, Yao Tian, Juan Liu, Ming Li

Published Thu, 12 Ma

Imagine you are at a loud, chaotic party (the "cocktail party"). You want to hear your friend, Alice, clearly, but there are dozens of other people talking, music playing, and noise everywhere. Your brain is amazing at this; it can focus on Alice's voice and tune out the rest. This paper is about teaching a computer to do the exact same thing.

The task is called Audio-Visual Target Speaker Extraction. In simple terms, it's a computer program that listens to a noisy recording and tries to isolate just one person's voice.

Here is the breakdown of how this paper solves the problem, using some everyday analogies:

1. The Problem: "Blind" Computers

To help the computer find Alice, we usually give it a "hint" or a "clue" about what she looks or sounds like.

  • The Audio Clue: A recording of Alice's voice.
  • The Visual Clues:
    • Lip Movements: Watching her lips move (very precise, like reading lips).
    • Face: A photo of her face (tells us who she is).
    • Expressions: Her facial emotions (tells us how she feels).

The Catch: In the real world, things go wrong. Alice might turn her head, someone might walk in front of the camera, or the video might glitch. If the computer relies only on watching her lips, and the video freezes or gets blocked, the computer goes "blind" and fails to find her voice.

2. The Solution: The "Swiss Army Knife" Approach

The authors realized that relying on just one type of clue is risky. Instead, they built a system that uses multiple clues at once, like a Swiss Army knife. If one tool is broken, the others can still do the job.

They tested four types of clues:

  1. Lip Reading: Watching the mouth (Frame-level).
  2. Face ID: A steady photo of the face (Utterance-level).
  3. Emotions: Watching facial expressions (Frame-level).
  4. Voiceprint: A sample of her voice (Utterance-level).
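The frame-level vs. utterance-level distinction above is just about shape: frame-level clues give one vector per video frame, while utterance-level clues give a single vector for the whole clip. A minimal NumPy sketch (the dimensions are made up for illustration, not the paper's actual sizes):

```python
import numpy as np

# Hypothetical dimensions for illustration only (not the paper's exact sizes).
T, D = 100, 256  # 100 video frames, 256-dim embeddings

# Frame-level clues: one vector per frame, tracking change over time.
lip_feats     = np.random.randn(T, D)   # 1. lip reading
emotion_feats = np.random.randn(T, D)   # 3. facial expressions

# Utterance-level clues: a single static vector for the whole clip.
face_embed  = np.random.randn(D)        # 2. face ID from one photo
voice_embed = np.random.randn(D)        # 4. voiceprint from an enrollment recording
```

This shape mismatch is why fusing the two kinds of clues takes a little care: a static vector has to be aligned with a time series before they can be combined.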

3. The Secret Sauce: "Training in the Rain"

This is the most important part of the paper.

  • The Old Way: Imagine training a soccer player only on a perfect, sunny day on a pristine field. When they finally play a match in the pouring rain with mud everywhere, they slip and fall because they never practiced in bad conditions.

    • In the paper: If you train the computer on perfect videos where the face is always visible, it fails miserably when the video gets blocked.
  • The New Way (This Paper): The authors decided to train the computer in the rain. They deliberately covered up (occluded) 80% of the video frames during training. They forced the computer to learn how to find Alice even when she was mostly hidden.

    • The Result: When they tested the computer on real-world messy videos, it didn't panic. It knew how to use the little bit of lip movement it could see, combined with the photo of her face, to figure out who was speaking.
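The "training in the rain" idea can be sketched as a simple masking step applied to the video features during training. The function below zeroes out a random 80% of frames to simulate a blocked camera; this is an illustrative sketch of the high-missing-rate idea, not the paper's exact masking scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def occlude_frames(video_feats, missing_rate=0.8):
    """Simulate camera blockage by zeroing a random subset of frames.

    video_feats : (T, D) frame-level features (e.g. lip embeddings)
    missing_rate: fraction of frames to drop during training

    Illustrative only: the paper trains with high modality missing rates,
    but the precise masking scheme here is an assumption.
    """
    T = video_feats.shape[0]
    keep = rng.random(T) >= missing_rate       # True where the frame survives
    return video_feats * keep[:, None], keep

feats = np.ones((100, 32))                     # toy "lip" features
masked, keep = occlude_frames(feats, missing_rate=0.8)
# On average only ~20 of the 100 frames survive; the rest are zeroed.
```

Because the model sees mostly-hidden faces from day one, a blocked camera at test time looks like just another training example.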

4. The Best Combination: "The Photo + The Lip"

The researchers found that the best team wasn't all four clues at once. The winning combination was:

  • One steady photo of the face (to know who it is).
  • The moving lips (to know what they are saying).

They found that adding "emotions" (expressions) didn't help much, but adding the photo was a game-changer. It acted as a safety net. If the lip video got blocked, the computer could still remember, "Oh, that's Alice's face," and keep the voice clear.
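The "safety net" effect can be seen in a toy fusion example: if the static face embedding is repeated across frames and concatenated onto the lip features, then even on frames where the lip clue is wiped out, half of each fused vector still carries Alice's identity. (A minimal sketch, assuming simple broadcast-and-concatenate fusion; the paper's actual fusion module may differ.)

```python
import numpy as np

T, D = 6, 4
face_embed = np.full(D, 0.5)            # static face-ID clue (the "photo")
lip_feats  = np.ones((T, D))            # frame-level lip clue

# Simulate a blocked camera: lip information on frames 2-4 is lost.
lip_feats[2:5] = 0.0

# Repeat the static face clue across all frames and concatenate it on.
face_per_frame = np.broadcast_to(face_embed, (T, D))
fused = np.concatenate([lip_feats, face_per_frame], axis=-1)  # shape (T, 2*D)

# On blocked frames the lip half of the fused vector is all zeros,
# but the face half is intact, so the model never goes fully "blind".
```

This is why the photo acts as a safety net: it is the one clue that cannot be interrupted mid-utterance.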

5. The Takeaway

This paper teaches us that to build a robust AI for the real world, you can't just train it on perfect data. You have to simulate disasters during training.

By teaching the computer to handle missing information (like a blocked camera), they created a system that is:

  1. Strong: It works great when everything is perfect.
  2. Resilient: It keeps working even when the video is glitchy, blocked, or incomplete.

In a nutshell: They built a "super listener" that doesn't just listen; it watches, but it's smart enough to keep listening even when it can't see perfectly, because it was trained to expect the unexpected.