Emotion-Aware Prefix: Towards Explicit Emotion Control in Voice Conversion Models

This paper proposes an Emotion-Aware Prefix for a two-stage voice conversion backbone that significantly improves emotion conversion accuracy from 42.40% to 85.50% while preserving speaker identity, linguistic integrity, and speech quality through joint control of sequence modulation and acoustic realization.

Haoyuan Yang, Mu Yang, Jiamin Xie, Szu-Jui Chen, John H. L. Hansen


Imagine you have a magical recording studio where you can take something one person said and make it sound as if a different person said it, word for word. This is called Voice Conversion.

Now, imagine you want to do something even harder: you want that new voice to not just sound like the target person, but also to carry a specific emotion—like pure joy, deep sadness, or intense anger—matching an emotion reference you provide.

This is the problem the researchers at the University of Texas at Dallas tried to solve. Their paper, "Emotion-Aware Prefix," is like inventing a new set of "magic instructions" that tell the computer exactly how to inject emotion into a voice without breaking the voice itself.

Here is the breakdown using simple analogies:

1. The Problem: The "Good Actor, Bad Director"

Think of existing voice conversion models (like the one they improved, called VEVO) as a very talented actor.

  • The Talent: The actor is great at mimicking a specific person's voice (their accent, pitch, and tone).
  • The Flaw: If you ask the actor to "be happy," they might just smile a little. They don't fully become the character. They rely on vague hints from the recording, like "the speaker sounded energetic," but they miss the specific, high-intensity emotion you want. It's like asking a chef to "make it spicy," and they just add a pinch of salt instead of the hot sauce you wanted.

2. The Solution: The "Emotion-Aware Prefix"

The researchers added a special tool called the Emotion-Aware Prefix. Think of this as a Director's Script that is handed to the actor before they even start speaking.

  • How it works: Before the computer generates the voice, it looks at a reference clip (a sample of the target emotion). It extracts the "feeling" from that clip and turns it into a digital "instruction tag."
  • The "Deep-Prefix" Trick: Instead of just whispering this instruction once at the start, the researchers used a technique called Deep-Prefix Prompting. Imagine this as the Director standing next to the actor at every single step of the performance, constantly reminding them, "Remember, you are furious right now!" or "You are overjoyed!" This ensures the emotion stays consistent from the first word to the last.
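
To make the "Director's Script" concrete, here is a minimal PyTorch sketch of deep-prefix prompting: an emotion embedding pooled from the reference clip is projected into per-layer key/value "prefix" tokens, and every attention layer prepends them so each generation step still sees the instruction. All names and dimensions here (`EmotionPrefixEncoder`, `attend_with_prefix`) are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn


class EmotionPrefixEncoder(nn.Module):
    """Hypothetical module: turns one emotion embedding into per-layer prefixes."""

    def __init__(self, emo_dim: int, n_layers: int, n_prefix: int, d_model: int):
        super().__init__()
        self.n_layers, self.n_prefix, self.d_model = n_layers, n_prefix, d_model
        # One projection yields the keys and values for every layer at once.
        self.proj = nn.Sequential(
            nn.Linear(emo_dim, d_model),
            nn.Tanh(),
            nn.Linear(d_model, n_layers * 2 * n_prefix * d_model),
        )

    def forward(self, emo_emb: torch.Tensor) -> torch.Tensor:
        # emo_emb: (batch, emo_dim) -> (n_layers, 2, batch, n_prefix, d_model)
        b = emo_emb.size(0)
        out = self.proj(emo_emb).view(b, self.n_layers, 2, self.n_prefix, self.d_model)
        return out.permute(1, 2, 0, 3, 4)


def attend_with_prefix(q, k, v, prefix_k, prefix_v):
    """Single-head attention where the prefix is prepended to keys/values,
    so every query position also attends to the emotion instruction."""
    k = torch.cat([prefix_k, k], dim=1)   # (batch, n_prefix + seq, d_model)
    v = torch.cat([prefix_v, v], dim=1)
    attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
    return attn @ v


# Toy usage: 4 layers, 8 prefix tokens per layer, 256-dim model.
encoder = EmotionPrefixEncoder(emo_dim=128, n_layers=4, n_prefix=8, d_model=256)
emo_emb = torch.randn(2, 128)            # pooled from the emotion reference clip
prefixes = encoder(emo_emb)              # (n_layers, 2, batch, n_prefix, d_model)
x = torch.randn(2, 50, 256)              # hidden states inside one layer
out = attend_with_prefix(x, x, x, prefixes[0, 0], prefixes[0, 1])
print(out.shape)                         # torch.Size([2, 50, 256])
```

Because a fresh copy of the prefix is injected at every layer, the "reminder" never fades as the signal flows deeper through the network.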

3. The Two-Stage Factory

The system they improved works like a two-step factory:

  • Stage 1: The Architect (Sequence Modulation)
    This stage decides the structure of the speech. It's like the architect drawing the blueprint. It decides: "Is this sentence going to be shouted? Is it going to be whispered? Where are the pauses?"

    • The Discovery: The researchers found that this is where the biggest change happens. If you tell the Architect to "build a house of sadness," the whole structure changes.
  • Stage 2: The Builder (Acoustic Realization)
    This stage takes the blueprint and actually builds the house (the sound waves). It focuses on the texture and quality of the voice.

    • The Discovery: The Builder needs to know the emotion too, but if the Architect didn't get the memo, the Builder can't fix it on its own. When both the Architect and the Builder receive the same instructions, the emotion comes through clearly and consistently, as the sketch below illustrates.
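
Here is a toy end-to-end sketch of that idea, with the same emotion prefix handed to both the Architect and the Builder. The stage functions are stand-in stubs assumed purely for illustration; in the real system they would be the sequence model, the acoustic model, and a vocoder.

```python
import torch

# Stand-in stubs for the real models; shapes are arbitrary placeholders.
emo_encoder = lambda wav: torch.randn(1, 128)                     # "director's script"
stage1 = lambda content, emotion_prefix: torch.randint(0, 1024, (1, 50))
stage2 = lambda tokens, timbre, emotion_prefix: torch.randn(1, 80, 200)
vocoder = lambda mel: torch.randn(1, 16000)


def convert(source_wav, timbre_ref_wav, emotion_ref_wav):
    emo = emo_encoder(emotion_ref_wav)
    # Stage 1 (the Architect): plan the style/prosody sequence for the
    # source content, already conditioned on the emotion prefix.
    style_tokens = stage1(content=source_wav, emotion_prefix=emo)
    # Stage 2 (the Builder): render acoustic features from that plan,
    # copying timbre from the target speaker and reusing the SAME prefix,
    # so both stages are on the same page.
    mel = stage2(style_tokens, timbre=timbre_ref_wav, emotion_prefix=emo)
    return vocoder(mel)


wav = convert(torch.randn(1, 16000),   # what to say (source speech)
              torch.randn(1, 16000),   # who should say it (target speaker)
              torch.randn(1, 16000))   # how it should feel (emotion reference)
print(wav.shape)                       # torch.Size([1, 16000])
```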

4. The "Identity Shield" (Why the Voice Doesn't Break)

One of the biggest fears in voice conversion is that if you change the emotion too much, the voice will stop sounding like the target person. It might start sounding like a robot or a different person entirely.

The researchers found a secret ingredient: Acoustic Decoupling.

  • The Analogy: Imagine you are painting a portrait.
    • Without Decoupling: You try to paint the emotion and the person's face on the same wet canvas at the same time. If you add too much "anger" (red paint), the face gets distorted and looks like a monster.
    • With Decoupling (Their Method): You first sketch the face (the identity) on a sturdy canvas. Then, you paint the emotion (the red paint) on a separate, transparent layer on top. You can change the emotion as much as you want, but the face underneath remains perfectly intact.
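
Here is a toy illustration of that decoupling, under the assumption that identity flows through its own dedicated timbre pathway (a FiLM-style scale-and-shift in this sketch) while emotion enters only through the attention prefix. This is a sketch of the idea, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class DecoupledDecoderLayer(nn.Module):
    """Toy layer: emotion and identity condition the output on separate paths."""

    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.timbre_film = nn.Linear(d_model, 2 * d_model)  # identity pathway
        self.ff = nn.Linear(d_model, d_model)

    def forward(self, x, emo_prefix, timbre):
        # Emotion pathway: hidden states attend over [prefix ; themselves].
        ctx = torch.cat([emo_prefix, x], dim=1)
        x = x + self.attn(x, ctx, ctx)[0]
        # Identity pathway: the timbre vector applies a scale-and-shift that
        # the emotion prefix never touches.
        scale, shift = self.timbre_film(timbre).chunk(2, dim=-1)
        return self.ff(x * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1))


layer = DecoupledDecoderLayer(d_model=64)
x = torch.randn(2, 40, 64)            # hidden states
emo_prefix = torch.randn(2, 8, 64)    # from the emotion reference
timbre = torch.randn(2, 64)           # from the target-speaker reference
print(layer(x, emo_prefix, timbre).shape)  # torch.Size([2, 40, 64])
```

Because the emotion prefix never feeds into the timbre pathway, turning the emotion "knob" up or down leaves the face on the canvas untouched.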

5. The Results: A Massive Leap

Before this new method, the baseline model got the emotion right only 42.4% of the time.
After adding the Emotion-Aware Prefix with deep-prefix prompting (the "Director's Script"), the success rate jumped to 85.5%.

  • What this means: The computer can now reliably make a voice sound genuinely happy, sad, or angry, while still sounding exactly like the person you asked it to mimic.

Summary

In short, the researchers built a smart instruction manual (the Prefix) that tells a voice-conversion AI exactly how to feel. They discovered that you need to control the "big picture" (the structure) and the "details" (the sound) at the same time, and you need to keep the person's identity separate from the emotion so the voice doesn't get lost.

It's the difference between a robot reading a script with a flat voice and a method actor who can truly feel the lines while sounding exactly like the character you chose.