Causal Prosody Mediation for Text-to-Speech: Counterfactual Training of Duration, Pitch, and Energy in FastSpeech2

This paper proposes a novel causal prosody mediation framework for FastSpeech2 that employs counterfactual training and structural causal modeling to disentangle emotional prosody from linguistic content, thereby enabling improved controllability, expressiveness, and emotion rendering in text-to-speech synthesis.

Suvendu Sekhar Mohanty

Published 2026-03-13

Imagine you are directing a play. You have a script (the text), a specific actor (the speaker), and a mood for the scene, like "angry" or "sad" (the emotion).

In old-school computer speech (Text-to-Speech or TTS), the computer was like a robot actor who could read the script perfectly but struggled to act. If you told it to be "angry," it might just shout the words, or worse, it might change the actor's voice entirely, making it sound like a different person, or it might mumble the words so you can't understand them. It was a messy mix where the emotion, the voice, and the words got tangled up together.

This paper introduces a new way to teach computers how to act: Causal Prosody Mediation.

Here is the simple breakdown of how it works, using some everyday analogies:

1. The Problem: The "Magic Wand" vs. The "Director"

In the old way, if you wanted the computer to sound angry, you might just hand it a "Magic Wand" labeled "Angry." The computer would grab the wand and wave it, but it didn't know how to use it. It might change the actor's face (voice quality), stumble over the words (intelligibility), or just scream randomly. It was a "black box" where the emotion went in, and a weird sound came out, but no one knew exactly what changed.

2. The Solution: The "Middleman" (The Mediator)

The authors realized that in real life, when a human gets angry, they don't magically change their DNA. Instead, they change their prosody:

  • Duration: They speak faster or slower.
  • Pitch: Their voice goes higher or lower.
  • Energy: They speak louder or softer.

The paper treats these three things (Duration, Pitch, Energy) as a strict middleman.

  • The Rule: The "Emotion" is only allowed to talk to the "Speech" through this middleman.
  • The Analogy: Imagine the Emotion is a boss, the Speech is a worker, and the Prosody is a messenger. The boss (Emotion) is forbidden from talking directly to the worker (Speech). The boss must write a note to the messenger (Prosody), and the messenger delivers the instructions to the worker.
    • Boss: "Be Angry!"
    • Messenger: "Okay, I will tell the worker to speak faster, louder, and higher."
    • Worker: (Speaks faster, louder, higher).

If the boss tries to whisper a secret directly to the worker (changing the voice timbre or messing up the words), the system catches it and says, "No! You must go through the messenger!"
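The boss/messenger/worker rule can be sketched as a toy structural causal model. This is a minimal illustration, not the paper's actual architecture: the function names, the multiplicative scales, and the emotion table are all hypothetical. The key structural point is that `synthesize` has no `emotion` argument, so there is simply no direct path from emotion to speech.

```python
# Toy structural causal model of the mediator rule: the emotion label may
# only influence the speech output through the prosody variables
# (duration, pitch, energy). All names and numbers are illustrative.

# Hypothetical "messenger notes" for each emotion: multiplicative
# adjustments to (duration, pitch, energy).
PROSODY_NOTES = {
    "neutral": (1.0, 1.0, 1.0),
    "angry":   (0.8, 1.3, 1.4),   # faster, higher, louder
    "sad":     (1.3, 0.8, 0.7),   # slower, lower, softer
}

def mediator(emotion):
    """Emotion -> prosody: the only path the emotion is allowed to take."""
    return PROSODY_NOTES[emotion]

def synthesize(text, speaker_id, prosody):
    """Prosody + content -> speech. There is no `emotion` argument here,
    so no direct Emotion -> Speech edge exists to leak through."""
    dur, pitch, energy = prosody
    # Stand-in for a real acoustic model: describe the rendered speech.
    return {
        "text": text,               # linguistic content, untouched
        "speaker": speaker_id,      # timbre/identity, untouched
        "duration_scale": dur,
        "pitch_scale": pitch,
        "energy_scale": energy,
    }

angry = synthesize("I love ice cream", speaker_id=7, prosody=mediator("angry"))
sad   = synthesize("I love ice cream", speaker_id=7, prosody=mediator("sad"))

# Swapping only the emotion changes only the prosody fields:
assert angry["text"] == sad["text"]
assert angry["speaker"] == sad["speaker"]
assert angry["pitch_scale"] != sad["pitch_scale"]
```

Because the emotion can only write to the prosody table, the words and the speaker identity are invariant by construction, which is exactly the "you must go through the messenger" rule.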

3. The Training: "What If?" Scenarios

How do you teach a computer this rule? You use Counterfactual Training, which is a fancy term for playing "What If?" games.

During training, the computer plays a game:

  • Scenario A: The computer hears the sentence "I love ice cream" said in a Happy voice.
  • Scenario B (The Counterfactual): The computer asks, "What if I said the exact same sentence with the exact same voice, but in a Sad mood?"

The computer tries to answer this.

  • The Mistake: If the computer just changes the voice to sound like a different person, it fails.
  • The Lesson: The computer learns that to make it sound "Sad," it must only change the messenger's notes (make the pitch lower, the speed slower, the energy softer). It learns that the words "I love ice cream" must stay exactly the same, and the actor's voice must stay the same. Only the delivery changes.
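The "What If?" game can be sketched as a training penalty. This is a hedged, simplified sketch under the assumption (hypothetical here) that the model exposes separate content, speaker, and prosody feature vectors; the `counterfactual_loss` function, the vectors, and the margin value are all illustrative, not the paper's actual objective.

```python
# Toy counterfactual check: re-render the same sentence and speaker under a
# swapped emotion, then penalize any change OUTSIDE the prosody channel.
# All feature vectors and the margin are made-up illustrative values.

def l2(a, b):
    """Squared L2 distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def counterfactual_loss(factual, counterfactual):
    """factual / counterfactual: dicts with 'content', 'speaker', 'prosody'
    features for the same text + speaker under two emotion labels."""
    # Content and speaker must be invariant under the emotion swap...
    leakage = (l2(factual["content"], counterfactual["content"])
               + l2(factual["speaker"], counterfactual["speaker"]))
    # ...while prosody is *expected* to move; use a hinge so the model is
    # rewarded for changing the delivery by at least `margin`.
    prosody_change = l2(factual["prosody"], counterfactual["prosody"])
    margin = 1.0  # hypothetical target amount of prosodic change
    return leakage + max(0.0, margin - prosody_change)

# "I love ice cream" rendered Happy vs. the counterfactual Sad version:
happy = {"content": [0.2, 0.5], "speaker": [1.0, 0.0], "prosody": [1.1, 1.2]}
sad   = {"content": [0.2, 0.5], "speaker": [1.0, 0.0], "prosody": [0.8, 0.7]}

loss = counterfactual_loss(happy, sad)
# Content and speaker are identical, so the leakage term is zero; the only
# remaining loss comes from the prosody not yet moving the full margin.
```

If the model had instead "cheated" by altering the content or speaker vectors, the leakage term would dominate the loss, which is the training signal that teaches it to route emotion exclusively through the messenger.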

4. The Result: A Controllable Actor

Because of this strict training, the new system (called CPM) is amazing at two things:

  1. Switching Moods: You can take a sentence spoken by a specific actor and instantly make them sound happy, angry, or sleepy without changing their voice or messing up the words.
  2. Staying Clear: Because the computer isn't allowed to mess with the words to show emotion, the speech remains very clear and easy to understand (low "Word Error Rate").

Summary Analogy: The Paint Mixer

Think of the old TTS models as a paint mixer where you pour in "Red" (Anger) and "Blue" (Voice) and "White" (Text). Sometimes the Red paint leaks into the White paint, turning the words pink and unreadable.

This new method builds a filter between the colors.

  • You pour in "Red" (Anger).
  • The filter forces the Red to only mix with the "Brush Strokes" (Pitch/Speed/Loudness).
  • The "White Paint" (The Words) and the "Canvas" (The Speaker's Voice) stay pure and untouched.

The Bottom Line:
This paper teaches computers that emotion is about how you say something, not what you say or who is saying it. By forcing the computer to follow this rule using "What If?" games, they created a speech system that is not only more expressive and natural but also much easier to control and understand.