Imagine you are a bouncer at an exclusive club. Your job is to tell the difference between real human guests and high-tech robots trying to sneak in by mimicking human voices.
For a long time, bouncers (AI detectors) have been very good at spotting "standard" robots. They know the usual robotic glitches. But recently, the bad guys started using "Emotional Robots." These new robots don't just sound like humans; they sound like humans who are happy, sad, angry, or whispering. They are so good at acting that the old bouncers get confused and let them in.
This paper introduces a new, super-smart bouncer named ProSDD. Here is how it works, using simple analogies:
The Problem: Learning from the Wrong Book
Most current AI detectors are trained by showing them thousands of labeled examples of real and fake voices and asking, "Is this fake?"
- The Flaw: It's like teaching a student to spot a fake painting only by showing them bad forgeries. The student learns to look for specific "mistakes" in those forgeries. But if the forger changes their style (like adding emotion), the student fails because they didn't learn what a real masterpiece actually feels like. They learned the "glitches" of the fake, not the "soul" of the real.
The Solution: ProSDD's Two-Stage Training
The authors realized that humans don't spot fakes by memorizing glitches; we spot them because we have an internal sense of how real human voices should flow. We know how a real voice changes pitch when someone is excited or tired.
ProSDD mimics this human intuition in two stages:
Stage 1: The "Real Voice" Boot Camp
Before the AI ever sees a fake voice, it goes to a special school where it only listens to real humans.
- The Analogy: Imagine a music student who spends months just listening to real jazz musicians. The teacher doesn't ask, "Is this jazz?" Instead, the teacher says, "Listen to this singer. If I cover up a part of the song, can you guess what the pitch and energy should be next, based on who is singing?"
- The Goal: The AI learns the "muscle memory" of real human speech. It learns how a real voice naturally wiggles, rises, and falls (prosody) depending on the speaker's mood. It builds a deep, internal map of what "real" feels like.
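To make the "cover up part of the song and guess what comes next" idea concrete, here is a toy numpy sketch of a masked-prosody pretext task. The paper's actual model and loss are more sophisticated; the linear-interpolation "guess," the contour values, and the function name here are all illustrative inventions, not the authors' method.

```python
import numpy as np

def masked_prosody_loss(pitch, mask_start, mask_len):
    """Toy pretext task: hide a span of the pitch contour and score a
    context-based guess against the true hidden values (MSE).
    A real model would use a learned predictor; here the 'guess' is
    just a straight line drawn between the visible edge frames."""
    pitch = np.asarray(pitch, dtype=float)
    left = pitch[mask_start - 1]            # last visible frame before the gap
    right = pitch[mask_start + mask_len]    # first visible frame after the gap
    # Guess the hidden frames by interpolating between the two edges.
    guess = np.linspace(left, right, mask_len + 2)[1:-1]
    true = pitch[mask_start : mask_start + mask_len]
    return float(np.mean((guess - true) ** 2))

# A smooth, human-like pitch rise and fall is easy to predict from context...
smooth = np.sin(np.linspace(0, np.pi, 20)) * 50 + 120
# ...while an erratic, "unnatural" contour in the same spot is not.
jumpy = smooth.copy()
jumpy[8:12] += np.array([40.0, -35.0, 50.0, -45.0])

print(masked_prosody_loss(smooth, 8, 4) < masked_prosody_loss(jumpy, 8, 4))  # True
```

Because training only ever uses real speech, the model's "muscle memory" is calibrated to natural contours: the smoother and more human-like the prosody, the lower this prediction error.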
Stage 2: The "Detective" Exam
Now, the AI is ready to take the test. It is shown both real and fake voices.
- The Analogy: The AI is now a detective. It still has to guess "Real or Fake?" But here's the trick: It has to keep doing its Stage 1 homework at the same time.
- Every time it looks at a voice, it asks two questions:
  - "Is this fake?" (The main job).
  - "Does this voice follow the natural rules of human emotion and pitch I learned in Stage 1?" (The homework).
- If a voice sounds fake but also breaks the natural rules of human emotion (e.g., the pitch jumps in a way a human never would), the AI catches it immediately.
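The "keep doing the homework" trick is a multi-task objective: the main real-vs-fake loss plus the Stage 1 prosody term kept on as an auxiliary. A minimal numpy sketch, assuming the paper uses some weighted sum of the two losses (the `aux_weight` knob, function name, and toy numbers are my own, not from the paper):

```python
import numpy as np

def two_stage_objective(p_fake, is_fake, prosody_pred, prosody_true, aux_weight=0.3):
    """Toy multi-task loss: the detector's main verdict loss plus the
    Stage-1 prosody 'homework' as an auxiliary regression term.
    `aux_weight` is a made-up balancing knob."""
    eps = 1e-9
    # Main job: binary cross-entropy on the real/fake verdict.
    bce = -(is_fake * np.log(p_fake + eps) + (1 - is_fake) * np.log(1 - p_fake + eps))
    # Homework: how far the observed prosody strays from what the
    # Stage-1 model predicts a natural human contour should be.
    prosody_mse = np.mean((np.asarray(prosody_pred) - np.asarray(prosody_true)) ** 2)
    return float(bce + aux_weight * prosody_mse)

# Same verdict confidence, but the second clip's pitch violates the
# natural rules learned in Stage 1, so its total loss is higher.
natural = two_stage_objective(0.9, 1, [120.0, 122.0], [121.0, 123.0])
unnatural = two_stage_objective(0.9, 1, [120.0, 122.0], [180.0, 60.0])
print(natural < unnatural)  # True
```

The auxiliary term is what gives the detector its edge on emotional fakes: even when the fake "sounds" convincing to the classifier head, an inhuman pitch jump still shows up as a large prosody error.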
Why This is a Game-Changer
The paper tested ProSDD against the world's toughest challenges, including "EmoFake" (emotional fakes) and "EmoSpoof" (fakes with different speaking styles).
- The Old Way: When faced with emotional fakes, the old detectors got confused and failed miserably (like a bouncer letting in a robot wearing a very convincing costume).
- ProSDD: Because it learned the "soul" of real speech first, it could spot the emotional fakes easily. It reduced the error rate by huge margins (sometimes cutting mistakes in half or more).
The Big Takeaway
The secret sauce isn't a more complex computer brain or a bigger list of "fake" examples. It's teaching the AI to appreciate real human speech first.
By forcing the AI to understand the natural, messy, emotional flow of real human voices before it tries to catch liars, it becomes much harder for a fake voice to fool it. It's the difference between memorizing a list of "bad words" and truly understanding the English language.
In short: ProSDD doesn't just learn what a fake looks like; it learns what a real human sounds like, making it nearly impossible for a robot to pretend to be human.