Affect Decoding in Phonated and Silent Speech Production from Surface EMG

This paper introduces a new dataset and shows that surface electromyography (sEMG) signals from facial and neck muscles can be used to reliably decode affective states, particularly frustration, during both phonated and silent speech, highlighting their potential for affect-aware silent speech interfaces.

Simon Pistrosch, Kleanthis Avramidis, Tiantian Feng, Jihwan Lee, Monica Gonzalez-Machorro, Shrikanth Narayanan, Björn W. Schuller

Published Fri, 13 Ma

Imagine you are trying to guess how someone is feeling just by watching their face, but they aren't saying a word. Or, imagine trying to guess their mood even when they are shouting, whispering, or just moving their lips silently.

This paper is like a detective story where the investigators use muscle sensors (like tiny stethoscopes for muscles) to figure out if a person is frustrated, polite, or neutral, even when they aren't speaking out loud.

Here is the breakdown of their adventure, explained with some everyday analogies:

1. The Big Question: Can Muscles "Talk" About Feelings?

Usually, when we want to know how someone feels, we listen to their voice. If they sound angry, we assume they are angry. But what if they can't speak? What if they are in a noisy room, or they have lost their voice, or they are just mouthing words silently?

The researchers asked: Do the tiny muscles in our face and neck change their "dance moves" when we feel frustrated or polite, even if no sound comes out?

2. The Experiment: The "Silent Movie" vs. The "Loud Movie"

To find out, they gathered 12 volunteers and put small sensors on their faces and necks. These sensors act like high-tech fitness trackers for your jaw and throat, recording every tiny twitch of your muscles.

They asked the volunteers to do three things:

  • The Scripted Scenes (Tasks 1 & 3): They read sentences from a screen. Sometimes they had to say them normally, sometimes with a "polite" tone, and sometimes with a "frustrated" tone. Crucially, they did this twice: once out loud (like a normal movie) and once silently (like a silent movie, just mouthing the words).
  • The Improv Scene (Task 2): They had a fake conversation with a computer agent. The agent was programmed to be either super nice or super annoying to make the volunteers naturally feel polite or frustrated.

3. The Discovery: The "Muscle Signature"

The researchers analyzed the muscle data and found some cool things:

  • Muscles Know the Mood: Even when people were just moving their lips without making a sound, their muscles still showed a clear "signature" of frustration. It's like how a dancer's body language changes when they are sad versus happy, even if they aren't speaking.
  • Frustration is Loud (Even Silently): The sensors were really good at spotting frustration. They could tell a frustrated speaker from a neutral one with about 85% accuracy, even when the person was silent.
  • The "Silent" Advantage: Interestingly, the sensors worked just as well (sometimes even better) when people were silent compared to when they were shouting. This suggests that the "feeling" is built into the movement of the speech, not just the sound.

4. The Challenge: Everyone is Different

Here is the tricky part: People are weird.
When the computer tried to learn from Person A and then guess Person B's mood, it got a bit confused. It's like learning to read one friend's handwriting and then being handed a stranger's note: the words are the same, but the style is different.

  • The "One-Size-Fits-All" Problem: The sensors were great at guessing your mood if they had seen you before. But guessing a stranger's mood was harder because everyone's face and neck muscles move differently.
  • The "Silent Speech" Hope: However, the study found that if you train the computer on normal speaking, it can actually understand silent speaking pretty well. This is huge for future technology!

5. Why Does This Matter? (The Real-World Superpower)

Why should we care about reading silent muscle movements?

  • For People Who Can't Speak: Imagine someone who has lost their voice due to surgery or illness. They could still "speak" silently, and this technology could not only read the words but also tell the listener, "Hey, they are actually really frustrated right now," adding emotional depth to their communication.
  • For Noisy Environments: If you are in a loud factory or a crowded party, your voice might get lost. But your muscles don't care about the noise. This tech could let you communicate your feelings clearly even in a hurricane.
  • For Better AI: Current voice assistants (like Siri or Alexa) only hear the words. They don't know if you are annoyed or happy. This research helps build AI that understands the feeling behind the words, making interactions feel more human.

The Bottom Line

This paper shows that emotions are physical. They aren't just in your voice; they are written in the tiny movements of your face and neck. Even when you are silent, your muscles are still "talking" about how you feel.

The researchers are essentially building a universal translator for human emotion, one that works whether you are shouting, whispering, or saying nothing at all.