Imagine you are trying to figure out exactly how a person is shaping their mouth, tongue, and throat just by listening to their voice. It is like trying to guess the shape of a unique, squishy mold inside a speaker's head purely from the sound it makes. This is the challenge of Acoustic-to-Articulatory Inversion.
In this paper, the researchers set out to answer a very practical question: to get the best "guess" of the mouth shape, is it better to listen to the raw sound directly, or to first translate that sound into a sequence of phonetic units and guess from those?
Here is the story of their experiment, explained simply.
The Setup: The "Mouth Scanner"
The researchers had a special dataset. They recorded a French woman speaking while inside an MRI machine (a giant camera that takes pictures of the inside of your body).
- The Audio: They recorded her voice.
- The Video: They took high-speed "movies" of her mouth moving inside her head.
- The Goal: They wanted to teach a computer to look at the audio and predict the shape of the mouth in the video, without needing the video to exist.
The Four Competitors
To see which method worked best, they trained four different computer "students" (models) to do the job. Think of these students as detectives trying to solve the mystery of the mouth shape.
The "Raw Ear" Student (The Baseline):
- Method: This student just listens to the raw sound waves (specifically, a type of audio fingerprint called MFCCs, or Mel-Frequency Cepstral Coefficients). It doesn't know what words are being spoken; it just hears the "texture" of the sound.
- Analogy: Imagine trying to guess the shape of a pipe by listening to the wind whistle through it. You don't know the words, just the sound.
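The paper itself contains no code, but the baseline's "audio fingerprint" can be sketched. Below is a minimal, numpy-only walk through the classic MFCC recipe (frame the signal, take a power spectrum, pool it through a triangular mel filterbank, take logs, then a DCT). Real systems would normally call a library such as librosa; the parameter values here are common defaults, not the paper's.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Classic MFCC pipeline: frame -> window -> power spectrum
    -> mel filterbank -> log -> DCT-II."""
    # Slice the signal into overlapping frames and apply a Hamming window.
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    frames = frames * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2  # (n_frames, n_fft//2+1)

    # Triangular mel filterbank between 0 Hz and the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)  # (n_frames, n_mels)

    # DCT-II over the mel axis; keep the first n_ceps coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return log_mel @ basis.T  # (n_frames, n_ceps)

# Toy input: one second of a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
feats = mfcc(np.sin(2 * np.pi * 440 * t), sr=sr)
print(feats.shape)  # one 13-number fingerprint every 10 ms
```

Each row is one 10 ms "fingerprint" of the sound's texture, which is exactly what the Raw Ear student listens to.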
The "Fast Typist" Student (Wav2Vec 2.0):
- Method: This student uses a super-smart AI to instantly transcribe the speech into a list of sounds (phonemes). It's fast but makes occasional mistakes because it's automatic.
- Analogy: A speech-to-text app that types out what you said. It's quick, but sometimes it misses a nuance or gets a sound slightly wrong.
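Wav2Vec 2.0 itself is a large pretrained network, but the step that turns its frame-by-frame output into a phoneme list is simple and worth seeing. This is a sketch of greedy CTC decoding with a made-up four-symbol label set and made-up frame scores; it is not the paper's code.

```python
import numpy as np

# Hypothetical label set: index 0 is the special CTC "blank" symbol.
LABELS = ["<blank>", "b", "o", "n"]

def ctc_greedy_decode(logits, labels):
    """Greedy CTC decoding: take the best label per frame,
    collapse consecutive repeats, then drop blanks."""
    ids = logits.argmax(axis=1)
    out = []
    prev = -1
    for i in ids:
        if i != prev and i != 0:  # a new, non-blank label starts here
            out.append(labels[i])
        prev = i
    return out

# Made-up per-frame scores for 7 frames (rows) over 4 labels (cols).
logits = np.array([
    [0.1, 0.9, 0.0, 0.0],   # "b"
    [0.1, 0.8, 0.1, 0.0],   # "b" again (collapsed)
    [0.9, 0.0, 0.1, 0.0],   # blank
    [0.0, 0.1, 0.9, 0.0],   # "o"
    [0.0, 0.0, 0.7, 0.3],   # "o" again (collapsed)
    [0.0, 0.0, 0.1, 0.9],   # "n"
    [0.8, 0.1, 0.0, 0.1],   # blank
])
print(ctc_greedy_decode(logits, LABELS))  # -> ['b', 'o', 'n']
```

Notice that any frame where the model's best guess is slightly off survives into the output: that is where the Fast Typist's occasional mistakes come from.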
The "Strict Librarian" Student (Astali):
- Method: This student takes the text and forces it to line up perfectly with the audio timing. It breaks the speech into neat, rigid blocks of sounds.
- Analogy: A librarian who cuts a sentence into individual words and pastes them onto a timeline. It's very organized, but it treats every sound like a separate, rigid brick, ignoring how sounds blend together.
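What the librarian hands to the prediction model can be sketched concretely. The segment values below are hypothetical, but the idea is faithful: each phoneme becomes a rigid block, and every 10 ms frame inside a block gets the identical one-hot vector.

```python
import numpy as np

def to_frames(segments, frame_rate=100):
    """Turn rigid (phoneme, start_s, end_s) blocks into per-frame
    one-hot features, one frame every 10 ms at frame_rate=100."""
    phones = sorted({p for p, _, _ in segments})
    index = {p: i for i, p in enumerate(phones)}
    n_frames = int(round(segments[-1][2] * frame_rate))
    feats = np.zeros((n_frames, len(phones)))
    for p, start, end in segments:
        a, b = int(round(start * frame_rate)), int(round(end * frame_rate))
        feats[a:b, index[p]] = 1.0
    return feats, phones

# Hypothetical forced-alignment output for a short word.
feats, phones = to_frames([("b", 0.00, 0.08),
                           ("o", 0.08, 0.25),
                           ("n", 0.25, 0.40)])
print(feats.shape)  # (40, 3): 40 frames, 3 phoneme classes
```

Every frame inside a block carries exactly the same vector, so this representation cannot describe a tongue that is already gliding toward the next sound. That is the "rigid brick" problem in one line of output.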
The "Expert Editor" Student (Manual Correction):
- Method: This student takes the "Librarian's" work and has a human expert go through it, fixing the timing and splitting tricky sounds (like the pause before a "p" sound) that the computer missed.
- Analogy: A human editor who fixes the librarian's work, ensuring every sound starts and stops exactly where it should.
The Race Results
The researchers let these students try to predict the mouth shapes and measured how far off their guesses were (in millimeters).
The Winner: The "Raw Ear" Student (the Baseline) won easily.
- It was the most accurate, guessing the mouth shape with an average error of only 1.51 mm.
- Why? Because speech is a continuous, flowing river of sound. The "Raw Ear" student could hear the subtle transitions where neighboring sounds blend into each other (coarticulation). It never had to stop and force the sound into a labeled box to understand it.
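The 1.51 mm figure is an average distance between predicted and true contour points. The paper's exact metric may differ in detail, but a mean point-to-point Euclidean error looks roughly like this (the coordinates are made up):

```python
import numpy as np

def mean_contour_error(pred, true):
    """Mean Euclidean distance (in mm) between corresponding
    predicted and ground-truth contour points."""
    return float(np.mean(np.linalg.norm(pred - true, axis=-1)))

# Made-up example: five tongue-contour points in (x, y) mm coordinates.
true = np.array([[0.0, 0.0], [10.0, 5.0], [20.0, 8.0], [30.0, 7.0], [40.0, 3.0]])
pred = true + np.array([1.5, 0.0])     # every point off by 1.5 mm in x
print(mean_contour_error(pred, true))  # -> 1.5
```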
The Runner-Up: The "Expert Editor" Student came in second.
- It was the best of the "phonetic" group, but still slightly worse than the Raw Ear.
- Why? Even with a human fixing the timing, breaking speech into discrete "blocks" of sounds throws away some of the fine details. It's like trying to describe a smooth curve using only straight Lego bricks; you can get close, but you'll never be perfect.
The Losers: The "Fast Typist" and "Strict Librarian" students did the worst.
- They were either too error-prone or too rigid. The "Fast Typist" made transcription mistakes, and the "Librarian" treated sounds like separate boxes rather than a flowing stream.
The Big Lesson
The paper teaches us a valuable lesson about information loss.
Think of the speech signal as a high-resolution photograph.
- The Phonetic approach is like taking that photo and turning it into a low-resolution pixelated cartoon. You lose the fine details (the smooth curves of the tongue) to get a simple, easy-to-read image (the list of sounds).
- The Raw Audio approach keeps the high-resolution photo.
The researchers found that keeping the high-resolution photo (the raw sound) is better for predicting the mouth shape than trying to work with the simplified cartoon (the phonetic list).
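The photo-versus-cartoon analogy can be made concrete with a toy experiment (mine, not the paper's): take a smooth "articulator trajectory", replace it with one constant value per block the way a phonetic segmentation does, and measure what is lost. More blocks shrink the error, but it never reaches zero.

```python
import numpy as np

# A smooth stand-in for an articulator trajectory: tongue height over 1 s.
t = np.linspace(0.0, 1.0, 1000)
traj = np.sin(2 * np.pi * 2 * t)

def blockify(x, n_blocks):
    """Replace the signal with one constant value per block:
    the piecewise 'Lego brick' version of the trajectory."""
    out = x.copy()
    for chunk in np.array_split(np.arange(len(x)), n_blocks):
        out[chunk] = x[chunk].mean()
    return out

for n in (5, 20, 100):
    err = np.sqrt(np.mean((traj - blockify(traj, n)) ** 2))
    print(f"{n:3d} blocks -> RMS error {err:.3f}")
```

The error shrinks as the blocks get smaller, but a staircase never equals a curve: some detail is lost the moment you discretize.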
However, there is a silver lining: If you must use the cartoon (phonetics), it helps a lot to have a human expert fix the drawing. The "Expert Editor" did much better than the automatic machines, proving that accuracy in timing and human insight still matters, even if it can't quite beat the raw sound data.
In a Nutshell
If you want to reconstruct a speaker's mouth movements from their voice, don't try to translate the voice into words first. Just listen to the sound itself. The sound holds all the secret clues that get lost when you try to break it down into a list of sounds. But, if you do have to use a list of sounds, make sure a human expert double-checks it!