Acoustic-to-Articulatory Inversion of Clean Speech Using an MRI-Trained Model

This study demonstrates that acoustic-to-articulatory inversion models trained on denoised MRI data can effectively reconstruct vocal tract shapes from clean speech, achieving performance comparable to MRI-based methods with an RMSE of 1.56 mm.

Sofiane Azzouz, Pierre-André Vuissoz, Yves Laprie

Published Fri, 13 Ma

Imagine you are trying to figure out exactly how someone's mouth, tongue, and throat are moving just by listening to their voice. This is called Acoustic-to-Articulatory Inversion. It's like being a detective who can look at a fingerprint (the sound) and perfectly reconstruct the hand that made it (the mouth movements).

For a long time, scientists tried to solve this puzzle, but they needed a "gold standard" to train their computer models. Usually, this meant putting a microphone inside a giant, loud MRI machine while a person spoke. The MRI camera could see the tongue moving in real-time, providing the perfect answer key.

However, there was a huge problem: MRI machines are incredibly loud. The audio recorded inside them is full of static and scanner noise, like trying to hear a whisper in a rock concert. Even after cleaning up the audio, it still sounds "robotic" and unnatural compared to normal speech.

The Big Question

The researchers asked: "Can we train our detective using the noisy MRI audio, but then have it solve cases using clean, quiet speech recorded in a normal room?"

If the answer is yes, we can finally use this technology in real life (like in voice assistants or medical apps) without needing a giant MRI machine.

The Experiment: The "Twin" Speakers

To test this, the team used a clever setup:

  1. The MRI Twin: A woman spoke a list of sentences inside the MRI machine. The camera recorded her tongue, and the microphone recorded her noisy voice.
  2. The Clean Twin: The same woman, speaking the exact same sentences, but in a quiet, soundproof room.

Because it was the same person saying the same words, the researchers could line up the two recordings perfectly, like matching two different maps of the same city.

The Training Methods

They tried three different ways to train their AI model:

  1. The Perfect Match (M2M): Train on noisy MRI audio, test on noisy MRI audio. (This is the "control group" and the best possible score).
  2. The Mismatch (M2C): Train on noisy MRI audio, but test on clean, quiet audio. (This is what happens if you try to use a model trained in an MRI machine in the real world without fixing it).
  3. The Clean Swap (C2C): Train on clean audio, test on clean audio. (This is the goal: a model that works in the real world).
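
In code, the three pairings boil down to which dataset feeds training and which feeds testing. Here is a minimal sketch in Python; the abbreviations M2M, M2C, and C2C come from the paper, but the dataset names and the placeholder `evaluate` function are purely illustrative:

```python
# Hypothetical sketch of the three train/test pairings. `evaluate` is a
# stand-in for actually training and scoring an inversion model.
def evaluate(train_set: str, test_set: str) -> str:
    # A real implementation would train on `train_set` audio paired with
    # MRI tongue contours, then report RMSE on `test_set` audio.
    return f"train on {train_set}, test on {test_set}"

CONFIGS = {
    "M2M": ("mri_audio", "mri_audio"),      # control: best-case score
    "M2C": ("mri_audio", "clean_audio"),    # the mismatch condition
    "C2C": ("clean_audio", "clean_audio"),  # the real-world goal
}

for name, (train, test) in CONFIGS.items():
    print(f"{name}: {evaluate(train, test)}")
```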

The Secret Sauce: Phonetic Alignment

The tricky part was that the woman spoke slightly faster or slower in the MRI machine compared to the quiet room. To fix this, the researchers didn't just match the sounds; they matched the phonemes (the tiny building blocks of speech, like "b," "a," or "t").

Think of it like syncing two versions of a movie that play at slightly different speeds. Instead of stretching the whole film uniformly, they adjusted the timing at every individual sound so the actors' lips matched the dialogue throughout. This ensured the AI learned the right connection between each sound and the corresponding tongue movement, regardless of speaking rate.
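
Mechanically, that word-by-word syncing amounts to piecewise-linear time warping between matching phoneme boundaries in the two recordings. A minimal sketch, assuming each recording comes with a list of phoneme boundary times (the boundary values and function name here are illustrative, not from the paper):

```python
from bisect import bisect_right

def warp_time(t, src_bounds, dst_bounds):
    """Map a time in one recording onto the other's timeline by
    stretching each phoneme segment independently. `src_bounds` and
    `dst_bounds` are matching phoneme boundary times (in frames)."""
    # Find which phoneme segment `t` falls in, clamped to valid range.
    i = max(1, min(bisect_right(src_bounds, t), len(src_bounds) - 1))
    s0, s1 = src_bounds[i - 1], src_bounds[i]
    d0, d1 = dst_bounds[i - 1], dst_bounds[i]
    frac = (t - s0) / (s1 - s0) if s1 > s0 else 0.0
    return d0 + frac * (d1 - d0)

# Illustrative boundaries: the same three phonemes, spoken at slightly
# different rates in the clean recording vs. the MRI recording.
clean = [0, 10, 30, 50]
mri   = [0, 15, 35, 60]
print(warp_time(20, clean, mri))  # midpoint of the 2nd phoneme -> 25.0
```

Because each phoneme segment is stretched on its own, a sound spoken quickly in one recording still lines up with the same sound spoken slowly in the other.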

The Results: A Happy Ending

Here is what they found:

  • The Mismatch (M2C) stumbled: When they took a model trained on noisy MRI audio and tested it on clean speech, accuracy dropped. The AI got confused because the acoustics it learned from didn't quite match the acoustics it was hearing.
  • The Clean Swap (C2C) worked amazingly: When they trained the model on clean speech, it performed almost as well as the "Perfect Match" model trained on the noisy MRI data.
    • The error rate was about 1.56 millimeters.
    • To put that in perspective: The MRI camera itself can only see details as small as 1.62 millimeters.
    • The Analogy: It's like trying to draw a picture of a face. The camera (MRI) can only see details down to the size of a grain of sand. The AI's drawing was only slightly less accurate than the camera's own view. That is incredibly precise!
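
The 1.56 mm figure is a root-mean-square error: average the squared distances between the predicted contour points and the MRI-traced ones, then take the square root. A minimal sketch of that metric, assuming matched 2D contour points in millimeters (the exact point set and averaging scheme the paper uses may differ):

```python
import math

def rmse_mm(predicted, ground_truth):
    """RMSE over matched pairs of 2D contour points (coordinates in mm)."""
    sq_dists = [(px - gx) ** 2 + (py - gy) ** 2
                for (px, py), (gx, gy) in zip(predicted, ground_truth)]
    return math.sqrt(sum(sq_dists) / len(sq_dists))

# Toy example: one point is exact, one is off by a 3-4-5 triangle (5 mm),
# so the RMSE is sqrt((0 + 25) / 2) ~ 3.54 mm.
print(rmse_mm([(0, 0), (3, 4)], [(0, 0), (0, 0)]))
```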

Why This Matters

Before this study, people thought you had to use noisy MRI data to get good results. This paper proves that you don't.

By using a smart alignment method and training on clean speech, we can now build systems that understand how our mouths move just by listening to us speak in a normal room. This opens the door for:

  • Better Voice Assistants: That understand accents and speech disorders better.
  • Medical Tools: Helping people with speech issues practice without needing a hospital scan.
  • Animation: Creating realistic talking avatars just from audio files.

In short, the researchers took a technology that was stuck in a noisy, expensive machine and successfully moved it out into the real world, proving that clean speech is just as powerful as the noisy kind for teaching computers how we speak.