Text-only adaptation in LLM-based ASR through text denoising

This paper introduces a lightweight, architecture-agnostic, text-only adaptation method for LLM-based ASR that frames domain adaptation as a text denoising task, effectively preserving speech-text alignment while achieving significant performance improvements over state-of-the-art approaches.

Andrés Carofilis, Sergio Burdisso, Esaú Villatoro-Tello, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke

Published Fri, 13 Ma

Imagine you have a brilliant, world-class translator: the LLM. This translator is amazing at reading books and writing essays. You also have a specialized Microphone (the speech encoder) that turns spoken words into a strange, garbled code. Finally, you have a Bridge (the projector) that connects the Microphone to the Translator.

When you train this system, the Bridge learns to translate the Microphone's garbled code into a format the Translator can understand. The Translator then acts like a detective, cleaning up the garbled code to write a perfect transcript.
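The three-part pipeline above can be sketched in a few lines of toy Python. Everything here is illustrative: the function names, the fake feature vectors, and the toy affine "bridge" are stand-ins for the real components, not the paper's implementation.

```python
# Toy sketch of the Microphone -> Bridge -> Translator pipeline.
# All names and numbers are illustrative, not the paper's actual code.

def speech_encoder(audio_frames):
    """The 'Microphone': turns audio frames into garbled acoustic features."""
    return [[x * 0.1 for x in frame] for frame in audio_frames]

def projector(acoustic_features, weight=2.0, bias=0.5):
    """The 'Bridge': a small learned map from acoustic features into the
    LLM's embedding space (here, a toy affine transform)."""
    return [[weight * x + bias for x in feat] for feat in acoustic_features]

def llm_decode(embeddings):
    """The 'Translator': conditions on the projected embeddings and emits
    a transcript (stubbed out here)."""
    return f"<transcript conditioned on {len(embeddings)} embeddings>"

audio = [[1.0, 2.0], [3.0, 4.0]]   # two fake audio frames
feats = speech_encoder(audio)      # garbled code
embeds = projector(feats)          # bridged into LLM embedding space
print(llm_decode(embeds))
```

In real systems only the Bridge (and sometimes the Translator) is trained, which is exactly why the Bridge is the fragile part when the Translator is adapted on text alone.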

The Problem: The "New Job" Dilemma

Now, imagine you want this team to work in a new field, like "Medical Insurance" or "Farm Machinery."

  • The Ideal Scenario: You have thousands of hours of recordings of people talking about these topics, along with perfect transcripts. You can retrain the whole team, and they become experts.
  • The Real-World Problem: You don't have those recordings. They are too expensive or hard to get. But, you do have millions of text documents (articles, manuals, chat logs) about these topics.

If you try to teach the Translator (LLM) just by reading these new text documents, something goes wrong. The Translator gets so focused on the new text that it forgets how to listen to the Microphone. The Bridge breaks. The team can write great essays about farming, but they can no longer transcribe a farmer's voice. This is called "Catastrophic Forgetting."

The Solution: The "Noise" Game

The authors of this paper came up with a clever trick. Instead of just feeding the Translator clean text, they decided to play a game of "Text Denoising."

Here is the analogy:

Imagine the Translator is a master editor who is used to fixing typos in rough drafts.

  1. The Old Way: You give the editor a clean article and say, "Learn this topic." The editor learns the topic but forgets how to fix typos.
  2. The New Way (This Paper): You take the clean article, intentionally ruin it, and give it to the editor. You scramble the letters, repeat words, and add typos. You say, "Here is a messy draft of a farming article. Please fix it and make it perfect."

By forcing the editor to fix the mess, they stay sharp at their job (cleaning up the Microphone's garbled code) while simultaneously learning the new vocabulary of the farming world.

How They Did It (The Recipe)

To make this work without breaking the team, they mixed their training "batches" (groups of practice examples) like a smoothie with four specific ingredients:

  1. The Original Audio (The Anchor): Real recordings of people speaking. This keeps the connection between the Microphone and the Translator strong.
  2. The "Simulated Noise" (The Bridge's Voice): They took real audio, ran it through their system to see what "garbled code" it produced, and then turned that code back into text. This teaches the Translator what the Bridge actually sounds like.
  3. The "Fake Noise" (The Practice): They took clean text from the new domain and randomly scrambled it (like a child typing on a keyboard). This teaches the Translator to fix typos in the new language.
  4. The Target Text (The Goal): Clean text from the new domain, but presented as a "messy" input that needs fixing.

By mixing all these together, the system learns two things at once:

  • "I still know how to listen to the microphone."
  • "I also know how to fix messy text about farming/insurance."

The Results

They tested this on two different worlds:

  1. Banking and Insurance: Where the new topics were similar to what they already knew.
  2. Agriculture and Animation: Where the topics were totally different.

The Outcome:

  • Their new method improved accuracy by up to 22%.
  • It beat other methods that tried to do the same thing.
  • In the best cases, the "Text-Only" team performed almost as well as a team that had access to thousands of hours of actual audio recordings.

The Takeaway

This paper is like teaching a musician to play a new genre of music. Instead of just giving them sheet music (which makes them forget how to play their instrument), you give them sheet music that has been scribbled on, torn, and taped back together. By forcing them to reconstruct the music, they learn the new genre without forgetting how to play their instrument.

It's a lightweight, smart way to upgrade AI systems using only the text data we already have, saving us the cost and trouble of recording new audio.