Text-only adaptation in LLM-based ASR through text denoising

This paper introduces a lightweight, architecture-agnostic, text-only adaptation method for LLM-based ASR that frames domain adaptation as a text denoising task, effectively preserving speech-text alignment while achieving significant performance improvements over state-of-the-art approaches.

Andrés Carofilis, Sergio Burdisso, Esaú Villatoro-Tello, Shashi Kumar, Kadri Hacioglu, Srikanth Madikeri, Pradeep Rangappa, Manjunath K E, Petr Motlicek, Shankar Venkatesan, Andreas Stolcke

Published Fri, 13 Ma

Imagine you have a brilliant, world-class translator: the LLM. This translator is amazing at reading books and writing essays. You also have a specialized Microphone (the speech encoder) that turns spoken words into a strange, garbled code. Finally, you have a Bridge (the projector) that connects the Microphone to the Translator.

When you train this system, the Bridge learns to translate the Microphone's garbled code into a format the Translator can understand. The Translator then acts like a detective, cleaning up the garbled code to write a perfect transcript.
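The three-part pipeline above can be sketched in a few lines of toy Python. Everything here is illustrative: the function names, the fake feature vectors, and the toy affine "bridge" are stand-ins for the real components, not the paper's implementation.

```python
# Toy sketch of the Microphone -> Bridge -> Translator pipeline.
# All names and numbers are illustrative, not the paper's actual code.

def speech_encoder(audio_frames):
    """The 'Microphone': turns audio frames into garbled acoustic features."""
    return [[x * 0.1 for x in frame] for frame in audio_frames]

def projector(acoustic_features, weight=2.0, bias=0.5):
    """The 'Bridge': a small learned map from acoustic features into the
    LLM's embedding space (here, a toy affine transform)."""
    return [[weight * x + bias for x in feat] for feat in acoustic_features]

def llm_decode(embeddings):
    """The 'Translator': conditions on the projected embeddings and emits
    a transcript (stubbed out here)."""
    return f"<transcript conditioned on {len(embeddings)} embeddings>"

audio = [[1.0, 2.0], [3.0, 4.0]]   # two fake audio frames
feats = speech_encoder(audio)      # garbled code
embeds = projector(feats)          # bridged into LLM embedding space
print(llm_decode(embeds))
```

In real systems only the Bridge (and sometimes the Translator) is trained, which is exactly why the Bridge is the fragile part when the Translator is adapted on text alone.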

The Problem: The "New Job" Dilemma

Now, imagine you want this team to work in a new field, like "Medical Insurance" or "Farm Machinery."

  • The Ideal Scenario: You have thousands of hours of recordings of people talking about these topics, along with perfect transcripts. You can retrain the whole team, and they become experts.
  • The Real-World Problem: You don't have those recordings. They are too expensive or hard to get. But, you do have millions of text documents (articles, manuals, chat logs) about these topics.

If you try to teach the Translator (LLM) just by reading these new text documents, something goes wrong. The Translator gets so focused on the new text that it forgets how to listen to the Microphone. The Bridge breaks. The team can write great essays about farming, but they can no longer transcribe a farmer's voice. This is called "Catastrophic Forgetting."

The Solution: The "Noise" Game

The authors of this paper came up with a clever trick. Instead of just feeding the Translator clean text, they decided to play a game of "Text Denoising."

Here is the analogy:

Imagine the Translator is a master editor who is used to fixing typos in rough drafts.

  1. The Old Way: You give the editor a clean article and say, "Learn this topic." The editor learns the topic but forgets how to fix typos.
  2. The New Way (This Paper): You take the clean article, intentionally ruin it, and give it to the editor. You scramble the letters, repeat words, and add typos. You say, "Here is a messy draft of a farming article. Please fix it and make it perfect."

By forcing the editor to fix the mess, they stay sharp at their job (cleaning up the Microphone's garbled code) while simultaneously learning the new vocabulary of the farming world.

How They Did It (The Recipe)

To make this work without breaking the team, they mixed their training "batches" (groups of practice examples) like a smoothie with four specific ingredients:

  1. The Original Audio (The Anchor): Real recordings of people speaking. This keeps the connection between the Microphone and the Translator strong.
  2. The "Simulated Noise" (The Bridge's Voice): They took real audio, ran it through their system to see what "garbled code" it produced, and then turned that code back into text. This teaches the Translator what the Bridge actually sounds like.
  3. The "Fake Noise" (The Practice): They took clean text from the new domain and randomly scrambled it (like a child typing on a keyboard). This teaches the Translator to fix typos in the new language.
  4. The Target Text (The Goal): Clean text from the new domain, but presented as a "messy" input that needs fixing.

By mixing all these together, the system learns two things at once:

  • "I still know how to listen to the microphone."
  • "I also know how to fix messy text about farming/insurance."

The Results

They tested this on two different worlds:

  1. Banking and Insurance: Where the new topics were similar to what they already knew.
  2. Agriculture and Animation: Where the topics were totally different.

The Outcome:

  • Their new method improved accuracy by up to 22%.
  • It beat other methods that tried to do the same thing.
  • In the best cases, the "Text-Only" team performed almost as well as a team that had access to thousands of hours of actual audio recordings.

The Takeaway

This paper is like teaching a musician to play a new genre of music. Instead of just giving them sheet music (which makes them forget how to play their instrument), you give them sheet music that has been scribbled on, torn, and taped back together. By forcing them to reconstruct the music, they learn the new genre without forgetting how to play their instrument.

It's a lightweight, smart way to upgrade AI systems using only the text data we already have, saving us the cost and trouble of recording new audio.