Imagine you have a brilliant, multilingual professor (the Speech LLM) who can understand and answer questions in many languages. However, this professor is "frozen" in time; they can't learn new things on their own without a massive amount of expensive, hand-written textbooks for every single language they need to know.
The problem? We don't have enough of these "textbooks" (paired audio and text) for languages like Vietnamese, Indonesian, or German. Plentiful "transcripts" (text versions of speech) exist mainly for English.
The Old Way: The "One-Size-Fits-All" Translator
Previously, researchers tried to teach this professor by showing them audio and its text transcript. They used a small, simple adapter (a projector) to translate the sound waves into words the professor could understand.
Think of this adapter as a universal translator that everyone shares.
- The Flaw: When you try to use this one translator for English, Chinese, and German all at once, it gets confused. It's like trying to use a single dictionary to translate poetry, legal contracts, and slang all at the same time. The "loud" languages (like English) drown out the "quiet" ones (like Indonesian), causing the professor to mix up words or give wrong answers. This is called language interference.
The New Solution: The "Smart Switchboard"
This paper introduces a clever new system called Language-Aware Distillation. Instead of one confused translator, they built a Smart Switchboard with a special Query Bank.
Here is how it works, using a simple analogy:
1. The Query Bank (The Library of Specialized Keys)
Imagine the old system had one master key that tried to open every door. It worked okay for similar doors, but failed on unique ones.
The new system has a library of specialized keys (Query Tokens). There is a specific key for English, a specific key for Chinese, a specific key for Spanish, and so on.
2. The Gating Network (The Bouncer)
Before the audio reaches the professor, it hits a Bouncer (the Gating Network).
- When you speak in Spanish, the Bouncer instantly recognizes the accent and picks up the Spanish Key.
- When you speak in German, it swaps it for the German Key.
- It can even mix keys if the language is a blend, but usually, it picks the perfect one.
This ensures that the "English Key" never gets in the way of the "Indonesian Key." They stay in their own lanes, preventing the confusion that happened before.
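The switchboard idea above can be sketched in a few lines. A gating network turns the audio into one weight per language, and those weights mix language-specific query tokens out of the bank. Everything here (the sizes, the linear gate, the plain softmax mixture) is an illustrative assumption, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- chosen for the toy example, not from the paper.
NUM_LANGS = 6        # e.g. en, zh, de, es, vi, id
NUM_QUERIES = 4      # query tokens per language
DIM = 8              # embedding dimension

# The Query Bank: one set of learnable query tokens per language.
query_bank = rng.normal(size=(NUM_LANGS, NUM_QUERIES, DIM))

# The Gating Network (the "Bouncer"): here just a linear layer over a
# pooled audio feature, producing one score per language.
gate_weights = rng.normal(size=(DIM, NUM_LANGS))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_queries(audio_features):
    """Pick (a mixture of) language-specific query tokens for this clip."""
    pooled = audio_features.mean(axis=0)          # summarize the audio
    lang_probs = softmax(pooled @ gate_weights)   # "which language is this?"
    # Soft mixture: mostly the top language's "key", a little of the rest.
    queries = np.tensordot(lang_probs, query_bank, axes=1)
    return queries, lang_probs

audio = rng.normal(size=(50, DIM))   # 50 frames of speech-encoder output
queries, probs = select_queries(audio)
print(queries.shape)                 # (NUM_QUERIES, DIM) -- the chosen "key"
print(probs.round(2))                # one weight per language, summing to 1
```

A soft mixture (rather than a hard pick) is what lets the switchboard "mix keys" for blended or code-switched speech while still letting one language dominate in the usual case.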
3. Learning by Listening (Distillation)
The system doesn't need thousands of hours of human-labeled data for every language. It uses a trick called Distillation:
- It takes a recording of someone speaking.
- It compares the sound to the text transcript.
- It teaches the "Smart Switchboard" to make the sound look exactly like the text to the frozen professor.
- The Magic: It does this using only 5,800 hours of data total to support 6 different languages. That's incredibly efficient compared to other methods that need millions of hours.
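The distillation loop above can be sketched as a tiny optimization problem: hold the "professor's" text embeddings fixed, and train only the adapter so that projected audio lands near the embeddings of its own transcript. The embedding table, the linear projector, and the mean-squared loss below are toy assumptions standing in for the real model:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8

# The frozen "professor": a fixed text-embedding table we never update.
text_embed = rng.normal(size=(100, DIM))      # 100-word toy vocabulary

def distill_loss(audio_repr, transcript_ids):
    """How far the projected audio is from the frozen text embeddings of
    its own transcript: smaller = the sound 'looks like' the text."""
    target = text_embed[transcript_ids]        # the teacher signal
    return float(np.mean((audio_repr - target) ** 2))

# Only the adapter (a linear projector here) gets trained.
projector = rng.normal(size=(DIM, DIM))
audio = rng.normal(size=(5, DIM))              # 5 frames of audio features
transcript = np.array([3, 17, 42, 42, 9])      # the matching transcript tokens

loss_before = distill_loss(audio @ projector, transcript)
lr = 0.05
for _ in range(300):                           # plain gradient descent
    out = audio @ projector
    grad = 2 * audio.T @ (out - text_embed[transcript]) / len(audio)
    projector -= lr * grad
loss_after = distill_loss(audio @ projector, transcript)

print(loss_before, "->", loss_after)           # loss shrinks as audio aligns with text
```

Because only the small adapter moves while the large model stays frozen, each hour of paired audio and text goes a long way, which is why a few thousand hours can cover several languages.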
The Results: A Multilingual Super-Student
The researchers tested this new system on two types of tasks:
- Open-Ended Chat: "Tell me a story about a cat in Indonesian."
- Closed-Ended Questions: "Based on this audio, what is the capital of Vietnam?"
The Outcome:
- The new system beat the previous best models by 14% in general conversation.
- For specific questions, it improved performance by a massive 32%.
- Most importantly, it saved the "low-resource" languages (like Indonesian) from being ignored, allowing them to perform just as well as the dominant languages.
Why This Matters
Think of this as upgrading a global call center.
- Before: You had one agent who spoke English perfectly but struggled with other languages because they were trying to use the same mental "dictionary" for everything. Customers speaking less common languages got frustrated.
- Now: You have a smart system that instantly routes the call to the agent with the perfect specialized dictionary for that specific language. The customers are happier, the system is cheaper to run, and no language is left behind.
In short, this paper teaches AI how to speak many languages clearly without getting confused, using a tiny fraction of the data usually required.