Imagine you are trying to understand a friend who is speaking a language you don't know very well, or who has a very strong accent. If they say, "I went to the [mumble] yesterday," you might guess "bank," "park," or "bar." It's a guess.
But, if you know two things:
- What they were talking about five minutes ago (Context): "We were discussing our savings."
- A list of words they really want to use (Biasing): "They mentioned 'mortgage' and 'interest rate' earlier."
Suddenly, the mumble becomes clear: "I went to the bank yesterday."
This paper is about teaching a computer to do exactly that: use "conversation history" and "keyword lists" to understand speech much better, especially when dealing with many different languages and accents.
Here is the breakdown of their invention, using some simple analogies.
1. The Problem: The "Amnesiac" Translator
Most current speech-to-text systems are like a student who has a great memory for grammar but no memory of the conversation.
- They hear a sentence, translate it, and then immediately forget it.
- If you say, "I'm going to the bank," they have no way to tell whether you mean the financial institution or the side of a river, because they never heard what came before.
- They also struggle when you switch languages or speak with a heavy accent.
2. The Solution: The "Super-Helper" System
The researchers built a system that acts like a Super-Helper sitting next to the translator. This helper has two jobs:
- The Memory Keeper: It remembers what was said in the last few turns of the conversation (Dialogue History).
- The Cheat Sheet: It holds a list of specific words the speaker is likely to use (Biasing Words), like names of people, places, or technical terms.
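The two helpers above can be pictured as a simple prompt-building step. This is a hypothetical sketch only; the function name, the text format, and the `max_turns` cutoff are illustrative, not taken from the paper:

```python
# Hypothetical sketch: package dialogue history and biasing words into one
# context string the recognizer can condition on. Format is illustrative.

def build_context_prompt(dialogue_history, biasing_words, max_turns=3):
    """Combine recent dialogue turns ("Memory Keeper") and a biasing
    word list ("Cheat Sheet") into a single context prompt."""
    recent = dialogue_history[-max_turns:]   # keep only the last few turns
    history = " ".join(recent)
    cheats = ", ".join(biasing_words)
    return f"History: {history}\nLikely words: {cheats}\nTranscribe:"

prompt = build_context_prompt(
    ["We were discussing our savings.", "The mortgage rate went up."],
    ["mortgage", "interest rate", "bank"],
)
print(prompt)
```

With both hints in the prompt, the mumbled word from the opening example is far more likely to be resolved as "bank."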
3. How It Works: The "Translator" and the "Bridge"
The system is built from three main parts, which the researchers cleverly kept separate:
- The Ears (Frozen Speech Encoder): This is a pre-trained AI that is already an expert at listening to sound. The researchers didn't change its brain; they just kept it as is because it's already very good.
- The Brain (Frozen Language Model): This is a pre-trained AI that is an expert at writing and understanding text in many languages. Again, they didn't change its brain.
- The Bridge (The New Part): This is the only thing they actually built and trained. It's a small, lightweight connector that takes the "sound" from the Ears and the "hints" from the Helper, and translates them into a language the Brain understands.
The Analogy: Imagine the Ears are a person who speaks only "Sound," and the Brain is a person who speaks only "Text." The Bridge is a translator who learns to say, "The sound I heard matches the text hint 'Bank' because we were talking about money."
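One way to picture the frozen/trainable split is in PyTorch. Everything below is a stand-in: the tiny linear "encoder," the dimensions, and the two-layer bridge are placeholders, not the paper's actual models. The point is only that gradients flow through the Bridge and nothing else:

```python
import torch
import torch.nn as nn

class Bridge(nn.Module):
    """The only trainable part: maps speech features into the LLM's space."""
    def __init__(self, speech_dim, llm_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(speech_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats):
        return self.proj(speech_feats)

# Stand-in for a real pre-trained speech encoder (the "Ears").
speech_encoder = nn.Linear(80, 512)
for p in speech_encoder.parameters():
    p.requires_grad = False           # frozen: the Ears are never retrained

bridge = Bridge(speech_dim=512, llm_dim=1024)   # trainable connector

audio = torch.randn(1, 200, 80)       # a batch of audio feature frames
with torch.no_grad():
    feats = speech_encoder(audio)     # "Sound" representation
llm_inputs = bridge(feats)            # something the frozen "Brain" can read
```

The frozen LLM (the "Brain") would then consume `llm_inputs` alongside the context prompt; only the Bridge's parameters ever receive gradient updates.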
4. The Secret Sauce: "Contrastive Learning" (The "Matchmaker")
This is the most innovative part of the paper.
Usually, when you give a computer a list of words (like "bank, river, money"), it just glues them together with the sound. It's like putting a puzzle piece next to a picture without checking if it fits.
The researchers added a Matchmaker (Contrastive Learning).
- The Goal: The Matchmaker's job is to make sure the "Sound" and the "Context Hint" are hugging each other tightly in the computer's mind if they belong together.
- The Training: If the sound is "I went to the bank" and the hint is "money," the Matchmaker pulls them closer. If the sound is "I went to the bank" but the hint is "river," the Matchmaker pushes them apart.
- The Result: The system learns to feel the connection between the sound and the context, rather than just guessing. It becomes much more confident.
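The pull-together/push-apart idea above is commonly implemented as an InfoNCE-style contrastive loss. Here is a minimal sketch; the embedding sizes and temperature are illustrative, and the paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, context_emb, temperature=0.07):
    """InfoNCE-style loss: matched (audio, context) pairs sit on the
    diagonal of the similarity matrix and are pulled together; all
    mismatched pairs in the batch are pushed apart."""
    a = F.normalize(audio_emb, dim=-1)
    c = F.normalize(context_emb, dim=-1)
    logits = a @ c.T / temperature          # similarity of every pair
    targets = torch.arange(len(a))          # i-th audio matches i-th context
    return F.cross_entropy(logits, targets)

audio_emb = torch.randn(4, 256)    # e.g. embedding of "I went to the bank"
context_emb = torch.randn(4, 256)  # e.g. embedding of the "money" hint
loss = contrastive_loss(audio_emb, context_emb)
```

Minimizing this loss is exactly the Matchmaker's job: the sound "I went to the bank" ends up close to the hint "money" and far from the hint "river."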
5. The Results: A Multilingual Superpower
They tested this on over 1,500 hours of real conversations in 11 different languages (like English, French, Japanese, Thai) and 5 different English accents (American, British, Indian, etc.).
- Without Context: The system made mistakes about 21% of the time.
- With Context: The mistakes dropped to about 16%.
- With Context + The Matchmaker: The mistakes dropped even further, to about 15.5%.
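In relative terms, the (approximate, rounded) figures above work out as follows:

```python
# Rough arithmetic on the reported error rates. These are the rounded
# figures from the summary above, not exact numbers from the paper.
baseline = 0.21        # without context
with_context = 0.16    # with dialogue history + biasing words
with_matchmaker = 0.155  # adding contrastive learning on top

rel_drop_context = (baseline - with_context) / baseline
rel_drop_full = (baseline - with_matchmaker) / baseline
print(f"Context alone cuts errors by ~{rel_drop_context:.0%}")
print(f"Context + Matchmaker cuts errors by ~{rel_drop_full:.0%}")
```

So context alone removes roughly a quarter of the errors, and the Matchmaker shaves off a bit more on top of that.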
Why this matters:
- It works everywhere: It helped with difficult languages like Japanese and Thai, and tricky accents like Indian English.
- It's efficient: Because they didn't have to retrain the massive "Ears" or "Brain," the system is fast and doesn't need a supercomputer to run.
- It's smart: It realized that sometimes "History" (what was said before) is the most helpful hint, and sometimes a "Cheat Sheet" (specific words) is better. The Matchmaker helps decide which one to trust.
The Takeaway
This paper shows that to make speech recognition truly human-like, we need to stop treating every sentence as an isolated event. By giving the AI a memory of the conversation and a list of likely words, and then teaching it to deeply connect those clues to the sound, we get a much smarter, more accurate translator that works across the whole world.
It's like upgrading from a dictionary that only defines words, to a friend who knows your story, your vocabulary, and what you're likely to say next.