TSPC: A Two-Stage Phoneme-Centric Architecture for Code-Switching Vietnamese-English Speech Recognition

Imagine you are trying to translate a conversation where someone is speaking Vietnamese but suddenly switches to English words in the middle of a sentence. This is called "code-switching."

For a standard speech recognition system (like the ones in Siri or Google Assistant), this is a nightmare. Why? Because Vietnamese and English share many similar sounds, so the same stretch of audio can plausibly match a word in either language.

The Problem: The "Con Sót" Trap

The paper gives a funny but frustrating example. If a Vietnamese speaker says the English word "concert," a standard AI might hear it and write down "con sót" (which means "the remaining child" in Vietnamese).

Why? Because to the AI, the two are almost identical in sound. The AI gets confused because it's trying to map sound directly to text without understanding the "accent" or the "tone" of the language being spoken. It's like trying to read a book written in two different languages mixed together, but the font looks the same for both.

The Solution: TSPC (The Two-Stage Detective)

The authors propose a new system called TSPC (Two-Stage Phoneme-Centric). Instead of trying to guess the word immediately, they break the job down into two steps, like a detective solving a mystery in two phases.

Phase 1: The Sound-to-Sound Translator (S2P)

The Analogy: Imagine you are a translator who doesn't speak the final language yet. You just listen to the audio and write down a list of sounds (phonemes) and tones.

In this stage, the computer ignores the actual words. It just says: "Okay, I heard a 'k' sound, followed by an 'o' sound, with a rising tone."

The Magic Trick: The system treats English words as if they were Vietnamese sounds. It translates "concert" into a Vietnamese-style sound pattern (like "con sót") temporarily.
Why? Because Vietnamese has a very strict system of 6 different tones. By forcing the English words into this Vietnamese "tone framework," the computer can organize the messy sounds into a neat, structured list. It's like putting all the loose puzzle pieces into a specific tray before trying to build the picture.

Phase 2: The Sound-to-Word Translator (P2T)

The Analogy: Now, you have a list of sounds. This second stage is like a spell-checker or a translator who looks at that list of sounds and figures out what the actual words should be.

If the list says "con sót," the second stage looks at the context. Was the person talking about music? If yes, it corrects the output to "concert."
Was the person talking about a child? Then it keeps it as "con sót."

This stage uses a "masking" technique (hiding parts of the sound list during training) to teach the computer to lean on context, so it can still recover the right words when the list from Phase 1 contains mistakes.

Why is this better?

Think of standard AI as a fast runner who tries to sprint from the starting line (audio) to the finish line (text) in one go. When the path gets bumpy (code-switching), they trip and fall.

The TSPC system is like a careful hiker.

First, they stop and map the terrain (convert audio to sounds/tones).
Then, they use that map to figure out the destination (convert sounds to words).

The Results

The researchers tested this on a mix of Vietnamese and English.

Old AI (the PhoWhisper-base baseline): a word error rate of about 28%.
New TSPC AI: a word error rate of about 19%.

Even better, they did this with less computing power than the giants (like Whisper or Qwen) use. It's like building a high-speed train that runs on a bicycle battery.

The Big Takeaway

The paper shows that when languages get mixed up, it's better to stop trying to guess the words immediately. Instead, translate the sounds first, organize them using the rules of the local language (Vietnamese tones), and then figure out the words. It's a smarter, more efficient way to handle the messy reality of how humans actually speak.

Here is a detailed technical summary of the paper "TSPC: A Two-Stage Phoneme-Centric Architecture for Code-Switching Vietnamese-English Speech Recognition."

1. Problem Statement

The paper addresses the significant challenge of Code-Switching (CS) in Automatic Speech Recognition (ASR), specifically for the Vietnamese-English language pair.

Core Issue: Existing End-to-End (E2E) ASR models often fail to distinguish between languages when speakers alternate within a sentence. This leads to phonetic confusion, where English words are incorrectly transcribed as phonetically similar Vietnamese words (e.g., "concert" $\rightarrow$ "con sót").
Specific Complexity:
- Phonological Overlap: Vietnamese and English share many consonants and vowels, creating acoustic ambiguity.
- Tonal Interference: Vietnamese is a tonal language (6 tones), while English is not. Vietnamese speakers often adapt English words into tonal syllables, creating unique inter-lingual homophones that standard, tone-insensitive models struggle to disambiguate.
- Data Scarcity: There is a lack of large-scale, naturalistic code-switched corpora for low-resource languages, making it difficult for models to learn distributional overlaps.

2. Methodology: The TSPC Architecture

The authors propose TSPC (Two-Stage Phoneme-Centric), a novel architecture that decomposes the ASR task into two specialized stages using a unified phoneme representation. Instead of a direct acoustic-to-text mapping, the model uses an intermediate phoneme layer.
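As a rough illustration of this decomposition, the two stages can be viewed as two functions composed through a phoneme sequence. This is only a sketch: the names `make_pipeline`, `toy_s2p`, `toy_p2t`, and the `Phonemes` type are hypothetical stand-ins, and the toy stages simply split and join text rather than running any model.

```python
from typing import Callable, List

Phonemes = List[str]  # hypothetical type: one string per tone-aware phoneme token


def make_pipeline(s2p: Callable[[bytes], Phonemes],
                  p2t: Callable[[Phonemes], str]) -> Callable[[bytes], str]:
    """Compose Speech-to-Phone and Phone-to-Text into one transcriber."""
    def transcribe(audio: bytes) -> str:
        phonemes = s2p(audio)   # stage 1: audio -> phoneme sequence
        return p2t(phonemes)    # stage 2: phoneme sequence -> text
    return transcribe


# Toy stand-ins so the sketch runs; the real stages would be neural models.
def toy_s2p(audio: bytes) -> Phonemes:
    return audio.decode("utf-8").split()


def toy_p2t(phonemes: Phonemes) -> str:
    return "".join(phonemes)


transcribe = make_pipeline(toy_s2p, toy_p2t)
```

The point of the composition is that the only interface between the stages is the phoneme sequence, so each stage can be trained or swapped independently.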

A. Unified Vietnamese Phoneme Representation

The core innovation is mapping English lexical items into a unified Vietnamese phonemic space.

Concept: English words are decomposed and aligned with acoustically similar Vietnamese syllables based on phonetic similarity (e.g., English "play" $\rightarrow$ Vietnamese syllable "ây").
Implementation: A "Team Convention and Voting Phase" involving linguistic experts determines the best Vietnamese syllabic candidates for English words. These are then converted into tone-aware phoneme sequences using the VLSP Standard Phoneme Set.
Result: English words are represented as Vietnamese phoneme sequences (e.g., "assistant" $\rightarrow$ specific tone-marked phoneme strings), allowing the model to treat mixed-language input as a single linguistic stream.
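The mapping described above can be pictured as two lookups: English word to Vietnamese syllables, then syllables to tone-aware phoneme tokens. The sketch below assumes hand-written toy tables; the entries, the phoneme symbols, and the `<tone:...>` labels are placeholders for illustration and do not reproduce the paper's expert-voted mappings or the VLSP Standard Phoneme Set.

```python
from typing import Dict, List

# Hypothetical entries for illustration only (not the paper's actual lexicon).
EN_TO_VI_SYLLABLES: Dict[str, List[str]] = {
    "concert": ["con", "sót"],
}

# Toy syllable -> phoneme tokens table; symbols and tone labels are placeholders.
SYLLABLE_TO_PHONEMES: Dict[str, List[str]] = {
    "con": ["k", "o", "n", "<tone:level>"],
    "sót": ["s", "o", "t", "<tone:rising>"],
}


def to_unified_phonemes(word: str) -> List[str]:
    """Map a word to tone-aware phoneme tokens via Vietnamese syllables."""
    key = word.lower()
    # English words go through the lexicon; anything else is treated as a syllable.
    syllables = EN_TO_VI_SYLLABLES.get(key, [key])
    tokens: List[str] = []
    for syl in syllables:
        if syl not in SYLLABLE_TO_PHONEMES:
            raise KeyError(f"no phoneme entry for syllable {syl!r}")
        tokens.extend(SYLLABLE_TO_PHONEMES[syl])
    return tokens
```

Under these toy tables, the English word "concert" and the Vietnamese phrase "con sót" produce the same token sequence, which is exactly the ambiguity the second stage has to resolve from context.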

B. Two-Stage Pipeline

Stage 1: Speech-to-Phone (S2P)
- Input: Raw audio.
- Function: Converts acoustic signals into tone-aware phoneme sequences.
- Architecture: Uses a frozen PhoWhisper-base encoder (pre-trained on large Vietnamese datasets) for feature extraction, followed by a Transformer decoder trained to predict phonemes.
- Goal: Explicitly model both tonal and non-tonal phonemic inventories.
Stage 2: Phone-to-Text (P2T)
- Input: The phoneme sequence generated by the S2P stage.
- Function: Converts the phoneme sequence back into orthographic text (handling the code-switching).
- Architecture: Framed as a Machine Translation (MT) task using the T5 model.
- Noise Mitigation: To handle errors propagating from the S2P stage, the authors employ a masking strategy inspired by Masked Language Modeling (MLM). The P2T encoder is pre-trained with masked phoneme inputs to learn robust contextual representations.
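A minimal sketch of MLM-style masking over a phoneme sequence follows. The mask token name, the 15% default rate (borrowed from common masked language modeling practice), and the function name are assumptions; the summary does not state the authors' exact masking scheme.

```python
import random
from typing import List, Optional

MASK = "<mask>"  # hypothetical mask token name


def mask_phonemes(tokens: List[str], rate: float = 0.15,
                  rng: Optional[random.Random] = None) -> List[str]:
    """Independently replace each token with MASK with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = rng or random.Random()
    return [MASK if rng.random() < rate else tok for tok in tokens]
```

Training the P2T encoder to work from inputs corrupted this way is one way to make it less sensitive to individual wrong or missing phonemes coming out of the S2P stage.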

C. Training Strategy

Joint Fine-Tuning: The S2P and P2T models are integrated and fine-tuned together.
Freezing Strategies: The authors experimented with freezing the S2P parameters (to ensure consistent phonetic output) while updating the P2T decoder. They also tested "encoder-only" fine-tuning for the P2T stage to adapt to the predicted phonemes without losing pre-trained text generation capabilities.
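The freezing strategies above amount to choosing which parameter groups receive gradient updates. The sketch below expresses that choice as a name-prefix filter; the strategy names and the `s2p.` / `p2t.encoder.` / `p2t.decoder.` prefixes are hypothetical, and real checkpoints will name their parameters differently.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical strategy names and parameter-name prefixes.
STRATEGIES: Dict[str, Tuple[str, ...]] = {
    # S2P frozen, P2T decoder updated
    "s2p_frozen_p2t_decoder": ("p2t.decoder.",),
    # only the P2T encoder adapts to predicted phonemes
    "p2t_encoder_only": ("p2t.encoder.",),
}


def trainable_flags(param_names: Iterable[str], strategy: str) -> Dict[str, bool]:
    """Return {parameter name: should it be updated} for a freezing strategy."""
    prefixes = STRATEGIES[strategy]
    return {name: name.startswith(prefixes) for name in param_names}
```

In a framework such as PyTorch, flags like these would typically be applied by setting each parameter's `requires_grad` attribute before building the optimizer.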

3. Key Contributions

Phoneme-Centric Intermediate Representation: Introduced a method to unify English and Vietnamese into a single phonemic space, effectively resolving cross-lingual acoustic overlaps by leveraging Vietnamese tonal structures.
Two-Stage Architecture: Demonstrated that decoupling acoustic-to-phoneme and phoneme-to-text tasks improves robustness in code-switching scenarios compared to direct E2E models.
Low-Resource Efficiency: Achieved state-of-the-art results using significantly fewer computational resources and training data compared to massive multilingual models (e.g., Whisper, MMS).
Robustness via Masking: Developed a masking strategy for the P2T stage to mitigate error propagation from the S2P stage, a common issue in cascaded systems.

4. Experimental Results

The model was evaluated on Vietnamese-English code-switching (CS) and pure Vietnamese (Vi) test sets.

Code-Switching Performance:
- Baseline (PhoWhisper-base): 27.90% Word Error Rate (WER).
- TSPC (Baseline): 25.35% WER.
- TSPC (with SSL P2T Encoder + Joint Fine-Tuning): 19.06% WER.
- Comparison: TSPC significantly outperformed other baselines, including Qwen3-ASR-0.6B (38.93%), Wav2Vec2-vn-base (38.06%), and Whisper-Large-v3-turbo (31.60%).
Vietnamese-Only Performance:
- TSPC achieved 15.87% WER on the overall Vietnamese test set, which is competitive with PhoWhisper-base (14.05%) despite using fewer resources, and significantly better than Wav2VecVN (21.70%).
Ablation Studies:
- Masking: Using a masked encoder for P2T improved BLEU scores for code-switching cases by ~0.79%.
- Fine-tuning Strategy: The "encoder-only" fine-tuning strategy for the P2T stage yielded the best CS results (17.78% WER in specific ablation setups), suggesting that preserving pre-trained text generation knowledge while adapting the encoder is important.
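For readers unfamiliar with the metric, the sketch below computes Word Error Rate in its standard form (word-level edit distance divided by the number of reference words) and the relative reduction between two error rates. The function names are illustrative, and the summary does not describe the authors' scoring script, including any text normalization it may apply.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fractional error reduction relative to the baseline."""
    return (baseline - new) / baseline
```

Transcribing "concert" as "con sót" costs two word errors under this definition (one substitution plus one insertion). Applied to the reported figures, going from 27.90% to 19.06% WER corresponds to a relative reduction of about 32%.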

5. Significance and Future Work

Significance: The paper argues that phonological grounding is critical for low-resource code-switching ASR. By explicitly modeling the tonal and phonetic overlap between Vietnamese and English, TSPC outperforms the compared baselines on code-switched speech without requiring massive datasets or compute power. The approach may also offer a viable path for ASR in other tonal/non-tonal language pairs.
Limitations:
- The S2P model is limited by the amount of training data (200 hours), which may not cover all phonetic variations.
- Synthetic data generation for training lacks diversity and audio quality.
- Current Transformer models do not fully capture the structural relationships between phonemes.
Future Directions: The authors suggest incorporating graph-based modeling (e.g., GraphRAG) to explicitly represent the structural relations and syntactic roles of phonemes, which could further enhance the model's ability to handle complex phoneme sequences.

In conclusion, TSPC replaces direct acoustic-to-text mapping with a phoneme-centric, two-stage approach, offering an efficient and, on the reported benchmarks, more accurate solution for Vietnamese-English code-switching speech recognition.