TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition

This paper proposes TSPC, a novel two-stage phoneme-centric architecture that leverages an extended Vietnamese phoneme set as an intermediate representation to significantly improve Vietnamese-English code-switching speech recognition accuracy while maintaining computational efficiency.

Tran Nguyen Anh, Truong Dinh Dung, Vo Van Nam, Minh N. H. Nguyen

Published 2026-03-06

Imagine you are trying to write down a conversation in which someone is speaking Vietnamese but suddenly drops English words into the middle of a sentence. This is called "code-switching."

For a standard speech recognition program (like the one behind Siri or Google Assistant), this is a nightmare. Why? Because many Vietnamese and English sounds are similar, so the same stretch of audio can be heard as words in either language, with completely different meanings.

The Problem: The "Con Sót" Trap

The paper gives a funny but frustrating example. If a Vietnamese speaker says the English word "concert," a standard AI might hear it and write down "con sót" (which means "the remaining child" in Vietnamese).

Why? Because to the AI, the two are almost identical in sound. The AI gets confused because it's trying to map sound directly to text without understanding the "accent" or the "tone" of the language being spoken. It's like trying to read a book written in two different languages mixed together, but the font looks the same for both.

The Solution: TSPC (The Two-Stage Detective)

The authors propose a new system called TSPC (Two-Stage Phoneme-Centric). Instead of trying to guess the word immediately, they break the job down into two steps, like a detective solving a mystery in two phases.

Phase 1: The Sound-to-Sound Translator (S2P)

The Analogy: Imagine you are a translator who doesn't speak the final language yet. You just listen to the audio and write down a list of sounds (phonemes) and tones.

In this stage, the computer ignores the actual words. It just says: "Okay, I heard a 'k' sound, followed by an 'o' sound, with a rising tone."

  • The Magic Trick: The system treats English words as if they were Vietnamese sounds. It translates "concert" into a Vietnamese-style sound pattern (like "con sót") temporarily.
  • Why? Because Vietnamese has a very strict system of 6 different tones. By forcing the English words into this Vietnamese "tone framework," the computer can organize the messy sounds into a neat, structured list. It's like putting all the loose puzzle pieces into a specific tray before trying to build the picture.
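
To make the "tone framework" concrete, here is a minimal sketch of what a list of sounds with tones could look like. It is not the paper's phoneme set: it uses ordinary Vietnamese spelling as a stand-in for phonemes, reads the tone off the accent mark on each syllable, and relies on a hypothetical one-entry lexicon (ENGLISH_AS_VIETNAMESE) that respells "concert" the Vietnamese way. The names split_tone and to_sound_units are made up for illustration.

```python
from __future__ import annotations

import unicodedata

# Unicode combining marks that carry tone in standard Vietnamese spelling.
# Tone labels are written without diacritics to keep the code ASCII-only.
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0303": "nga",    # tilde
    "\u0309": "hoi",    # hook above
    "\u0323": "nang",   # dot below
}


def split_tone(syllable: str) -> tuple[str, str]:
    """Return (syllable without tone mark, tone label).

    A syllable with no tone mark carries the sixth tone, 'ngang' (level).
    """
    tone = "ngang"
    kept = []
    # NFD splits each accented letter into a base letter plus combining marks.
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # Recompose what is left so vowel marks (as in "ô") stay attached.
    return unicodedata.normalize("NFC", "".join(kept)), tone


# Hypothetical lexicon: English words respelled as Vietnamese-style syllables.
ENGLISH_AS_VIETNAMESE = {"concert": "con sót"}


def to_sound_units(text: str) -> list[tuple[str, str]]:
    """Turn text into a flat list of (syllable, tone) pairs."""
    units = []
    for word in text.lower().split():
        respelled = ENGLISH_AS_VIETNAMESE.get(word, word)
        units.extend(split_tone(s) for s in respelled.split())
    return units


print(to_sound_units("concert"))
print(to_sound_units("con sót"))
```

Both calls print the same two pairs, which is the point: at this stage the English word and the Vietnamese phrase are deliberately indistinguishable, and telling them apart is left to the next stage.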

Phase 2: The Sound-to-Word Translator (P2T)

The Analogy: Now, you have a list of sounds. This second stage is like a spell-checker or a translator who looks at that list of sounds and figures out what the actual words should be.

  • If the list says "con sót," the second stage looks at the context. Was the person talking about music? If yes, it corrects the output to "concert."
  • Was the person talking about a child? Then it keeps it as "con sót."
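
The two bullets above amount to "let the surrounding words decide." A toy version of that decision can be sketched as follows. The paper's second stage is a trained model, not a keyword list, so everything here (the CUES table, the pick_reading function, the chosen cue words) is a hypothetical illustration of the idea only.

```python
import unicodedata


def _nfc(text: str) -> str:
    """Lowercase and normalize so accented letters compare reliably."""
    return unicodedata.normalize("NFC", text.lower())


# Hypothetical cue words for each reading of the same sounds.
CUES = {
    "concert": {_nfc(w) for w in ("nhạc", "vé", "hát")},  # music, ticket, sing
    "con sót": {_nfc(w) for w in ("mẹ", "bé", "nhà")},    # mother, baby, home
}


def pick_reading(context: str, default: str = "con sót") -> str:
    """Pick the reading whose cue words appear most often in the context."""
    words = set(_nfc(context).split())
    scores = {reading: len(words & cues) for reading, cues in CUES.items()}
    best = max(scores, key=scores.get)
    # With no cue at all, fall back to the plain Vietnamese reading.
    return best if scores[best] > 0 else default


print(pick_reading("tối nay tôi đi nghe nhạc"))  # a sentence about music
print(pick_reading("mẹ và bé ở nhà"))            # a sentence about family
```

A real model scores context far more flexibly than counting keywords, but the input and output have the same shape: ambiguous sounds plus context go in, one written form comes out.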

This stage uses a "masking" technique (hiding parts of the text) to teach the computer to be smart about context, so it doesn't make the same mistake twice.
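
As a rough illustration of "hiding parts of the text," the sketch below replaces a random share of the items in a sequence with a placeholder and keeps the untouched sequence as the answer the model should recover. The summary does not say what exactly is masked or how often, so the "<mask>" token, the 30% rate, and the function names are all assumptions.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token


def mask_tokens(tokens, rate, rng):
    """Replace each token with MASK with probability `rate`."""
    return [MASK if rng.random() < rate else tok for tok in tokens]


def make_training_pair(tokens, rate=0.3, seed=0):
    """Return (masked input, original target) for one training example."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    return mask_tokens(tokens, rate, rng), list(tokens)


sounds = ["toi", "di", "nghe", "con", "sot"]
masked, target = make_training_pair(sounds)
print(masked)
print(target)
```

Because some sounds are missing from the input, the model cannot simply copy what it sees and has to lean on the neighbouring items to fill the gaps, which is the kind of context use the second stage needs.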

Why is this better?

Think of standard AI as a fast runner who tries to sprint from the starting line (audio) to the finish line (text) in one go. When the path gets bumpy (code-switching), they trip and fall.

The TSPC system is like a careful hiker.

  1. First, they stop and map the terrain (convert audio to sounds/tones).
  2. Then, they use that map to figure out the destination (convert sounds to words).
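
The two steps above can be sketched as two functions chained together. This shows only the shape of the design, with stand-in stages that return fixed values; the class name TwoStagePipeline and both stub functions are hypothetical, and the real stages are trained neural models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

SoundUnits = List[str]


@dataclass
class TwoStagePipeline:
    sound_stage: Callable[[Sequence[float]], SoundUnits]  # audio -> sounds
    text_stage: Callable[[SoundUnits], str]               # sounds -> text

    def transcribe(self, audio: Sequence[float]) -> str:
        units = self.sound_stage(audio)  # step 1: map the terrain
        return self.text_stage(units)    # step 2: find the destination


def stub_sound_stage(audio: Sequence[float]) -> SoundUnits:
    # Stand-in: a real model would derive these units from the audio samples.
    return ["toi", "di", "nghe", "con", "sot"]


def stub_text_stage(units: SoundUnits) -> str:
    # Stand-in: a real model would use context to choose the written form.
    text = " ".join(units)
    return text.replace("con sot", "concert") if "nghe" in units else text


pipeline = TwoStagePipeline(stub_sound_stage, stub_text_stage)
print(pipeline.transcribe([0.0, 0.1, -0.1]))
```

One practical benefit of the split is that the intermediate list of sounds can be inspected on its own, so a mistake can be traced to the stage that made it.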

The Results

The researchers tested this on speech that mixes Vietnamese and English.

  • Old AI: an error rate of about 28%.
  • New TSPC AI: an error rate of only about 19%.
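
To put the two figures side by side, it helps to separate the drop in percentage points from the drop relative to where the old system started. The short calculation below uses only the rounded 28% and 19% quoted above, so its result is approximate; the function name is illustrative.

```python
def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction of the old system's errors that the new system removes."""
    return (old_rate - new_rate) / old_rate


old, new = 28.0, 19.0
print(f"absolute drop: {old - new:.0f} percentage points")     # 9 percentage points
print(f"relative drop: {relative_reduction(old, new):.1%}")     # 32.1%
```

In other words, going from about 28% to about 19% removes roughly a third of the errors the older system made.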

Even better, they did this with less computing power than the giants (like Whisper or Qwen) use. It's like building a high-speed train that runs on a bicycle battery.

The Big Takeaway

The paper shows that when languages get mixed up, it's better to stop trying to guess the words immediately. Instead, translate the sounds first, organize them using the rules of the local language (Vietnamese tones), and then figure out the words. It's a smarter, more efficient way to handle the messy reality of how humans actually speak.