The Big Idea: Losing the "Music" When Turning Speech into Text
Imagine you have a beautiful, complex piece of music (like a song with a singer's voice). You want to turn this music into a simple list of musical notes (like a sheet of music) so a computer can read it and play it back later.
The researchers in this paper discovered a problem: When we turn human speech into a simple list of "digital notes" (called Discrete Speech Units), we lose the melody.
In languages like Mandarin and Yorùbá, the "melody" of a word is just as important as the words themselves. Changing the pitch (tone) changes the meaning entirely.
- In Mandarin, saying "ma" with a high pitch means "mother," but with a dipping pitch, it means "horse."
- In Yorùbá, the pitch tells you if you are talking about a "head" or a "fire."
The paper asks: Why do current computer methods keep the "words" (consonants and vowels) perfectly, but drop the "melody" (tones) when they try to digitize speech?
The Analogy: The "Heavy Backpack" vs. The "Light Feather"
To understand why this happens, imagine the computer's brain (the AI model) is holding a backpack full of information about a spoken sound.
The Backpack Contents: Inside are two types of items:
- Heavy Rocks: These represent the phonetics (the actual sounds like "b," "a," "t"). They are big, loud, and take up most of the space.
- Light Feathers: These represent the tones (the pitch changes). They are delicate, subtle, and very light.
The Problem (Quantization):
The computer needs to shrink this backpack to fit it into a small box (a digital codebook) to send it over the internet. It uses a method called K-means clustering.
- Think of this method as a magnet. The magnet is very strong and grabs the Heavy Rocks (the phonetics) first because they are big and obvious.
- Because the box is small, the magnet ignores the Light Feathers (the tones) to make room for the rocks.
- Result: The computer remembers what was said perfectly, but it forgets how it was said (the tone).
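The "magnet" behavior above can be demonstrated in a few lines. This is my own toy sketch in NumPy, not the paper's actual features or codebook: dimension 0 is a high-variance "phonetic" axis (the rocks), dimension 1 a low-variance "tone" axis (the feathers), and a 2-code k-means keeps the first while erasing the second.

```python
# Toy sketch (not the paper's setup): k-means quantization of 2-D "speech
# features" where dim 0 is loud phonetics and dim 1 is quiet tone.
import numpy as np

rng = np.random.default_rng(0)
phonetic = rng.choice([0.0, 10.0], size=1000)   # two phonetic classes, far apart
tone = rng.choice([0.0, 0.3], size=1000)        # two tones, close together
X = np.stack([phonetic, tone], axis=1) + rng.normal(0, [0.3, 0.02], (1000, 2))

def kmeans(X, k, iters=30, seed=1):
    """Plain Lloyd's k-means with a fixed number of codes."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(2).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(0)
    return C, labels

# With only 2 codes, the split follows the loud phonetic axis and both
# tones collapse into the same code: the word survives, the tone doesn't.
C, labels = kmeans(X, k=2)
recon = C[labels]
phon_loss = ((X - recon)[:, 0] ** 2).mean() / X[:, 0].var()
tone_loss = ((X - recon)[:, 1] ** 2).mean() / X[:, 1].var()
print(f"relative phonetic error: {phon_loss:.2f}")  # near 0 = preserved
print(f"relative tone error:     {tone_loss:.2f}")  # near 1 = erased
```

Because the two codes have to "pay" for their reconstruction error in squared distance, spending them on the phonetic axis buys far more than spending them on tone, so the tone dimension gets no code of its own.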
The Experiments: Trying Different Packing Methods
The researchers tried different ways to pack the backpack to see if they could save the feathers without losing the rocks.
1. The Standard Method (K-means)
- The Attempt: Just throw everything in the box and let the magnet grab the biggest things.
- The Result: The rocks are safe, but the feathers are crushed or lost. The computer gets the word right but the tone wrong.
2. The "Neural" Method (Neural Vector Quantization)
- The Attempt: Use a smarter, trained robot to pack the box. The robot tries to reconstruct the whole sound perfectly.
- The Result: It helps a little bit, but the robot still gets distracted by the heavy rocks. It's better than the magnet, but it still struggles to keep the delicate feathers safe.
3. The "Residual" Method (The Two-Step Solution)
This was the paper's big breakthrough. They realized they needed to separate the rocks from the feathers before packing.
- Step 1 (The First Pass): Pack the Heavy Rocks (phonetics) into a box first. This creates a "phonetic code."
- Step 2 (The Residual): Look at what is left over. Since the rocks are gone, what's left are mostly the Light Feathers (the tone).
- Step 3 (The Second Pass): Pack these remaining feathers into a second box.
- The Result: Because the feathers aren't competing with the heavy rocks anymore, they get packed much more carefully. The computer now remembers both the word and the tone.
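The two-step idea can be sketched end to end. This is a toy illustration of residual quantization under my own assumptions (plain k-means at each stage, synthetic 2-D features), not the paper's actual models: stage 1 captures the dominant phonetic classes, and in the leftover residual the tone is suddenly the loudest thing remaining.

```python
# Sketch of two-stage residual quantization: pass 1 codes the rocks,
# pass 2 codes what's left over -- which is mostly the feathers.
import numpy as np

rng = np.random.default_rng(0)
phonetic = rng.choice([0.0, 5.0, 10.0], size=1000)   # three phonetic classes
tone = rng.choice([-0.3, 0.0, 0.3], size=1000)       # three level tones
X = np.stack([phonetic, tone], axis=1) + rng.normal(0, [0.1, 0.02], (1000, 2))

def train_codebook(X, k, iters=30):
    """Plain k-means, initialized along the widest axis for stability."""
    d = X.var(axis=0).argmax()
    order = np.argsort(X[:, d])
    C = X[order[np.linspace(0, len(X) - 1, 2 * k + 1, dtype=int)[1::2]]].copy()
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(2).argmin(1)
        for j in range(k):
            if (labels == j).any():
                C[j] = X[labels == j].mean(0)
    return C

def quantize(X, C):
    return C[((X[:, None] - C[None]) ** 2).sum(2).argmin(1)]

stage1 = quantize(X, train_codebook(X, k=3))   # pass 1: the phonetic code
residual = X - stage1                          # leftover: mostly tone
stage2 = quantize(residual, train_codebook(residual, k=3))
recon = stage1 + stage2                        # pass 2 restores the tone

tone_err_one_pass = ((X - stage1)[:, 1] ** 2).mean() / X[:, 1].var()
tone_err_two_pass = ((X - recon)[:, 1] ** 2).mean() / X[:, 1].var()
print(f"tone error, one pass:   {tone_err_one_pass:.2f}")  # near 1 = lost
print(f"tone error, two passes: {tone_err_two_pass:.2f}")  # near 0 = kept
```

The key design point is that the second codebook never competes with the phonetic variance: subtracting the stage-1 reconstruction removes the rocks, so the second pass's error budget is spent almost entirely on tone.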
Why Mandarin and Yorùbá Were Different
The researchers tested this on two very different languages:
- Mandarin: The tones are like slides (they go up, down, and curve). They are complex and change quickly.
- Best Solution: A "multi-layered" packing system (like stacking boxes inside boxes) worked best here.
- Yorùbá: The tones are like flat steps (high, mid, low). They are stable and stay on the vowel.
- Best Solution: The "Two-Step" residual method worked best here because the tones are stable and easy to separate from the vowel sounds.
The Takeaway for the Future
The Problem: Current AI tools are great at turning speech into text, but they are terrible at keeping the "music" (prosody and tone) alive. This is a huge problem for:
- Text-to-Speech: Robots sounding robotic or saying the wrong word because they got the tone wrong.
- Translation: Translating a sentence and accidentally changing the meaning because the tone was lost.
The Solution: We need to build new "packing methods" that are tone-aware. Instead of just looking at the big rocks (sounds), the AI needs to be taught to gently handle the light feathers (tones) separately.
In short: If we want computers to speak human languages naturally, especially in tonal languages, we have to stop treating speech like a simple list of letters and start treating it like a song where the melody matters just as much as the lyrics.