Rethinking Discrete Speech Representation Tokens for Accent Generation

This paper presents the first systematic investigation into how accent information is encoded in Discrete Speech Representation Tokens (DSRTs), introducing a unified evaluation framework that reveals layer selection is the most critical factor for retaining accents, while ASR supervision significantly diminishes them and naive codebook reduction fails to disentangle accent from phonetic and speaker information.

Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell

Published Wed, 11 Ma

Imagine you are trying to send a voice message to a friend, but you want the message to sound exactly like you (your voice) but with a specific accent, like a Scottish brogue or a Southern US twang.

In the world of AI speech, there's a popular tool called Discrete Speech Representation Tokens (DSRTs). Think of these tokens as a "digital shorthand" or a "compressed zip file" of a voice. Instead of sending the whole raw audio wave (which is huge), the AI breaks the voice down into a list of numbers (tokens) that represent the sounds.

For a long time, researchers thought these tokens were great at capturing what was said (the words) and who said it (the voice). But they largely ignored how it was said: the accent. This paper asks: if we want to generate or change accents, do these digital tokens actually hold onto that accent information, or does it get lost in the compression?

Here is the breakdown of their findings using some simple analogies:

1. The "Layer Cake" Problem

Imagine the AI model that creates these tokens is a giant, multi-layered cake.

  • The Bottom Layers: These are like the raw ingredients. They hold the basic, crunchy details of the sound (like the exact pitch or the "crunch" of a consonant).
  • The Middle Layers: This is where the "flavor" of the accent lives. The researchers found that if you want to keep an accent, you need to look at the middle layers of the cake.
  • The Top Layers: These are like the frosting. They are very good at identifying the words being spoken (phonetics) and the identity of the speaker, but they have mostly stripped away the specific "flavor" of the accent.

The Discovery: Most current AI systems use the "top layers" (the frosting) to create speech tokens. The authors found that by the time the AI gets to the top, the accent information has mostly evaporated. It's like trying to bake a strawberry cake using only the vanilla frosting; the strawberry flavor is gone.
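The layer-by-layer comparison can be sketched with a toy probing experiment. This is a purely illustrative numpy simulation, not the paper's actual setup: we fake per-layer features whose accent separability peaks in the middle (mirroring the paper's qualitative finding), then ask a simple nearest-centroid probe which "layer" classifies accent best.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer_features(separation, n_per_accent=200, dim=16):
    """Synthesize features for two accents whose class means sit
    `separation` apart: a stand-in for how much accent info a layer keeps."""
    a = rng.normal(0.0, 1.0, (n_per_accent, dim))
    b = rng.normal(0.0, 1.0, (n_per_accent, dim)) + separation / np.sqrt(dim)
    X = np.vstack([a, b])
    y = np.array([0] * n_per_accent + [1] * n_per_accent)
    return X, y

def probe_accuracy(X, y):
    """Nearest-class-centroid probe: a crude proxy for a linear probe."""
    centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    pred = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(axis=1)
    return (pred == y).mean()

# Hypothetical separability profile across six "layers": weak at the
# bottom, peaking in the middle, fading near the top.
separations = [0.5, 1.0, 3.0, 3.5, 1.0, 0.5]
accs = [probe_accuracy(*make_layer_features(s)) for s in separations]
best_layer = int(np.argmax(accs))
print(best_layer, [round(a, 2) for a in accs])
```

Running this, the probe peaks on the middle "layers": exactly the pattern the authors report when probing real SSL encoders for accent.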

2. The "ASR Supervision" Trap

Some researchers tried to make these tokens better by training them to be really good at transcribing speech (turning voice into text). This is called "ASR supervision."

Think of this like hiring a strict editor who only cares about spelling and grammar. If you ask this editor to summarize a story, they will give you the perfect plot (the words) and the perfect character names (the speaker), but they will ruthlessly delete all the regional slang and dialect because it's "not standard."

  • The Result: The paper found that when you train tokens to be perfect at reading text, they accidentally delete the accent information. The accent gets "edited out" because it's not needed for reading.

3. The "Magic Squeeze" Myth

Some previous studies claimed that if you just make the "zip file" smaller (reducing the codebook size), the AI would magically separate the "words" from the "accent" and "voice." They thought, "If we squeeze the data hard enough, the accent will pop out on its own."

The Reality Check: The authors tested this and found it doesn't work.

  • The Analogy: Imagine you have a smoothie with strawberries (accent), bananas (words), and yogurt (voice). If you try to squeeze the smoothie through a tiny hole to separate the ingredients, you don't get pure strawberries and pure bananas. You just get a smaller, messier smoothie where everything is less distinct.
  • The Finding: Simply making the token list smaller doesn't cleanly separate the accent from the words. It just makes the whole thing worse.
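The "squeeze" can be made concrete with a toy k-means codebook, since k-means over encoder features is the standard way DSRT codebooks are built. All numbers below are illustrative, not from the paper: two accents realize the "same" phone with a small, overlapping shift, and we shrink the codebook to see whether accent pops out on its own.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=100):
    """Plain k-means: the usual way a DSRT codebook is learned."""
    centroids = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = X[assign == c].mean(0)
    return centroids, assign

# Toy frame features: two accents shift the same phone slightly.
accent_a = rng.normal([0.0, 0.0], 0.3, (300, 2))
accent_b = rng.normal([0.9, 0.0], 0.3, (300, 2))
X = np.vstack([accent_a, accent_b])
labels = np.repeat([0, 1], 300)

def distortion(X, cent, assign):
    """How blurry the tokens are: mean squared quantization error."""
    return ((X - cent[assign]) ** 2).sum(1).mean()

def purity(assign, labels, k):
    """How accent-pure each token is, averaged over frames."""
    return sum(max((labels[assign == c] == 0).sum(),
                   (labels[assign == c] == 1).sum())
               for c in range(k)) / len(labels)

results = {}
for k in (32, 2):
    cent, assign = kmeans(X, k)
    results[k] = (distortion(X, cent, assign), purity(assign, labels, k))
print(results)
```

Shrinking the codebook from 32 to 2 entries drives distortion up (the smoothie gets messier) without making the tokens accent-pure: the overlapping frames stay mixed. Compression alone is not disentanglement.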

4. The Solution: A New Recipe

Since the old recipes (using top layers or small zip files) were losing the accent, the authors proposed a new way to cook:

  • For keeping the accent (Accent-Preserving VC): Don't use the top layers. Use the middle layers of the AI cake where the accent flavor is still strong.
  • For changing the accent (Accent-Adaptive VC): Use a mix of tokens that keeps the words clear but allows the AI to swap the "accent layer" with a new one.
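In code terms, the two recipes differ mainly in which layer's tokens drive the synthesizer. The sketch below is hypothetical: the layer indices, function name, and token sequences are illustrative placeholders, not the paper's actual configuration.

```python
# Hypothetical indices for a 12-layer encoder; the right choices depend
# on the specific model being probed.
MIDDLE_LAYER = 6   # accent-rich "middle of the cake"
TOP_LAYER = 11     # accent-poor, phonetics-rich "frosting"

def select_tokens(layer_tokens, mode):
    """Pick which layer's discrete tokens condition the voice converter."""
    if mode == "accent_preserving":
        # Keep the source accent: feed middle-layer tokens downstream.
        return layer_tokens[MIDDLE_LAYER]
    if mode == "accent_adaptive":
        # Drop the source accent: top-layer tokens keep the words but not
        # the accent, so a separate accent embedding can supply the target.
        return layer_tokens[TOP_LAYER]
    raise ValueError(f"unknown mode: {mode}")

# Toy usage with placeholder token sequences per layer.
layer_tokens = {MIDDLE_LAYER: [3, 1, 4, 1], TOP_LAYER: [5, 9, 2, 6]}
preserved = select_tokens(layer_tokens, "accent_preserving")
adaptive = select_tokens(layer_tokens, "accent_adaptive")
print(preserved, adaptive)
```

The design choice this illustrates: accent preservation and accent conversion want different token sources, so the selection has to be explicit rather than defaulting to the top layer as most systems do.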

They tested this new recipe by asking people to listen to the results. The new method sounded much more like the intended accent and much less like the AI was just "guessing" the accent (which is called "hallucinating" an accent).

The Big Takeaway

If you want an AI to speak with a specific accent, you can't just use the standard "smart" layers of the AI that are good at reading text. You have to dig deeper into the middle layers where the "regional flavor" is still stored, and you can't rely on shrinking the data size to magically separate the accent from the words.

In short: To get a good accent, you need to stop looking at the "frosting" (the high-level meaning) and start tasting the "middle layers" (the specific sound patterns) of the AI cake.