Rethinking Discrete Speech Representation Tokens for Accent Generation

This paper presents the first systematic investigation into how accent information is encoded in Discrete Speech Representation Tokens (DSRTs), introducing a unified evaluation framework that reveals layer selection is the most critical factor for retaining accents, while ASR supervision significantly diminishes them and naive codebook reduction fails to disentangle accent from phonetic and speaker information.

Jinzuomu Zhong, Yi Wang, Korin Richmond, Peter Bell

Published Wed, 11 Ma

Imagine you are trying to send a voice message to a friend, but you want the message to sound exactly like you (your voice) but with a specific accent, like a Scottish brogue or a Southern US twang.

In the world of AI speech, there's a popular tool called Discrete Speech Representation Tokens (DSRTs). Think of these tokens as a "digital shorthand" or a "compressed zip file" of a voice. Instead of sending the whole raw audio wave (which is huge), the AI breaks the voice down into a list of numbers (tokens) that represent the sounds.

For a long time, researchers thought these tokens were great at capturing what was said (the words) and who said it (the voice). But they largely ignored how it was said: the accent. This paper asks: if we want to generate or change accents, do these digital tokens actually hold onto that accent information, or does it get lost in the compression?

Here is the breakdown of their findings using some simple analogies:

1. The "Layer Cake" Problem

Imagine the AI model that creates these tokens is a giant, multi-layered cake.

  • The Bottom Layers: These are like the raw ingredients. They hold the basic, crunchy details of the sound (like the exact pitch or the "crunch" of a consonant).
  • The Middle Layers: This is where the "flavor" of the accent lives. The researchers found that if you want to keep an accent, you need to look at the middle layers of the cake.
  • The Top Layers: These are like the frosting. They are very good at identifying the words being spoken (phonetics) and the identity of the speaker, but they have mostly stripped away the specific "flavor" of the accent.

The Discovery: Most current AI systems use the "top layers" (the frosting) to create speech tokens. The authors found that by the time the AI gets to the top, the accent information has mostly evaporated. It's like trying to bake a strawberry cake using only the vanilla frosting; the strawberry flavor is gone.
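The layer-by-layer comparison can be sketched with a toy probing experiment. This is a purely illustrative numpy simulation, not the paper's actual setup: we fake per-layer features whose accent separability peaks in the middle (mirroring the paper's qualitative finding), then ask a simple nearest-centroid probe which "layer" classifies accent best.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer_features(separation, n_per_accent=200, dim=16):
    """Synthesize features for two accents whose class means sit
    `separation` apart: a stand-in for how much accent info a layer keeps."""
    a = rng.normal(0.0, 1.0, (n_per_accent, dim))
    b = rng.normal(0.0, 1.0, (n_per_accent, dim)) + separation / np.sqrt(dim)
    X = np.vstack([a, b])
    y = np.array([0] * n_per_accent + [1] * n_per_accent)
    return X, y

def probe_accuracy(X, y):
    """Nearest-class-centroid probe: a crude proxy for a linear probe."""
    centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    pred = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(axis=1)
    return (pred == y).mean()

# Hypothetical separability profile across six "layers": weak at the
# bottom, peaking in the middle, fading near the top.
separations = [0.5, 1.0, 3.0, 3.5, 1.0, 0.5]
accs = [probe_accuracy(*make_layer_features(s)) for s in separations]
best_layer = int(np.argmax(accs))
print(best_layer, [round(a, 2) for a in accs])
```

Running this, the probe peaks on the middle "layers": exactly the pattern the authors report when probing real SSL encoders for accent.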

2. The "ASR Supervision" Trap

Some researchers tried to make these tokens better by training them to be really good at transcribing speech (turning voice into text). This is called "ASR supervision."

Think of this like hiring a strict editor who only cares about spelling and grammar. If you ask this editor to summarize a story, they will give you the perfect plot (the words) and the perfect character names (the speaker), but they will ruthlessly delete all the regional slang and dialect because it's "not standard."

  • The Result: The paper found that when you train tokens to be perfect at reading text, they accidentally delete the accent information. The accent gets "edited out" because it's not needed for reading.

3. The "Magic Squeeze" Myth

Some previous studies claimed that if you just make the "zip file" smaller (reducing the codebook size), the AI would magically separate the "words" from the "accent" and "voice." They thought, "If we squeeze the data hard enough, the accent will pop out on its own."

The Reality Check: The authors tested this and found it doesn't work.

  • The Analogy: Imagine you have a smoothie with strawberries (accent), bananas (words), and yogurt (voice). If you try to squeeze the smoothie through a tiny hole to separate the ingredients, you don't get pure strawberries and pure bananas. You just get a smaller, messier smoothie where everything is less distinct.
  • The Finding: Simply making the token list smaller doesn't cleanly separate the accent from the words. It just makes the whole thing worse.
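The "squeeze" can be made concrete with a toy k-means codebook, since k-means over encoder features is the standard way DSRT codebooks are built. All numbers below are illustrative, not from the paper: two accents realize the "same" phone with a small, overlapping shift, and we shrink the codebook to see whether accent pops out on its own.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=100):
    """Plain k-means: the usual way a DSRT codebook is learned."""
    centroids = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        assign = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = X[assign == c].mean(0)
    return centroids, assign

# Toy frame features: two accents shift the same phone slightly.
accent_a = rng.normal([0.0, 0.0], 0.3, (300, 2))
accent_b = rng.normal([0.9, 0.0], 0.3, (300, 2))
X = np.vstack([accent_a, accent_b])
labels = np.repeat([0, 1], 300)

def distortion(X, cent, assign):
    """How blurry the tokens are: mean squared quantization error."""
    return ((X - cent[assign]) ** 2).sum(1).mean()

def purity(assign, labels, k):
    """How accent-pure each token is, averaged over frames."""
    return sum(max((labels[assign == c] == 0).sum(),
                   (labels[assign == c] == 1).sum())
               for c in range(k)) / len(labels)

results = {}
for k in (32, 2):
    cent, assign = kmeans(X, k)
    results[k] = (distortion(X, cent, assign), purity(assign, labels, k))
print(results)
```

Shrinking the codebook from 32 to 2 entries drives distortion up (the smoothie gets messier) without making the tokens accent-pure: the overlapping frames stay mixed. Compression alone is not disentanglement.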

4. The Solution: A New Recipe

Since the old recipes (using top layers or small zip files) were losing the accent, the authors proposed a new way to cook:

  • For keeping the accent (Accent-Preserving VC): Don't use the top layers. Use the middle layers of the AI cake where the accent flavor is still strong.
  • For changing the accent (Accent-Adaptive VC): Use a mix of tokens that keeps the words clear but allows the AI to swap the "accent layer" with a new one.
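In code terms, the two recipes differ mainly in which layer's tokens drive the synthesizer. The sketch below is hypothetical: the layer indices, function name, and token sequences are illustrative placeholders, not the paper's actual configuration.

```python
# Hypothetical indices for a 12-layer encoder; the right choices depend
# on the specific model being probed.
MIDDLE_LAYER = 6   # accent-rich "middle of the cake"
TOP_LAYER = 11     # accent-poor, phonetics-rich "frosting"

def select_tokens(layer_tokens, mode):
    """Pick which layer's discrete tokens condition the voice converter."""
    if mode == "accent_preserving":
        # Keep the source accent: feed middle-layer tokens downstream.
        return layer_tokens[MIDDLE_LAYER]
    if mode == "accent_adaptive":
        # Drop the source accent: top-layer tokens keep the words but not
        # the accent, so a separate accent embedding can supply the target.
        return layer_tokens[TOP_LAYER]
    raise ValueError(f"unknown mode: {mode}")

# Toy usage with placeholder token sequences per layer.
layer_tokens = {MIDDLE_LAYER: [3, 1, 4, 1], TOP_LAYER: [5, 9, 2, 6]}
preserved = select_tokens(layer_tokens, "accent_preserving")
adaptive = select_tokens(layer_tokens, "accent_adaptive")
print(preserved, adaptive)
```

The design choice this illustrates: accent preservation and accent conversion want different token sources, so the selection has to be explicit rather than defaulting to the top layer as most systems do.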

They tested this new recipe by asking people to listen to the results. The new method sounded much more like the intended accent and much less like the AI was just "guessing" the accent (which is called "hallucinating" an accent).

The Big Takeaway

If you want an AI to speak with a specific accent, you can't just use the standard "smart" layers of the AI that are good at reading text. You have to dig deeper into the middle layers where the "regional flavor" is still stored, and you can't rely on shrinking the data size to magically separate the accent from the words.

In short: To get a good accent, you need to stop looking at the "frosting" (the high-level meaning) and start tasting the "middle layers" (the specific sound patterns) of the AI cake.