Controllable Accent Normalization via Discrete Diffusion

The paper proposes DLM-AN, a controllable accent normalization system built on discrete diffusion over self-supervised speech tokens. Selectively reusing source tokens makes accent retention tunable, and the system automatically adjusts duration to match native rhythm. The authors report state-of-the-art word error rates along with smooth, interpretable control over how much accent is kept.

Qibing Bai, Yuhan Du, Tom Ko, Shuai Wang, Yannan Wang, Haizhou Li

Published 2026-03-17

Imagine you are listening to a friend tell a story. They are speaking English, but they have a strong accent from their home country. Sometimes, you want to understand them perfectly (so you need to remove the accent). Other times, you want to keep their unique voice and charm, just making it slightly clearer.

Until now, computer programs that fix accents were like a blunt hammer: they either smashed the accent completely away (making the speaker sound robotic and generic) or left it exactly as is. They couldn't "turn a dial" to find the perfect middle ground.

This paper introduces DLM-AN, a new system that acts like a smart, adjustable accent dial. Here is how it works, explained through simple analogies.

1. The Problem: The "All-or-Nothing" Hammer

Previous methods were like a photo editor that only had two buttons: "Make it Black and White" (remove all accent) or "Leave it Color" (keep the accent). If you wanted a slightly sepia-toned photo, you were out of luck. In language learning or movie dubbing, you often need that middle ground—keeping the speaker's identity but making the words clearer.

2. The Solution: The "Smart Reuse" Strategy

The core idea of DLM-AN is simple: Don't throw the whole sentence away; just fix the parts that are wrong.

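Think of a sentence as a string of Lego bricks. (In the system itself each brick is a speech token, a small unit of sound, rather than a whole word; the examples below use words to keep things simple.)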

  • The Good Bricks: Some words are pronounced perfectly, even with an accent (e.g., "Hello" or "My").
  • The Tricky Bricks: Some words are mangled by the accent (e.g., "Death" sounding like "Deeth" or "Had" sounding like "Had-uh").

The system uses a special tool called the Common Token Predictor (CTP). Imagine this tool as a quality inspector walking down the line of Lego bricks.

  • It looks at each brick.
  • If the brick is a "Good Brick" (pronounced like a native speaker), the inspector says, "Keep this one! It's perfect."
  • If the brick is a "Tricky Brick," the inspector says, "This one needs to be replaced."

3. The Magic Dial: Controlling the Accent

This is where the "Controllable" part comes in. The user gets a dial (a threshold setting) to decide how many bricks to keep.

  • Dial turned all the way to "Native": The inspector throws away every brick and builds the sentence from scratch using perfect native bricks. The accent is gone, but the speaker might sound a bit generic.
  • Dial turned all the way to "Original": The inspector keeps every brick, even the tricky ones. The accent remains 100% intact.
  • Dial in the Middle: The inspector keeps the "Good Bricks" and only replaces the "Tricky Bricks."
    • Result: The speaker still sounds like themselves (because their good pronunciation is preserved), but the difficult words are now clear. You can slide this dial anywhere to get exactly the amount of accent you want.
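The dial described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the inspector gives each token a "sounds native" score between 0 and 1, and that tokens scoring below the threshold are marked for replacement. The names `select_tokens`, `retention_rate`, and `MASK`, the scores, and the direction of the threshold are all made up for this example.

```python
MASK = -1  # hypothetical placeholder id for "this token must be regenerated"


def select_tokens(tokens, native_scores, threshold):
    """Keep tokens whose score reaches the threshold; mask the rest."""
    if len(tokens) != len(native_scores):
        raise ValueError("tokens and native_scores must have the same length")
    return [tok if score >= threshold else MASK
            for tok, score in zip(tokens, native_scores)]


def retention_rate(selected):
    """Fraction of positions that were kept from the source."""
    if not selected:
        return 0.0
    kept = sum(1 for tok in selected if tok != MASK)
    return kept / len(selected)


# Made-up token ids and inspector scores for a five-token utterance.
tokens = [17, 42, 8, 93, 5]
native_scores = [0.95, 0.20, 0.80, 0.10, 0.60]

# Low threshold keeps everything ("Original"); a threshold above 1 keeps
# nothing ("Native"); values in between keep only the confident tokens.
for threshold in (0.0, 0.5, 1.01):
    selected = select_tokens(tokens, native_scores, threshold)
    print(threshold, selected, retention_rate(selected))
```

With these made-up scores, a threshold of 0.5 keeps three of the five tokens and masks the two the inspector was least sure about.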

4. The Rhythm Coach: Fixing the Timing

Accents don't just change what words sound like; they change how long you hold them. A non-native speaker might drag out a vowel or rush through a sentence.

DLM-AN has a second tool called the Duration Ratio Predictor. Think of this as a conductor for the music of speech.

  • If the original speaker is rushing, the conductor slows them down to match a native rhythm.
  • If they are dragging, the conductor speeds them up.
  • This ensures that even if you keep the accent, the speech flows naturally and doesn't sound like a robot with a broken timer.
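To make the conductor idea concrete, here is a rough sketch of how a predicted duration ratio might be applied. The summary does not say how the ratio is used internally, so this assumes it simply scales the total number of speech tokens; `target_length` is a hypothetical helper, not part of the paper.

```python
def target_length(source_length, duration_ratio):
    """Scale a source length (in tokens) by a predicted duration ratio."""
    if source_length <= 0:
        raise ValueError("source_length must be positive")
    if duration_ratio <= 0:
        raise ValueError("duration_ratio must be positive")
    # Round to a whole number of tokens, never dropping below one.
    return max(1, round(source_length * duration_ratio))


# A rushed utterance gets stretched (ratio above 1),
# a dragging one gets shortened (ratio below 1).
rushed = target_length(80, 1.25)
dragging = target_length(120, 0.75)
print(rushed, dragging)
```

A ratio of exactly 1 leaves the length unchanged, which corresponds to a speaker whose timing already matches the native rhythm.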

5. How It Builds the Voice (The "Diffusion" Part)

The paper uses a technique called "Discrete Diffusion." Imagine you have a sentence written on a piece of paper, but some of the words are covered by black ink (masked). How many are covered depends on where you set the dial.

  • The AI looks at the visible words, which are the "Good Bricks" it decided to keep.
  • It then fills in the inked-out spots, guessing what the missing words should be based on the context.
  • It does this step-by-step, uncovering a few more words on each pass until none are hidden. Because the "Good Bricks" are in place from the start, the speaker's own pronunciation survives wherever it was already clear.
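The step-by-step filling can be sketched as a small loop. This is a toy under stated assumptions: the real system uses a trained neural network to guess tokens, whereas `toy_denoiser` below is a made-up stand-in that guesses `100 + position` and is more confident when a masked spot has visible neighbours. The rule "reveal the most confident guesses first" is a common choice in masked generation and is assumed here, not taken from the paper.

```python
MASK = -1  # hypothetical placeholder id for a hidden token


def toy_denoiser(sequence):
    """Stand-in for the trained model: a (token, confidence) guess per masked position."""
    guesses = {}
    for i, tok in enumerate(sequence):
        if tok == MASK:
            visible_neighbours = sum(
                1 for j in (i - 1, i + 1)
                if 0 <= j < len(sequence) and sequence[j] != MASK
            )
            guesses[i] = (100 + i, visible_neighbours / 2)
    return guesses


def iterative_unmask(sequence, num_steps):
    """Fill masked positions over several passes, most confident guesses first."""
    seq = list(sequence)
    for step in range(num_steps):
        guesses = toy_denoiser(seq)
        if not guesses:
            break
        steps_left = num_steps - step
        # Ceiling division: spread the remaining masks over the remaining steps.
        n_reveal = -(-len(guesses) // steps_left)
        ranked = sorted(guesses, key=lambda i: (-guesses[i][1], i))
        for i in ranked[:n_reveal]:
            seq[i] = guesses[i][0]
    return seq


# Positions 0, 2 and 5 are kept "Good Bricks"; the rest start out masked.
start = [17, MASK, 8, MASK, MASK, 5]
result = iterative_unmask(start, num_steps=2)
print(result)
```

The kept tokens are never overwritten, so whatever was reused from the original speech passes through the loop unchanged.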

Why This Matters

  • For Language Learners: You can practice speaking and ask the AI to "fix 50% of my accent." This helps you hear the difference between your version and the native version without losing your own voice identity.
  • For Movie Dubbing: Imagine a foreign actor in a movie. You might want their voice to sound like a native English speaker for clarity, but you still want to hear their unique emotional tone and rhythm. This system allows for that perfect blend.
  • Better Quality: The paper shows that this method makes fewer mistakes (lower "Word Error Rate") than previous systems because it doesn't try to rebuild the whole sentence from scratch; it just fixes the broken parts.

In short: DLM-AN is like a smart editor that knows which parts of your speech are already great and which parts need a little polish, letting you decide exactly how much of your original accent you want to keep.
