Controllable Accent Normalization via Discrete Diffusion

Imagine you are listening to a friend tell a story. They are speaking English, but they have a strong accent from their home country. Sometimes, you want to understand them perfectly (so you need to remove the accent). Other times, you want to keep their unique voice and charm, just making it slightly clearer.

Until now, computer programs that fix accents were like a blunt hammer: they either smashed the accent completely away (making the speaker sound robotic and generic) or left it exactly as is. They couldn't "turn a dial" to find the perfect middle ground.

This paper introduces DLM-AN, a new system that acts like a smart, adjustable accent dial. Here is how it works, explained through simple analogies.

1. The Problem: The "All-or-Nothing" Hammer

Previous methods were like a photo editor that only had two buttons: "Make it Black and White" (remove all accent) or "Leave it Color" (keep the accent). If you wanted a slightly sepia-toned photo, you were out of luck. In language learning or movie dubbing, you often need that middle ground: keeping the speaker's identity while making the words clearer.

2. The Solution: The "Smart Reuse" Strategy

The core idea of DLM-AN is simple: Don't throw the whole sentence away; just fix the parts that are wrong.

Think of a sentence as a string of Lego bricks.

The Good Bricks: Some words are pronounced perfectly, even with an accent (e.g., "Hello" or "My").
The Tricky Bricks: Some words are mangled by the accent (e.g., "Death" sounding like "Deeth" or "Had" sounding like "Had-uh").

The system uses a special tool called the Common Token Predictor (CTP). Imagine this tool as a quality inspector walking down the line of Lego bricks.

It looks at each brick.
If the brick is a "Good Brick" (pronounced like a native speaker), the inspector says, "Keep this one! It's perfect."
If the brick is a "Tricky Brick," the inspector says, "This one needs to be replaced."

3. The Magic Dial: Controlling the Accent

This is where the "Controllable" part comes in. The user gets a dial (a threshold setting) to decide how many bricks to keep.

Dial turned all the way to "Native": The inspector throws away every brick and builds the sentence from scratch using perfect native bricks. The accent is gone, but the speaker might sound a bit generic.
Dial turned all the way to "Original": The inspector keeps every brick, even the tricky ones. The accent remains 100% intact.
Dial in the Middle: The inspector keeps the "Good Bricks" and only replaces the "Tricky Bricks."
- Result: The speaker still sounds like themselves (because their good pronunciation is preserved), but the difficult words are now clear. You can slide this dial anywhere to get exactly the amount of accent you want.

4. The Rhythm Coach: Fixing the Timing

Accents don't just change what words sound like; they change how long you hold them. A non-native speaker might drag out a vowel or rush through a sentence.

DLM-AN has a second tool called the Duration Ratio Predictor. Think of this as a conductor for the music of speech.

If the original speaker is rushing, the conductor slows them down to match a native rhythm.
If they are dragging, the conductor speeds them up.
This ensures that even if you keep the accent, the speech flows naturally and doesn't sound like a robot with a broken timer.

5. How It Builds the Voice (The "Diffusion" Part)

The paper uses a technique called "Discrete Diffusion." Imagine you have a sentence written on a piece of paper, but half the words are covered by black ink (masked).

The AI looks at the visible words and the "Good Bricks" it decided to keep.
It then "paints over" the black ink, guessing what the missing words should be based on the context.
It does this step-by-step, refining the sentence until it's perfect. Because it starts with the "Good Bricks" already in place, it knows exactly where the speaker's identity lives.

Why This Matters

For Language Learners: You can practice speaking and ask the AI to "fix 50% of my accent." This helps you hear the difference between your version and the native version without losing your own voice identity.
For Movie Dubbing: Imagine a foreign actor in a movie. You might want their voice to sound like a native English speaker for clarity, but you still want to hear their unique emotional tone and rhythm. This system allows for that perfect blend.
Better Quality: The paper shows that this method makes fewer mistakes (lower "Word Error Rate") than previous systems because it doesn't try to rebuild the whole sentence from scratch; it just fixes the broken parts.

In short: DLM-AN is like a smart editor that knows which parts of your speech are already great and which parts need a little polish, letting you decide exactly how much of your original accent you want to keep.

1. Problem Statement

Accent Normalization (AN) aims to convert non-native (L2) accented speech into a native (L1) form while preserving the speaker's identity. While existing methods (reference-based, reference-free, and token-based) have improved conversion quality, they suffer from a critical limitation: lack of controllability over accent strength.

Most systems perform a "full" accent shift, offering no way to gradually reduce an accent or to retain some accent characteristics as part of the speaker's identity.
Existing attempts at controllability (e.g., manipulating diffusion timesteps in continuous models) often lack fine-grained rhythm adjustability and duration control.
There is a need for a system that allows users to smoothly trade off between full normalization and preserving the original accent, while also controlling total speech duration.

2. Methodology: DLM-AN

The authors propose DLM-AN, a controllable accent normalization system built on masked discrete diffusion over self-supervised speech tokens. The pipeline consists of four main stages:

A. Tokenization and Content Representation

SSL Tokenizer: Uses WavLM-large to extract self-supervised representations from the L2-accented waveform, which are then quantized into discrete tokens (1024 clusters).
Token Encoder: A Transformer encoder processes these tokens to produce continuous content representations. To ensure these representations are phonemically rich, a CTC-based phonemic guidance head is attached, predicting phoneme logits from transcripts.
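
The quantization step in the tokenizer can be pictured as a nearest-centroid lookup. The sketch below is only an illustration under assumptions: the summary says the features are quantized into 1024 clusters but does not name the clustering method, so squared Euclidean distance to fixed centroids (as in k-means) is assumed, and the names `quantize`, `centroids`, and `frames` are hypothetical. Three 2-D centroids stand in for 1024 clusters in WavLM feature space.

```python
def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Assumption: nearest-centroid assignment under squared Euclidean
    distance; the source does not specify the clustering method.
    """
    tokens = []
    for frame in frames:
        best_index = min(
            range(len(centroids)),
            key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, centroids[k])),
        )
        tokens.append(best_index)
    return tokens


# Toy data: 3 centroids in 2-D instead of 1024 clusters of SSL features.
centroids = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, 0.1), (0.9, 0.2), (0.1, 0.8), (0.2, 0.9)]
tokens = quantize(frames, centroids)
print(tokens)  # one discrete token id per frame
```

Each frame becomes a single integer, which is what lets the later stages treat speech as a token sequence.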

B. Common Token Predictor (CTP)

Concept: Under a phonetically rich tokenizer, native and accented versions of the same utterance share many tokens in similarly pronounced regions but differ in accent-affected regions.
Mechanism: The CTP is a sequence-tagging module trained to assign a confidence score to each source token, indicating the likelihood that it is shared with the native target.
Training: Labels are derived using the Longest Common Subsequence (LCS) between source and target token sequences.
Control Mechanism: During inference, users can set a threshold ($\tau$). Tokens with CTP confidence $> \tau$ are reused from the source to initialize the target sequence.
- High $\tau$ (e.g., 1.0): Few tokens reused $\rightarrow$ Full normalization (generate from scratch).
- Low $\tau$ (e.g., 0.0): All tokens reused $\rightarrow$ Near-resynthesis (preserve original accent).
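
The two halves of the CTP idea, LCS-derived training labels and threshold-based reuse at inference, can be sketched as follows. This is a minimal illustration, not the authors' code: `lcs_common_labels`, `init_with_reuse`, the `MASK` id, and the confidence values are all hypothetical, and the target is assumed to have the same length as the source (duration changes are ignored here).

```python
MASK = -1  # hypothetical id for a masked position


def lcs_common_labels(source, target):
    """Label each source token 1 if it lies on a longest common
    subsequence with the target, else 0."""
    n, m = len(source), len(target)
    # dp[i][j] = LCS length of source[i:] and target[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if source[i] == target[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    labels = [0] * n
    i = j = 0
    while i < n and j < m:
        if source[i] == target[j]:
            labels[i] = 1
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return labels


def init_with_reuse(source, confidences, tau):
    """Keep source tokens whose confidence exceeds tau; mask the rest."""
    return [tok if conf > tau else MASK for tok, conf in zip(source, confidences)]


source = [5, 7, 7, 2, 9, 4]          # accented token sequence (toy)
target = [5, 3, 2, 9, 8, 4]          # native token sequence (toy)
labels = lcs_common_labels(source, target)

confidences = [0.95, 0.20, 0.35, 0.90, 0.85, 0.60]  # made-up CTP outputs
for tau in (1.0, 0.5, 0.0):
    print(tau, init_with_reuse(source, confidences, tau))
```

With these toy values, `tau = 1.0` masks every position, `tau = 0.0` keeps the whole source sequence, and `tau = 0.5` keeps only the tokens the predictor is fairly sure are shared.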

C. Duration Ratio Predictor (DP)

Problem: L2 speech often has different rhythms and speaking rates. Simply preserving source duration can lead to unnatural prosody.
Solution: A flow-matching-based Duration Ratio Predictor estimates the ratio $r = \text{duration}_{\text{target}} / \text{duration}_{\text{source}}$.
Usage: This ratio determines the length of the target sequence, allowing for explicit duration scaling (compression or expansion).
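
As a rough sketch of how the ratio could set the target length: if tokens are produced at a fixed frame rate (an assumption), the duration ratio carries over directly to token counts. The function `target_length` and its `user_scale` parameter are hypothetical names; `user_scale` stands in for the explicit compression or expansion factor mentioned above.

```python
def target_length(source_length, ratio, user_scale=1.0, min_length=1):
    """Number of target tokens implied by a duration ratio.

    Assumes a fixed token frame rate, so duration and token count scale
    together. `user_scale` is a hypothetical extra factor for explicit
    compression (< 1) or expansion (> 1).
    """
    return max(min_length, round(source_length * ratio * user_scale))


# A 200-token source with predicted ratio r = 0.85.
print(target_length(200, 0.85))                  # predicted native-like length
print(target_length(200, 0.85, user_scale=0.5))  # explicit compression
print(target_length(200, 0.85, user_scale=1.5))  # explicit expansion
```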

D. Discrete Diffusion Decoder (DLM)

Architecture: Extends the LLaDA (Large Language Diffusion with mAsking) architecture. It uses a bidirectional Transformer (self-attention without causal masking) to iteratively predict masked tokens.
Process:
1. Initialization: The target sequence is initialized based on the CTP strategy (reusing high-confidence source tokens) and the DP strategy (setting sequence length). Unselected positions are masked.
2. Generation: The model iteratively predicts and unmasks tokens conditioned on the content features.
3. Sampling: Uses a greedy sampling algorithm with Classifier-Free Guidance (CFG) to enhance content preservation and accent reduction.
Synthesis: The final target token sequence is converted to a Mel-spectrogram via a flow-matching speech synthesizer and then to a waveform using the HiFT vocoder.
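
The initialization, generation, and sampling steps above can be sketched as a small decoding loop. Everything here is illustrative and rests on assumptions: the summary does not give the unmasking schedule or the exact guidance formula, so this sketch commits the most confident predictions first, spreads masked positions evenly over the steps, and uses one common CFG form, `uncond + scale * (cond - uncond)`. The names `greedy_unmask`, `cfg_combine`, `toy_predict_logits`, and `MASK` are hypothetical, and the toy predictor stands in for the Transformer.

```python
MASK = -1  # hypothetical id for a masked position


def cfg_combine(cond_logits, uncond_logits, guidance_scale):
    """One common classifier-free guidance form (assumed, not from the source)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def greedy_unmask(tokens, predict_logits, num_steps, guidance_scale=2.0):
    """Fill MASK positions over several steps, most confident predictions first."""
    tokens = list(tokens)
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Spread the remaining masked positions over the remaining steps.
        steps_left = num_steps - step
        quota = -(-len(masked) // steps_left)  # ceiling division
        candidates = []
        for i in masked:
            logits = cfg_combine(
                predict_logits(tokens, i, conditioned=True),
                predict_logits(tokens, i, conditioned=False),
                guidance_scale,
            )
            best = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
            candidates.append((logits[best], i, best))
        # Commit the highest-scoring predictions; the rest stay masked.
        candidates.sort(reverse=True)
        for _, i, best in candidates[:quota]:
            tokens[i] = best
    return tokens


# Toy stand-in for the model: the "content features" name the right token
# per position, and only the conditioned pass can see them.
content = [2, 0, 3, 1, 1]


def toy_predict_logits(tokens, position, conditioned):
    logits = [0.0] * 4  # vocabulary of 4 tokens
    if conditioned:
        logits[content[position]] = 1.0 + 0.1 * position
    return logits


start = [2, MASK, MASK, 1, MASK]  # positions 0 and 3 were reused from the source
result = greedy_unmask(start, toy_predict_logits, num_steps=2)
print(result)
```

Reused source tokens are never touched by the loop, which is why the amount of reuse chosen at initialization directly limits how much of the sequence the decoder is free to rewrite.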

3. Key Contributions

First Discrete Diffusion AN System: Proposes the first accent normalization system based on masked discrete diffusion, enabling iterative token generation conditioned on phonemically guided content.
Interpretable Accent Strength Control: Introduces the Common Token Predictor (CTP), which provides a simple, threshold-based mechanism to control accent strength. Reusing more source tokens preserves more of the original accent, offering a smooth, continuous control knob.
Robust Duration Control: Incorporates a flow-matching Duration Ratio Predictor for automatic and explicit adjustment of total output duration to match native rhythms.
State-of-the-Art Performance: Demonstrates superior content preservation (lowest Word Error Rate) compared to autoregressive and continuous diffusion baselines, while maintaining competitive naturalness and accent reduction.

4. Experimental Results

Experiments were conducted on multi-accent English data (Arabic, Chinese, Hindi, Korean, Spanish, Vietnamese) using the L2-ARCTIC and Emilia datasets.

Content Preservation (WER): DLM-AN achieved the lowest Word Error Rate (10.64% to 11.19%) among the compared systems (TokAN, CosyAccent), indicating better intelligibility and content retention.
Accent Reduction: DLM-AN achieved the lowest accentedness scores (ACT) in free-duration settings (22.94), indicating strong normalization capabilities.
Controllability:
- CTP Efficacy: CTP-based token reuse outperformed random token reuse, yielding lower WER and better accent separation.
- Smooth Control: Varying the CTP threshold ($\tau$) from 1.0 to 0.0 resulted in a monotonic increase in accentedness (ACT) and speaker similarity (SIM), indicating that the system can transition smoothly from full normalization to accent preservation.
Duration Scaling: DLM-AN maintained the lowest WER across various duration scaling ratios (0.5x to 1.5x). It showed particular robustness in compression scenarios where other models (like TokAN) degraded due to token sequence length mismatches.
Ablation Studies: Confirmed that both Classifier-Free Guidance (CFG) and pretraining on native data are critical for achieving low WER and high naturalness.
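
For readers unfamiliar with the content-preservation metric reported above, Word Error Rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a plain textbook implementation, not the evaluation code used in the paper; in practice the hypothesis usually comes from an ASR transcript of the converted audio, which is an assumption here since the summary does not say. The example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(substitution, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return prev[-1] / len(ref)


# One deletion ("a") and one substitution ("idea" -> "ideas") over 5 words.
wer = word_error_rate("he had a good idea", "he had good ideas")
print(f"{wer:.0%}")
```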

5. Significance

This work bridges the gap between high-fidelity speech conversion and user-controllable parameters.

Application Impact: It enables practical applications such as language learning (where learners need to hear their own accent reduced gradually) and dubbing (where specific accent retention is required for character identity).
Methodological Advance: It adapts discrete diffusion language models (LLaDA) to the speech domain, suggesting that token-based diffusion, with its ELBO-based training objective and iterative refinement, is a strong alternative to autoregressive or continuous diffusion approaches.
Interpretability: The use of token reuse as a control mechanism makes the "accent strength" knob highly interpretable, unlike black-box parameter adjustments in other diffusion models.