SyncSpeech: Efficient and Low-Latency Text-to-Speech based on Temporal Masked Transformer

SyncSpeech introduces a Temporal Masked Transformer paradigm that unifies the ordered generation of autoregressive models with the parallel efficiency of non-autoregressive models, achieving high-quality, low-latency text-to-speech synthesis with significantly reduced first-packet latency and real-time factor.

Zhengyan Sheng, Zhihao Du, Shiliang Zhang, Zhijie Yan, Liping Chen

Published 2026-03-17

Imagine you are trying to translate a book into a spoken audiobook. For a long time, computers had two main ways to do this, and both had a major flaw:

  1. The "One-Word-at-a-Time" Robot (Autoregressive): This robot reads a word, thinks hard, says it, then reads the next word, thinks, and says it. It sounds very natural and human, but it's slow. It's like a snail that refuses to take its next step until it has fully finished the current one. If you want the audiobook to start now, too bad: the speech arrives only as fast as the robot can crawl through the sentence, one word at a time.
  2. The "Whole-Page" Robot (Non-Autoregressive): This robot looks at the whole page, guesses what the whole sentence sounds like, and spits it out all at once. It's fast, but it's clumsy. It can't start speaking until it has the entire page in front of it. If you are reading a live news feed and the text is streaming in, this robot just sits there waiting, creating a long, awkward silence before it finally speaks.
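The two decoding styles above can be sketched as toy loops. Everything here is hypothetical scaffolding (`predict_next` and `predict_one_shot` are dummy stand-ins for real neural networks); the point is only the *shape* of the computation: many sequential calls versus one parallel pass over the whole input.

```python
def predict_next(prefix):
    # Hypothetical AR step: the next token depends on everything so far.
    return len(prefix)  # dummy rule: token value = its position

def ar_generate(n_tokens):
    """Autoregressive: one token per step, n_tokens sequential model calls."""
    tokens = []
    for _ in range(n_tokens):
        tokens.append(predict_next(tokens))
    return tokens

def predict_one_shot(position):
    # Hypothetical NAR decoder output for one position.
    return position  # dummy rule, same result as the AR loop

def nar_generate(n_tokens):
    """Non-autoregressive: all tokens in a single parallel pass,
    but only after the *entire* input is available."""
    return [predict_one_shot(i) for i in range(n_tokens)]

print(ar_generate(5))   # produced by 5 sequential calls
print(nar_generate(5))  # produced by 1 parallel pass
```

The trade-off the article describes lives entirely in those two loops: the first is slow but can emit tokens as it goes; the second is fast but needs the full input up front.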

Enter SyncSpeech: The "Conductor" Robot.

The paper introduces a new model called SyncSpeech. Think of it as a brilliant orchestra conductor who has learned to do the best of both worlds.

The Big Idea: The "Temporal Mask"

The secret sauce of SyncSpeech is something they call the Temporal Masked Transformer (TMT).

Imagine you are reading a script to a group of actors.

  • Old Way: You read one line, wait for the actor to say it, then read the next.
  • SyncSpeech Way: You read a line, and immediately you tell the actors, "Okay, for this line, you need to speak for exactly 3 seconds, and here are the 5 notes you need to hit." You don't wait for them to finish the line before you start planning the next one.

The "Temporal Mask" is like a special pair of glasses the computer wears. It looks at the text and says, "I know the next word is coming, but I'm going to predict how long that word will take to say and generate all the sounds for it right now, while I'm still reading the next word."
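A minimal sketch of that idea, with every name hypothetical (`predict_duration` and `fill_speech_tokens` stand in for the model's real duration predictor and masked decoder): for each incoming text token, predict how many speech tokens it needs, then fill all of them in one parallel step instead of one at a time.

```python
def predict_duration(word):
    # Hypothetical duration model: longer words get more speech tokens.
    return max(2, len(word) // 2)

def fill_speech_tokens(word, n):
    # Hypothetical parallel decoder: emits all n tokens for `word` at once.
    return [f"{word}#{i}" for i in range(n)]

def temporal_mask_step(word):
    n = predict_duration(word)          # "this line takes n units of speech"
    return fill_speech_tokens(word, n)  # all n tokens, one model call

# Text streams in word by word; speech tokens come out in whole groups.
speech = []
for word in ["hello", "world"]:
    speech.extend(temporal_mask_step(word))
print(speech)
```

The key design choice mirrored here: the duration prediction is what lets the model commit to a block of speech for the *current* word without waiting to see the rest of the sentence.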

How It Works (The Analogy)

1. The "Look-Ahead" Trick (Streaming)
Imagine you are reading a text message that is being typed out in real-time.

  • The Problem: If you wait for the whole sentence, the conversation feels dead.
  • The SyncSpeech Solution: As soon as you see the first two words of the message, SyncSpeech starts talking. It doesn't wait for the period at the end. It predicts the rhythm and the sounds for the current word while the next word is still being typed. It's like a jazz musician who can start playing the next chord before the current one has fully faded out.
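The streaming behaviour can be sketched as a small look-ahead buffer (the buffer size of two words here is an illustrative assumption, not the paper's exact setting): a word is spoken as soon as a little context after it has arrived, rather than after the final period.

```python
LOOKAHEAD = 2  # assumed look-ahead of two text tokens

def stream_tts(incoming_words):
    """Speak each word once LOOKAHEAD words of context have arrived after it."""
    buffer, spoken = [], []
    for word in incoming_words:      # words arrive one at a time, like a live feed
        buffer.append(word)
        if len(buffer) > LOOKAHEAD:  # enough context: speak the oldest word now
            spoken.append(buffer.pop(0))
    spoken.extend(buffer)            # flush the tail once the stream ends
    return spoken

print(stream_tts(["the", "quick", "brown", "fox"]))
```

Notice that "the" is spoken as soon as "brown" arrives, long before "fox" is typed; a whole-page model would have stayed silent until the end.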

2. The "Batch" Generation (Efficiency)
In the old "One-Word-at-a-Time" method, the computer has to run a full model calculation for every single sound unit (roughly 50 of them for one second of speech).

  • SyncSpeech groups them up. It looks at a text word and says, "I need to generate 10 sound units for this word." It does all 10 calculations at the same time (in parallel).
  • Analogy: Imagine a bakery.
    • Old Robot: Makes one cookie, puts it on a tray, makes another, puts it on a tray.
    • SyncSpeech: Looks at the order for "10 cookies," and puts 10 cookies on the tray simultaneously. It's much faster, but the cookies still taste perfect.
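The bakery analogy reduces to a simple cost model (the numbers are illustrative, not measured): if each decoder pass has fixed overhead, grouping 10 sound units per pass cuts the number of passes, and hence the compute per second of speech, by roughly 10x.

```python
def calls_needed(total_tokens, tokens_per_call):
    """How many decoder passes cover all tokens (ceiling division)."""
    return -(-total_tokens // tokens_per_call)

one_at_a_time = calls_needed(50, 1)   # classic AR: 50 passes per second of speech
grouped       = calls_needed(50, 10)  # grouped decoding: 5 passes
print(one_at_a_time, grouped)         # prints: 50 5
```

Same 50 cookies on the tray either way; the oven just opens 5 times instead of 50.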

3. The "High-Probability Mask" (Training)
To teach this robot, the researchers used a clever trick. Instead of showing it the whole book and asking it to guess the end, they covered up huge chunks of the audio and asked it to fill in the blanks very aggressively.

  • Analogy: It's like a teacher who covers up 80% of a student's homework and says, "Fill in these missing parts." By forcing the student to guess the missing pieces so often, the student becomes a genius at understanding the whole picture, not just the parts they can see. This made the model learn faster and sound more natural.
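The homework trick can be sketched as a masking function. The 80% ratio comes from the analogy above and is an assumption, not a quoted hyperparameter; the names (`apply_mask`, `MASK`) are hypothetical.

```python
import random

MASK = "<mask>"

def apply_mask(tokens, ratio=0.8, rng=None):
    """Hide roughly `ratio` of the tokens; return the masked sequence
    plus the hidden originals the model must learn to reconstruct."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < ratio:
            masked.append(MASK)
            targets[i] = tok   # training target at this position
        else:
            masked.append(tok)
    return masked, targets

tokens = [f"t{i}" for i in range(10)]
masked, targets = apply_mask(tokens)
print(masked.count(MASK), "of", len(tokens), "tokens hidden")
```

Training then scores the model only on the hidden positions, which is what forces it to infer the big picture from sparse context.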

Why Does This Matter?

The results are like upgrading from a dial-up internet connection to 5G:

  • Speed: It is 5.8 times faster at starting to speak (low latency). If you ask a smart assistant a question, it starts answering almost instantly, rather than making you wait for it to "think" about the whole sentence first.
  • Efficiency: It is 8.8 times more efficient (it uses less computer power to do the same job).
  • Quality: Despite being a speed demon, it doesn't sound robotic. It still sounds as natural as the slow, careful robots.

The Bottom Line

SyncSpeech is the first TTS model that can stream text and stream speech perfectly in sync. It doesn't wait for the whole sentence to be written before it starts talking, and it doesn't make you wait for it to finish one word before it starts the next. It's like a conversation where the computer is listening and speaking at the exact same time, just like a human would.
