The Big Problem: The "Memory Hoarder"
Imagine you are writing a very long story, one word at a time. Every time you write a new word, you have to re-read the entire story you've written so far to make sure the new word fits perfectly.
- The Old Way (Current TTS Models): If you want to generate a 1-hour audiobook, the computer has to re-read the first 3,999 words every single time it adds the 4,000th word.
- The Consequence: As the story gets longer, the computer's brain (memory) gets clogged up, and it gets slower and slower. Eventually, it runs out of memory and crashes. It's like trying to carry a backpack that gets heavier every step you take; eventually, you can't walk anymore.
The Solution: WAND (Windowed Attention and Knowledge Distillation)
The authors of this paper created a new framework called WAND. Think of it as giving the computer a "smart pair of glasses" and a "mentor."
1. The Smart Glasses: Splitting the View
Instead of staring at the whole story at once, WAND splits the view into two parts:
- The "Global" View (The Anchor): The computer keeps a permanent, clear view of the instructions. This includes the text you want spoken, the reference audio (to copy the voice), and the style tags. These are the "anchors" that never change.
- The "Local" View (The Sliding Window): For the words it is currently generating, the computer only looks at the last few words (a small window). It ignores the words from 10 minutes ago because, in speech, what you said a long time ago doesn't really matter for the sound of the next syllable.
The Analogy: Imagine driving a car.
- Global Attention: You keep your eyes on the map and the destination sign (the instructions). You never lose sight of where you are going.
- Local Attention: You only look at the road immediately in front of the car (the last few seconds). You don't need to look at the road you passed 5 miles ago to know how to steer right now.
- Result: Your brain (memory) stays light, and you can drive forever without getting tired.
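The split view above can be sketched as an attention mask. This is an illustrative toy, not the paper's actual implementation: the function name and shapes are made up, but it shows the key property that the number of visible positions stays fixed no matter how long generation runs.

```python
import numpy as np

def wand_style_mask(n_prompt, n_generated, window):
    """Build a boolean visibility mask for one decoding step.

    The current token may attend to:
      * all "global" prompt tokens (text, reference audio, style tags)
      * only the last `window` generated tokens (the sliding window)
    """
    n_total = n_prompt + n_generated
    mask = np.zeros(n_total, dtype=bool)
    mask[:n_prompt] = True                   # global anchors: always visible
    start = max(n_prompt, n_total - window)  # local window over recent tokens
    mask[start:] = True
    return mask

# With a 4-token prompt, 10 generated tokens, and a window of 3,
# the next token sees the prompt plus only the last 3 generated tokens:
m = wand_style_mask(n_prompt=4, n_generated=10, window=3)
print(m.sum())  # 7 visible positions
```

Generate 10 tokens or 10,000: the count of visible positions is the same, which is exactly why the "backpack" stops getting heavier.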
2. The Mentor: Knowledge Distillation
When you suddenly tell a computer to stop looking at the whole story and only look at the last few words, it gets confused and starts making mistakes (like sounding robotic or forgetting the accent).
To fix this, WAND uses a Teacher-Student approach:
- The Teacher: The original, heavy, slow computer that looks at everything.
- The Student: The new, fast, lightweight computer that only looks at the "window."
- The Lesson: The Teacher whispers the correct answers to the Student while the Student practices. This way, the Student learns to be just as good as the Teacher, but without needing the heavy memory.
The Analogy: It's like a master chef (Teacher) teaching an apprentice (Student). The apprentice doesn't need to memorize every single recipe in the world; they just need to watch the master cook a few dishes and learn the technique. Now the apprentice can cook great food using a much smaller kitchen.
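The "whispering" is usually a distillation loss. The sketch below uses the standard temperature-softened KL divergence; the paper's exact loss and weighting may differ, so treat this as a generic recipe, not WAND's specific one.

```python
import numpy as np

def softmax(logits, t=1.0):
    """Temperature-softened softmax along the last axis."""
    z = logits / t
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student): how far the student's predictions
    drift from the teacher's softened "whispered answers"."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return kl.mean() * temperature ** 2       # conventional t^2 scaling

logits = np.random.randn(8, 100)
print(distillation_loss(logits, logits))  # 0.0: a perfect student
```

Minimizing this loss pushes the lightweight windowed student to reproduce the full-attention teacher's output distribution step by step.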
The Results: Fast, Light, and Long
The paper tested this on three different modern speech systems. Here is what happened:
- Memory Savings: The computer's "backpack" became 66% lighter. It can now generate hours of audio without running out of memory.
- Speed: Because it doesn't have to re-read the whole history, the speed stays constant. Whether you are generating 1 second or 1 hour of audio, it takes the same amount of time per step.
- Quality: The speech sounds just as natural and human as the heavy, slow models.
- Efficiency: They only needed 100 hours of training data (a tiny amount for AI) to teach the new system how to do this.
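The constant memory and constant per-step speed both come from the same mechanism: the cache of past keys and values is capped at the window size. A toy sketch (class and variable names invented here) of that "conveyor belt":

```python
from collections import deque

class SlidingKVCache:
    """Toy key/value cache with fixed capacity.

    Once the window is full, the oldest entry is evicted, so memory
    stays constant however long generation runs -- the conveyor belt
    instead of the ever-heavier backpack.
    """
    def __init__(self, window):
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)    # deque drops the oldest item automatically
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

cache = SlidingKVCache(window=256)
for step in range(10_000):       # simulate a very long generation
    cache.append(f"k{step}", f"v{step}")
print(len(cache))  # 256 -- memory did not grow with sequence length
```

Because each step attends to at most 256 cached entries, the cost of step one million is the same as the cost of step one.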
Why This Matters
Before WAND, making long audiobooks or continuous voice assistants was a hardware nightmare. You needed expensive, powerful servers just to keep the memory from overflowing.
WAND changes the game. It allows us to generate infinite-length speech on regular hardware. It's the difference between trying to carry a mountain of bricks in your hands versus using a conveyor belt that only holds the bricks you need right now.
In short: WAND teaches AI to focus on what matters right now while remembering the big picture, making speech synthesis faster, cheaper, and capable of going on forever.