Improving X-Codec-2.0 for Multi-Lingual Speech: 25 Hz Latent Rate and 24 kHz Sampling

This paper presents an optimized version of X-Codec-2.0 that reduces the latent rate to 25 Hz and increases the sampling rate to 24 kHz through simple architectural adjustments, achieving superior multilingual speech quality and efficiency compared to the original baseline.

Husein Zolkepli

Published Tue, 10 Ma

Imagine you are trying to send a high-quality recording of a symphony orchestra to a friend, but your internet connection is slow. You have two choices:

  1. The Old Way (X-Codec-2.0): You send the music in tiny, rapid-fire snapshots (50 snapshots per second). Because you are sending so many snapshots, the file is huge, and it takes a long time to transmit. Also, to keep the file small, you have to compress the sound so much that the high notes (like the violins) sound a bit "muffled" or dull, like listening to music through a thick blanket.
  2. The New Way (This Paper's Solution): You decide to send fewer snapshots (only 25 per second), but you make each snapshot much bigger and clearer. You also upgrade the quality of the sound itself, making it crisp and bright (24 kHz instead of 16 kHz).

The Result? You send half as many snapshots, but the music sounds better and arrives faster.

Here is a breakdown of how the author, Husein Zolkepli, achieved this magic trick, using simple analogies:

1. The Problem: Too Many Tiny Photos

The original X-Codec-2.0 was like a security camera taking a picture of a speaker's voice 50 times every second.

  • The Issue: It was taking too many pictures (50 per second, i.e. 50 Hz), which made the data stream heavy. On top of that, the audio itself was sampled at only 16 kHz, which caps the highest reproducible frequency at 8 kHz. This is like looking at a photo that is slightly blurry; you miss the fine details of the high-pitched sounds (like a sibilant "s" or a crisp "t").

2. The Solution: The "Zoom Out" Trick

The author didn't rebuild the whole camera. Instead, they made two simple tweaks to the lens:

  • Tweak A: The "Step Ladder" (Increasing Hop Size):
    Imagine walking down a hallway. The old model took a small step every 320 samples of audio; the new model takes a giant stride of 960 samples. Because the new audio is also sampled faster (24,000 samples per second instead of 16,000), each second of sound now needs only 24,000 / 960 = 25 strides instead of 16,000 / 320 = 50. Bigger steps cover the same distance (time) with fewer steps (tokens), cutting the token rate in half (from 50 Hz to 25 Hz).
  • Tweak B: The "Grouping" (Pooling):
    Before taking the picture, the new model groups two tiny details together and averages them into one clear, strong detail. This is like looking at a crowd of people and describing them as "a group of 50 people" instead of listing every single face. This keeps the rhythm of the speech perfect but reduces the amount of data needed to describe it.
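The arithmetic behind both tweaks can be sketched in a few lines of Python. The hop sizes and sample rates come from the paper; the latent dimension of 512 and the frame tensor are just dummy placeholders for illustration:

```python
import numpy as np

# Tweak A: a bigger hop at a higher sample rate halves the token rate.
old_rate_hz = 16_000 / 320   # X-Codec-2.0: 16 kHz audio, hop of 320 samples -> 50 Hz
new_rate_hz = 24_000 / 960   # this paper:  24 kHz audio, hop of 960 samples -> 25 Hz
print(old_rate_hz, new_rate_hz)  # 50.0 25.0

# Tweak B: average-pool each pair of adjacent frames into one stronger frame.
frames = np.random.randn(50, 512)                 # one second of 50 Hz latents (dummy)
pooled = frames.reshape(25, 2, 512).mean(axis=1)  # 25 Hz after grouping pairs
print(pooled.shape)  # (25, 512)
```

The pooling keeps the timing of the speech intact because each output frame still covers the same stretch of audio, just summarized more coarsely.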

3. The Magic of "Stretching" the Decoder

When you change the steps (from small to big), the machine that reconstructs the sound (the decoder) gets confused, because its upsampling layers were built to expect the old step size.

  • The Fix: Instead of throwing away the old decoder and building a new one from scratch, the author used a technique called Linear Interpolation.
  • The Analogy: Imagine you have a rubber band with a pattern drawn on it. If you stretch the rubber band to make it longer, the pattern gets distorted. Instead of redrawing the whole pattern, the author simply "stretched" the existing pattern mathematically to fit the new size. This allowed the model to keep all its previous knowledge while adapting to the new, faster speed.
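The "stretching" idea can be sketched with plain NumPy; the paper's model presumably uses a framework interpolation op inside the network, and the tiny 4-step pattern below is just a toy stand-in:

```python
import numpy as np

def stretch(features: np.ndarray, new_len: int) -> np.ndarray:
    """Linearly interpolate a (time, dim) feature sequence to a new length."""
    old_len, dim = features.shape
    old_x = np.linspace(0.0, 1.0, old_len)   # positions of the original points
    new_x = np.linspace(0.0, 1.0, new_len)   # positions after stretching
    # Interpolate each feature channel independently along the time axis.
    return np.stack(
        [np.interp(new_x, old_x, features[:, d]) for d in range(dim)], axis=1
    )

pattern = np.arange(8, dtype=float).reshape(4, 2)  # the "pattern on the rubber band"
stretched = stretch(pattern, 7)                    # stretch 4 time steps to 7
print(stretched.shape)  # (7, 2)
```

The endpoints of the pattern stay exactly where they were, and the values in between are filled in smoothly, which is why the model can keep its learned "pattern" while running at a new length.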

4. The Results: Better Sound, Less Data

The author tested this new "Super Codec" on a massive library of voices speaking 116 different languages (from English to Malay to Hindi).

  • The Score: They used a robot judge (UTMOSv2) that predicts how much a human would like the sound. The new model scored 0.29 points higher than the old one. In the world of audio, that's a huge jump!
  • The Comparison: It beat almost every other competing audio codec that operates at this speed (25 Hz). It sounds clearer, especially for high-pitched sounds, and it's much more efficient for AI models to process.

5. What's Next? (The Limitations)

The author is honest about what this model can't do yet:

  • The "Clean Room" Problem: The model was trained mostly on clean, studio-quality voices. If you try to use it on a noisy street or an emotional, shouting voice, it might get confused. It's like a chef who is amazing at making perfect sushi but hasn't learned how to cook a messy, spicy curry yet.
  • The "Big Vocabulary" Challenge: Because the model is so efficient, each "token" (word/sound unit) carries a lot of information. This makes it slightly harder for other AI models to predict the next word, kind of like trying to guess the next card in a deck where every card is a complex puzzle instead of a simple number.
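To make the "big vocabulary" trade-off concrete, here is back-of-the-envelope arithmetic. The 25 Hz rate is from the paper, but the codebook size of 65,536 is a hypothetical value chosen purely for illustration:

```python
import math

frames_per_sec = 25      # latent rate from the paper
codebook_size = 65_536   # hypothetical codebook size, for illustration only

bits_per_token = math.log2(codebook_size)      # information carried by one token
bitrate_bps = frames_per_sec * bits_per_token  # total bits per second of speech
print(bits_per_token, bitrate_bps)  # 16.0 400.0
```

Halving the token rate halves the bitrate, but each remaining token still represents a choice among the full codebook, which is what makes next-token prediction harder for a downstream language model.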

Summary

The author took a very good audio compressor, slowed down its "shutter speed" to take fewer pictures, and made those pictures bigger and clearer. By doing this, they created a version that is twice as efficient but sounds better, making it a perfect tool for future AI assistants that need to speak many languages without lagging or sounding robotic.