Fish Audio S2 Technical Report

This paper introduces Fish Audio S2, an open-source text-to-speech system that leverages a multi-stage training pipeline to enable multi-speaker, multi-turn generation with natural-language instruction following, while providing production-ready weights and an efficient SGLang-based inference engine.

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han

Published Wed, 11 Ma

Imagine you have a digital voice actor who is incredibly talented but, until now, has been a bit of a "one-trick pony." They can read a script perfectly, but ask them to whisper a secret, sound angry, or switch characters mid-sentence, and they often get confused or sound robotic.

Fish Audio S2 is the upgrade that turns this digital actor into a true method actor who can follow your every whim, just by talking to them in plain English.

Here is a breakdown of how they did it, using some everyday analogies:

1. The "Two-Brain" System (The Architecture)

Most voice AI systems try to do everything at once: figure out what to say and how to say it simultaneously. It's like asking a chef to write a recipe, chop the vegetables, and cook the meal all in one second. It's messy and slow.

Fish Audio S2 splits the job into two specialized roles:

  • The Slow Brain (The Director): This part is like a movie director. It reads the script and decides the big picture: "Okay, this sentence needs to be whispered," or "Now the character is switching to a villain." It handles the meaning and the flow.
  • The Fast Brain (The Sound Engineer): This part is a lightning-fast technician. It listens to the Director's instructions and instantly generates the actual sound waves, adding the tiny details like breaths, cracks in the voice, and pitch changes.

By separating these jobs, the system can think deeply about the story while simultaneously producing high-quality sound at incredible speeds.
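The director/engineer split can be sketched as two cooperating components: a slow planner that decides *what and how*, and a fast renderer that turns that plan into sound. This is a minimal illustration of the idea only; all class and method names below are hypothetical, not Fish Audio S2's actual API.

```python
# Sketch of a two-stage ("two-brain") TTS pipeline.
# Names are illustrative assumptions, not the real implementation.

class SlowBrain:
    """The 'director': maps text + instructions to a high-level plan."""
    def plan(self, script: str, instruction: str) -> list[dict]:
        # In a real system this is a large model emitting semantic/prosody
        # tokens; here we fake a trivial one-step plan.
        return [{"text": script, "style": instruction}]

class FastBrain:
    """The 'sound engineer': turns plan steps into audio samples."""
    def render(self, plan: list[dict]) -> list[float]:
        # A real acoustic decoder streams waveform chunks; we return a
        # placeholder buffer whose length tracks the planned text.
        return [0.0] * sum(len(step["text"]) for step in plan)

def synthesize(script: str, instruction: str) -> list[float]:
    plan = SlowBrain().plan(script, instruction)   # slow, deliberate
    return FastBrain().render(plan)                # fast, streaming

audio = synthesize("Hello there.", "whisper")
print(len(audio))  # 12 placeholder samples, one per character
```

Because the two stages only communicate through the plan, the fast stage can start rendering early steps while the slow stage is still thinking about later ones.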

2. The "Smart Filter" Factory (The Data Pipeline)

To teach an AI to speak well, you need millions of hours of audio. But the internet is full of bad audio: background noise, overlapping voices, and people mumbling.

Instead of hiring thousands of humans to listen to every clip, Fish Audio built a self-cleaning factory:

  • The Quality Inspector: A smart robot that listens to audio and instantly rejects anything that sounds bad (like a bouncer at a club).
  • The Translator: Another robot that doesn't just write down what was said, but also describes how it was said. If a person laughs nervously, the robot writes: [nervous laugh] right next to the text.

The Magic Trick: Usually, the robots that clean the data are different from the robots that grade the AI's homework. Fish Audio used the same robots for both. This means the AI is graded on exactly the same standards it was taught, so it never gets confused by "distribution shift" (a fancy way of saying the rules change between learning and testing).
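The two factory robots above can be sketched as a quality gate plus a rich transcriber that keeps paralinguistic events inline with the text. The threshold, tag syntax, and field names here are illustrative assumptions, not the report's actual values.

```python
# Sketch of a self-cleaning data pipeline: an automatic quality gate
# ("the Quality Inspector") and a transcriber that keeps event tags
# like [nervous laugh] ("the Translator"). All values are made up.

def quality_gate(clip, min_score=0.8):
    """Reject clips whose estimated quality falls below a threshold."""
    return clip["quality"] >= min_score

def rich_transcribe(clip):
    """Produce text plus inline tags describing HOW it was said."""
    tags = "".join(f"[{event}] " for event in clip.get("events", []))
    return tags + clip["text"]

raw_clips = [
    {"text": "I did not expect that.", "quality": 0.93, "events": ["nervous laugh"]},
    {"text": "mumbled over traffic noise", "quality": 0.41},
]

# Reusing the same gate for training data and for evaluation is what
# avoids the train/test "distribution shift" described above.
clean = [rich_transcribe(c) for c in raw_clips if quality_gate(c)]
print(clean)  # ['[nervous laugh] I did not expect that.']
```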

3. The "Tough Coach" (Reinforcement Learning)

Once the AI knows the basics, it needs to learn to follow complex instructions. This is where Reinforcement Learning comes in. Think of this as a tough coach who doesn't just say "Good job" or "Bad job."

The coach uses a multi-dimensional scorecard:

  1. Did you say the right words? (Semantic Accuracy)
  2. Did you sound natural? (Acoustic Quality)
  3. Did you sound like the right person? (Speaker Similarity)

If the AI tries to skip a word or ignore an instruction like "speak slowly," the coach immediately deducts points. The AI learns through trial and error, trying thousands of variations until it gets the perfect score.
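The coach's scorecard is essentially a weighted sum of judged dimensions. The weights and raw scores below are illustrative assumptions; in practice the three dimensions would come from things like ASR-based word accuracy, an audio-quality predictor, and speaker-embedding similarity.

```python
# Sketch of a multi-dimensional reward ("scorecard") for RL training.
# Weights and scores are invented for illustration.

def scorecard_reward(semantic, acoustic, speaker,
                     w_sem=0.5, w_ac=0.3, w_spk=0.2):
    """Combine the three judged dimensions into one scalar reward."""
    return w_sem * semantic + w_ac * acoustic + w_spk * speaker

# A take that says the right words, sounds clean, and matches the voice:
good = scorecard_reward(semantic=1.0, acoustic=0.9, speaker=0.95)

# A take that dropped a word or ignored "speak slowly" loses points
# on the semantic dimension even though it still sounds nice:
bad = scorecard_reward(semantic=0.4, acoustic=0.9, speaker=0.95)

print(round(good, 2), round(bad, 2))
```

Because the penalty lands on a specific dimension, the model gets a directional signal (what went wrong), not just a pass/fail grade.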

4. The "Super-Expressive" Result

Because of these upgrades, Fish Audio S2 can do things that were previously impossible for open-source models:

  • The "Chameleon" Effect: You can give it a script with multiple characters, and it will naturally switch voices mid-sentence without you having to restart the generation.
  • The "Director's Cut": You can type instructions like "Say this part while crying, then switch to a whisper" and it will do exactly that. It understands natural language, not just code.
  • The "Marathon Runner": It can read a whole book chapter without losing its voice or getting tired, keeping the same tone and quality from start to finish.
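To make the "Chameleon" and "Director's Cut" ideas concrete, here is the kind of multi-speaker, instruction-tagged script such a system can accept in one generation. The bracket syntax and field names are invented for illustration; the report's actual prompt format may differ.

```python
# Illustrative multi-speaker script with inline style instructions.
# The tag format is hypothetical, not Fish Audio S2's real syntax.
script = """
[speaker: narrator, style: calm]
The door creaked open.
[speaker: villain, style: whisper]
I've been waiting for you.
[speaker: villain, style: angry]
And you kept me waiting too long!
"""

# A tiny parser showing that speaker and style switches are just
# plain text the model reads, not special API calls:
turns = []
for block in script.strip().split("[")[1:]:
    header, _, text = block.partition("]")
    meta = dict(part.strip().split(": ") for part in header.split(","))
    turns.append((meta["speaker"], meta["style"], text.strip()))

print(turns[1])  # ('villain', 'whisper', "I've been waiting for you.")
```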

5. The "Lightning Fast" Engine

Finally, they didn't just build a smart brain; they built a fast car to drive it. They used a special engine (SGLang) usually reserved for text chatbots.

  • The Result: It generates audio so fast that it feels like magic. You can start hearing the voice in less than 100 milliseconds (faster than a human blink), and it can generate audio 5 times faster than real-time. It's like having a voice actor who can record a whole audiobook in the time it takes to brew a cup of coffee.
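The two speed claims above correspond to two standard metrics: time-to-first-audio (how long before the first chunk arrives) and real-time factor (seconds of audio produced per second of compute). This sketch shows how they are measured against a stand-in streaming generator; the generator and its numbers are fake and do not reflect SGLang's actual interface or performance.

```python
# Measuring time-to-first-audio and real-time factor (RTF) against a
# fake streaming TTS engine. The engine here is a placeholder, not SGLang.
import time

def fake_streaming_tts(n_chunks=5, chunk_seconds=1.0, compute_per_chunk=0.01):
    """Yield the duration (in seconds) of each generated audio chunk."""
    for _ in range(n_chunks):
        time.sleep(compute_per_chunk)  # pretend compute time
        yield chunk_seconds

start = time.perf_counter()
audio_seconds = 0.0
first_chunk_latency = None
for seconds in fake_streaming_tts():
    if first_chunk_latency is None:
        # Time until the listener hears the FIRST chunk of audio.
        first_chunk_latency = time.perf_counter() - start
    audio_seconds += seconds
elapsed = time.perf_counter() - start

rtf = audio_seconds / elapsed  # > 1.0 means faster than real time
print(first_chunk_latency < 0.1, rtf > 1.0)
```

Streaming is what makes the sub-100 ms figure possible: the listener hears the first chunk while the rest is still being generated.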

The Bottom Line

Fish Audio S2 is a major leap forward because it treats voice generation not just as "reading text aloud," but as acting. By combining a smart two-part brain, a self-cleaning data factory, and a tough coaching system, they've created an open-source voice AI that is fast, expressive, and understands human instructions better than almost anything else available today.