Fish Audio S2 Technical Report

This paper introduces Fish Audio S2, an open-source text-to-speech system that leverages a multi-stage training pipeline to enable multi-speaker, multi-turn generation with natural-language instruction following, while providing production-ready weights and an efficient SGLang-based inference engine.

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han

Published Wed, 11 Ma

Imagine you have a digital voice actor who is incredibly talented but, until now, has been a bit of a "one-trick pony." They can read a script perfectly, but ask them to whisper a secret, sound angry, or switch characters mid-sentence, and they often get confused or sound robotic.

Fish Audio S2 is the upgrade that turns this digital actor into a true method actor who can follow your every whim, just by talking to them in plain English.

Here is a breakdown of how they did it, using some everyday analogies:

1. The "Two-Brain" System (The Architecture)

Most voice AI systems try to do everything at once: figure out what to say and how to say it simultaneously. It's like asking a chef to write a recipe, chop the vegetables, and cook the meal all in one second. It's messy and slow.

Fish Audio S2 splits the job into two specialized roles:

  • The Slow Brain (The Director): This part is like a movie director. It reads the script and decides the big picture: "Okay, this sentence needs to be whispered," or "Now the character is switching to a villain." It handles the meaning and the flow.
  • The Fast Brain (The Sound Engineer): This part is a lightning-fast technician. It listens to the Director's instructions and instantly generates the actual sound waves, adding the tiny details like breaths, cracks in the voice, and pitch changes.

By separating these jobs, the system can think deeply about the story while simultaneously producing high-quality sound at incredible speeds.
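The director/engineer split can be sketched as two cooperating components: a slow planner that decides *what and how*, and a fast renderer that turns that plan into sound. This is a minimal illustration of the idea only; all class and method names below are hypothetical, not Fish Audio S2's actual API.

```python
# Sketch of a two-stage ("two-brain") TTS pipeline.
# Names are illustrative assumptions, not the real implementation.

class SlowBrain:
    """The 'director': maps text + instructions to a high-level plan."""
    def plan(self, script: str, instruction: str) -> list[dict]:
        # In a real system this is a large model emitting semantic/prosody
        # tokens; here we fake a trivial one-step plan.
        return [{"text": script, "style": instruction}]

class FastBrain:
    """The 'sound engineer': turns plan steps into audio samples."""
    def render(self, plan: list[dict]) -> list[float]:
        # A real acoustic decoder streams waveform chunks; we return a
        # placeholder buffer whose length tracks the planned text.
        return [0.0] * sum(len(step["text"]) for step in plan)

def synthesize(script: str, instruction: str) -> list[float]:
    plan = SlowBrain().plan(script, instruction)   # slow, deliberate
    return FastBrain().render(plan)                # fast, streaming

audio = synthesize("Hello there.", "whisper")
print(len(audio))  # 12 placeholder samples, one per character
```

Because the two stages only communicate through the plan, the fast stage can start rendering early steps while the slow stage is still thinking about later ones.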

2. The "Smart Filter" Factory (The Data Pipeline)

To teach an AI to speak well, you need millions of hours of audio. But the internet is full of bad audio: background noise, overlapping voices, and people mumbling.

Instead of hiring thousands of humans to listen to every clip, Fish Audio built a self-cleaning factory:

  • The Quality Inspector: A smart robot that listens to audio and instantly rejects anything that sounds bad (like a bouncer at a club).
  • The Translator: Another robot that doesn't just write down what was said, but also describes how it was said. If a person laughs nervously, the robot writes: [nervous laugh] right next to the text.

The Magic Trick: Usually, the robots that clean the data are different from the robots that grade the AI's homework. Fish Audio used the same robots for both. This means the AI is graded on exactly the same standards it was taught, so it never gets confused by "distribution shift" (a fancy way of saying the rules change between learning and testing).
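The two factory robots above can be sketched as a quality gate plus a rich transcriber that keeps paralinguistic events inline with the text. The threshold, tag syntax, and field names here are illustrative assumptions, not the report's actual values.

```python
# Sketch of a self-cleaning data pipeline: an automatic quality gate
# ("the Quality Inspector") and a transcriber that keeps event tags
# like [nervous laugh] ("the Translator"). All values are made up.

def quality_gate(clip, min_score=0.8):
    """Reject clips whose estimated quality falls below a threshold."""
    return clip["quality"] >= min_score

def rich_transcribe(clip):
    """Produce text plus inline tags describing HOW it was said."""
    tags = "".join(f"[{event}] " for event in clip.get("events", []))
    return tags + clip["text"]

raw_clips = [
    {"text": "I did not expect that.", "quality": 0.93, "events": ["nervous laugh"]},
    {"text": "mumbled over traffic noise", "quality": 0.41},
]

# Reusing the same gate for training data and for evaluation is what
# avoids the train/test "distribution shift" described above.
clean = [rich_transcribe(c) for c in raw_clips if quality_gate(c)]
print(clean)  # ['[nervous laugh] I did not expect that.']
```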

3. The "Tough Coach" (Reinforcement Learning)

Once the AI knows the basics, it needs to learn to follow complex instructions. This is where Reinforcement Learning comes in. Think of this as a tough coach who doesn't just say "Good job" or "Bad job."

The coach uses a multi-dimensional scorecard:

  1. Did you say the right words? (Semantic Accuracy)
  2. Did you sound natural? (Acoustic Quality)
  3. Did you sound like the right person? (Speaker Similarity)

If the AI tries to skip a word or ignore an instruction like "speak slowly," the coach immediately deducts points. The AI learns through trial and error, trying thousands of variations until it gets the perfect score.
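The coach's scorecard is essentially a weighted sum of judged dimensions. The weights and raw scores below are illustrative assumptions; in practice the three dimensions would come from things like ASR-based word accuracy, an audio-quality predictor, and speaker-embedding similarity.

```python
# Sketch of a multi-dimensional reward ("scorecard") for RL training.
# Weights and scores are invented for illustration.

def scorecard_reward(semantic, acoustic, speaker,
                     w_sem=0.5, w_ac=0.3, w_spk=0.2):
    """Combine the three judged dimensions into one scalar reward."""
    return w_sem * semantic + w_ac * acoustic + w_spk * speaker

# A take that says the right words, sounds clean, and matches the voice:
good = scorecard_reward(semantic=1.0, acoustic=0.9, speaker=0.95)

# A take that dropped a word or ignored "speak slowly" loses points
# on the semantic dimension even though it still sounds nice:
bad = scorecard_reward(semantic=0.4, acoustic=0.9, speaker=0.95)

print(round(good, 2), round(bad, 2))
```

Because the penalty lands on a specific dimension, the model gets a directional signal (what went wrong), not just a pass/fail grade.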

4. The "Super-Expressive" Result

Because of these upgrades, Fish Audio S2 can do things that were previously impossible for open-source models:

  • The "Chameleon" Effect: You can give it a script with multiple characters, and it will naturally switch voices mid-sentence without you having to restart the generation.
  • The "Director's Cut": You can type instructions like "Say this part while crying, then switch to a whisper" and it will do exactly that. It understands natural language, not just code.
  • The "Marathon Runner": It can read a whole book chapter without losing its voice or getting tired, keeping the same tone and quality from start to finish.
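To make the "Chameleon" and "Director's Cut" ideas concrete, here is the kind of multi-speaker, instruction-tagged script such a system can accept in one generation. The bracket syntax and field names are invented for illustration; the report's actual prompt format may differ.

```python
# Illustrative multi-speaker script with inline style instructions.
# The tag format is hypothetical, not Fish Audio S2's real syntax.
script = """
[speaker: narrator, style: calm]
The door creaked open.
[speaker: villain, style: whisper]
I've been waiting for you.
[speaker: villain, style: angry]
And you kept me waiting too long!
"""

# A tiny parser showing that speaker and style switches are just
# plain text the model reads, not special API calls:
turns = []
for block in script.strip().split("[")[1:]:
    header, _, text = block.partition("]")
    meta = dict(part.strip().split(": ") for part in header.split(","))
    turns.append((meta["speaker"], meta["style"], text.strip()))

print(turns[1])  # ('villain', 'whisper', "I've been waiting for you.")
```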

5. The "Lightning Fast" Engine

Finally, they didn't just build a smart brain; they built a fast car to drive it. They used a special engine (SGLang) usually reserved for text chatbots.

  • The Result: It generates audio so fast that it feels like magic. You can start hearing the voice in less than 100 milliseconds (faster than a human blink), and it can generate audio 5 times faster than real-time. It's like having a voice actor who can record a whole audiobook in the time it takes to brew a cup of coffee.
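The two speed claims above correspond to two standard metrics: time-to-first-audio (how long before the first chunk arrives) and real-time factor (seconds of audio produced per second of compute). This sketch shows how they are measured against a stand-in streaming generator; the generator and its numbers are fake and do not reflect SGLang's actual interface or performance.

```python
# Measuring time-to-first-audio and real-time factor (RTF) against a
# fake streaming TTS engine. The engine here is a placeholder, not SGLang.
import time

def fake_streaming_tts(n_chunks=5, chunk_seconds=1.0, compute_per_chunk=0.01):
    """Yield the duration (in seconds) of each generated audio chunk."""
    for _ in range(n_chunks):
        time.sleep(compute_per_chunk)  # pretend compute time
        yield chunk_seconds

start = time.perf_counter()
audio_seconds = 0.0
first_chunk_latency = None
for seconds in fake_streaming_tts():
    if first_chunk_latency is None:
        # Time until the listener hears the FIRST chunk of audio.
        first_chunk_latency = time.perf_counter() - start
    audio_seconds += seconds
elapsed = time.perf_counter() - start

rtf = audio_seconds / elapsed  # > 1.0 means faster than real time
print(first_chunk_latency < 0.1, rtf > 1.0)
```

Streaming is what makes the sub-100 ms figure possible: the listener hears the first chunk while the rest is still being generated.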

The Bottom Line

Fish Audio S2 is a major leap forward because it treats voice generation not just as "reading text aloud," but as acting. By combining a smart two-part brain, a self-cleaning data factory, and a tough coaching system, they've created an open-source voice AI that is fast, expressive, and understands human instructions better than almost anything else available today.