Imagine you're talking to a robot friend. In the past, these AI assistants were like the stiff, monotone robots from old movies—they could answer your questions, but they sounded the same whether they were telling a joke or delivering bad news. They lacked "flavor."
Recently, new "Speech Language Models" (SLMs) have arrived. These are like actors who can not only read a script but also change their voice to sound happy, angry, fast, or loud. But here's the problem: How do we know if they are actually good at acting, or if they are just faking it?
That's exactly what this paper, StyleBench, is trying to solve.
🎭 The Problem: The "Fake Smile" of AI
The authors noticed that while these AI voices are getting better, there's no standard "driver's license test" for them. We know they can change their tone, but we don't have a systematic way to measure:
- Can they get really angry, or just slightly annoyed?
- Can they speed up their speech like a nervous person, or do they just mumble faster?
- If you ask them to be "happier" in the middle of a conversation, do they actually get happier, or do they just say "Okay" in the same boring voice?
🏗️ The Solution: Building "StyleBench"
To fix this, the researchers built a giant testing ground called StyleBench. Think of it as a gym for AI voices.
Instead of just asking the AI a single question, they created multi-turn conversations (like a real chat).
- Turn 1: The AI speaks normally (neutral).
- Turn 2: You ask, "Can you say that again, but sound angry?"
- Turn 3: You ask, "Okay, now make it really, really angry!"
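The escalating multi-turn setup above can be sketched as a simple data structure. This is a hypothetical illustration—the field names here are ours, not taken from the paper:

```python
# Conceptual sketch of one StyleBench-style test case (hypothetical
# field names; the paper's actual data format may differ).
test_case = {
    "axis": "emotion",  # one of: emotion, speed, volume, pitch
    "turns": [
        {"instruction": None,                          # Turn 1: neutral baseline
         "expected_style": "neutral"},
        {"instruction": "Say that again, but angry.",  # Turn 2: apply the style
         "expected_style": "angry"},
        {"instruction": "Now make it really angry!",   # Turn 3: escalate intensity
         "expected_style": "very_angry"},
    ],
}

# A grader would check that style intensity rises turn over turn
# while the spoken words stay the same.
for turn in test_case["turns"]:
    print(turn["expected_style"])
```

The key design point is the escalation in Turn 3: it separates models that can apply a style once from models that can keep turning the knob further.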
They tested the AI on four specific "muscles" of voice:
- Emotion: (Happy, Sad, Angry, etc.)
- Speed: (Slow and lazy vs. Fast and frantic)
- Volume: (Whispering vs. Shouting)
- Pitch: (High and squeaky vs. Low and deep)
To make sure the test was fair, they used a "control group" method. They took the exact same sentence and asked the AI to say it in different ways. If the AI changed the words, it failed. If it kept the words the same but changed the vibe, it passed.
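The "same words, different vibe" pass/fail check can be sketched as a transcript comparison. This is a minimal sketch under our own assumptions—a real pipeline would first transcribe the generated audio with a speech-recognition model:

```python
import re

def words_unchanged(reference_text: str, transcript: str) -> bool:
    """Pass only if the spoken words match the reference after
    normalizing case and punctuation; the style (tone, speed,
    volume, pitch) is free to differ in the audio itself."""
    def normalize(s: str) -> list[str]:
        return re.sub(r"[^a-z0-9 ]", "", s.lower()).split()
    return normalize(reference_text) == normalize(transcript)

# Passed: same words, presumably a different tone in the audio.
print(words_unchanged("It's raining today.", "its raining today"))  # True
# Failed: the model rewrote the sentence instead of restyling it.
print(words_unchanged("It's raining today.", "Wow, rain again?!"))  # False
```

Keeping the words fixed is what makes the comparison fair: any measured difference between the two clips must come from the delivery, not the content.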
🏆 The Results: Who Won the Acting Award?
The researchers tested 10 different AI models (some small, some huge). Here is what they found:
- The "Good Actors": Models like Kimi-Audio and GLM-4-Voice were the stars of the show. When asked to get angrier or louder, they actually did it. They understood the nuance of "a little bit faster" vs. "super fast."
- The "Method Actors" who got stuck: Some models could handle a simple request but failed when asked to increase the intensity. They got stuck in a loop, unable to turn the volume knob up further.
- The "Robots": Some models (like LLaMA-omni2) basically ignored the instructions. You'd ask for a happy tone, and they'd give you a robot voice. They were great at answering questions, but terrible at acting.
🔍 Why Did Some Fail? (The Secret Sauce)
The paper dug deep to find out why the winners were better than the losers. They found two main reasons:
What They Ate (Training Data):
Imagine training a chef. If you only feed them recipes for plain boiled chicken (standard tasks like reading text), they won't know how to make a spicy, complex dish.
- The losers were trained mostly on standard tasks (reading text, answering questions).
- The winners were fed a special diet of data that included natural conversations with lots of emotion and style variations. They learned from real human interactions.
The Translation Tool (Tokenizers):
AI speaks a secret code. To turn that code back into human speech, it needs a translator (a tokenizer).
- Some models use a translator that strips away the "flavor" (the emotion and tone) to focus only on the meaning.
- The winners use a high-fidelity translator that keeps the "spice" intact. It knows that the code for "I'm happy" is different from the code for "I'm angry," even if the words are the same.
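The difference between the two kinds of "translators" can be illustrated with a toy example. This is purely conceptual—real speech tokenizers operate on audio features, not strings:

```python
# Toy illustration: a "semantic-only" tokenizer keeps just the words,
# so a happy and an angry reading of the same sentence collapse into
# the same tokens. A "high-fidelity" tokenizer lets style travel with
# the tokens, so the two readings stay distinguishable.

def semantic_tokenize(words: list[str], style: str) -> list[str]:
    return list(words)  # style is discarded

def high_fidelity_tokenize(words: list[str], style: str) -> list[str]:
    return [f"{w}|{style}" for w in words]  # style rides along

words = ["i'm", "happy"]
# Same tokens for happy and angry readings: the flavor is stripped.
print(semantic_tokenize(words, "happy") == semantic_tokenize(words, "angry"))       # True
# Different tokens: the spice stays intact.
print(high_fidelity_tokenize(words, "happy") == high_fidelity_tokenize(words, "angry"))  # False
```

If the tokens for a happy and an angry delivery are identical, no amount of clever modeling downstream can recover the lost emotion—which is why the paper points to the tokenizer as a deciding factor.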
🚀 The Big Takeaway
This paper is a wake-up call for the AI world. Just because an AI can talk doesn't mean it can communicate with style.
StyleBench gives us a ruler to measure how good these AI voices really are. It shows us that to build a truly realistic AI companion—one that can laugh with you, comfort you, or get excited with you—we need to train them on better data and build better "voice translators."
In short: We are moving from AI that just talks to AI that truly speaks with feeling.