AnimeScore: A Preference-Based Dataset and Framework for Evaluating Anime-Like Speech Style

This paper introduces AnimeScore, a preference-based framework and dataset that establishes a standardized objective metric for evaluating "anime-like" speech by leveraging pairwise rankings and SSL-based models to overcome the limitations of traditional subjective assessments.

Joonyong Park, Jerry Li

Published Fri, 13 Ma

Imagine you are a director trying to cast a voice actor for a new anime character. You need a voice that sounds "anime-like"—energetic, expressive, and distinct from a normal news anchor.

In the past, checking whether a computer-generated voice hit this mark was like asking 100 people to rate a painting on a scale of 1 to 100: everyone has a different idea of what "anime-like" means, so the results were messy, expensive, and slow.

This paper introduces AnimeScore, a new way to solve this problem. Think of it as turning a confusing math test into a simple game of "This or That."

Here is the breakdown of how they did it, using everyday analogies:

1. The Problem: The "Subjective Score" Trap

Usually, when we rate speech, we ask people: "On a scale of 1 to 5, how natural does this sound?"
But "anime-like" isn't like "natural." It's more like asking, "Is this painting more like Van Gogh or Picasso?" You can't really give it a single number. One person might think high-pitched screams are anime-like, while another might think it's about clear, fast talking. This made it hard for AI developers to know if their voice models were improving.

2. The Solution: The "Blind Taste Test"

Instead of asking people to give a score, the researchers asked them to play a game: "Which of these two voices sounds more like an anime character?"

  • The Data: They gathered 15,000 of these "A vs. B" choices from 187 different people.
  • The Trick: To make sure people weren't just guessing based on the words being spoken (like hearing a word that only appears in anime scripts), they filtered out the text and focused purely on the sound.
  • The Result: They created a massive dataset that acts as a "gold standard" for what humans actually prefer.
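To turn thousands of "A vs. B" votes into a per-clip score, a standard statistical tool is the Bradley-Terry model (the paper's exact aggregation may differ). Here is a minimal pure-Python sketch on toy data, where each pair records a winner and a loser:

```python
from collections import defaultdict

def bradley_terry(pairs, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    wins = defaultdict(int)
    items = set()
    for w, l in pairs:
        wins[w] += 1
        items.update((w, l))
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            # Denominator sums 1/(p_i + p_j) over every game item i played.
            denom = 0.0
            for w, l in pairs:
                if i in (w, l):
                    other = l if i == w else w
                    denom += 1.0 / (p[i] + p[other])
            new[i] = wins[i] / denom if denom else p[i]
        s = sum(new.values())           # normalize so strengths sum to 1
        p = {i: v / s for i, v in new.items()}
    return p

# Toy "this or that" outcomes: A beats B twice, B beats C, A beats C.
pairs = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")]
scores = bradley_terry(pairs)
```

The fitted strengths rank clips by how often (and against whom) they won, which is exactly what a single rating scale could not capture.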

3. The Discovery: It's Not Just "High Pitch"

There is a common myth that anime voices are just "high-pitched squeaks." The researchers used their new data to prove this wrong. They analyzed the winning voices and found the "secret sauce" is actually a mix of three things:

  • The "Resonance" (The Instrument): It's not about being high-pitched; it's about the shape of the voice. Imagine a violin vs. a flute. Anime voices use a specific "resonance shaping" that makes them sound fuller and more controlled, not just squeaky.
  • The "Flow" (The River): The winning voices had a very smooth, continuous flow of sound. They didn't have many awkward pauses or breaks. It's like a river that keeps moving without hitting rocks.
  • The "Clarity" (The Chef): The speakers were very deliberate with their pronunciation. They spoke quickly (like a fast river) but didn't slur their words. It was a "dense flow" with "delicate enunciation."

The Analogy: If a normal voice is a casual conversation at a coffee shop, an anime voice is like a professional radio host reading a script with perfect energy, clear diction, and zero stumbles.
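The "flow" idea above can be made concrete by measuring how much of a clip is silence. This is an illustrative proxy only, not the paper's actual feature set; a sketch using NumPy on a synthetic clip:

```python
import numpy as np

def pause_ratio(wav, frame=400, hop=160, silence_db=-40.0):
    """Crude 'flow' proxy: fraction of frames quieter than a silence threshold."""
    frames = np.lib.stride_tricks.sliding_window_view(wav, frame)[::hop]
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    db = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)  # dB relative to peak
    return float((db < silence_db).mean())

# Synthetic example: 1 s of a 220 Hz tone followed by 1 s of silence.
sr = 16000
t = np.arange(sr) / sr
wav = np.concatenate([0.5 * np.sin(2 * np.pi * 220 * t), np.zeros(sr)])
ratio = pause_ratio(wav)   # roughly half the clip is silent
```

A "smooth river" voice would score a low pause ratio; a halting one would score high. Resonance and enunciation would need richer features (formants, spectral shape), which is part of why hand-crafted rules fall short, as the next section shows.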

4. The AI Teacher: Learning from the Game

The researchers built an AI model (a "teacher") to learn from these 15,000 games.

  • Old Way (Hand-crafted rules): They tried to teach the AI using simple math rules (like "if pitch is high, give points"). This was like trying to teach someone to cook by giving them a list of ingredients. It worked okay (about 69% accuracy), but it missed the nuance.
  • New Way (The "Deep Learner"): They used a modern AI technique called SSL (Self-Supervised Learning). Think of this as letting the AI listen to thousands of hours of audio and figure out the patterns on its own, without being told the rules.
  • The Result: The new AI became a master judge. It could predict which voice sounded more "anime-like" with 90.8% accuracy. It learned the subtle "vibe" that the simple math rules missed.
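The SSL approach boils down to: embed each clip, then train a small "preference head" so the winner of each game scores higher than the loser. The sketch below uses random vectors in place of real SSL embeddings and a logistic (Bradley-Terry-style) loss; it is a toy illustration, not the authors' architecture:

```python
import numpy as np

# Stand-in for SSL embeddings: in the paper, each clip would be encoded by a
# self-supervised speech model; here random vectors with a hidden
# "anime-ness" direction keep the sketch self-contained.
rng = np.random.default_rng(0)
dim, n_clips, n_pairs = 16, 200, 1000
w_true = rng.normal(size=dim)
X = rng.normal(size=(n_clips, dim))
truth = X @ w_true

# Simulate "this or that" judgments: column 0 holds the winner's index.
idx = rng.integers(0, n_clips, size=(n_pairs, 2))
idx = idx[idx[:, 0] != idx[:, 1]]
wins = np.where((truth[idx[:, 0]] > truth[idx[:, 1]])[:, None], idx, idx[:, ::-1])

# Fit a linear preference head with a logistic (Bradley-Terry) loss.
D = X[wins[:, 0]] - X[wins[:, 1]]        # winner minus loser embeddings
w = np.zeros(dim)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(D @ w)))   # predicted P(winner beats loser)
    w += 0.1 * D.T @ (1.0 - p) / len(D)  # gradient ascent on log-likelihood

accuracy = float((D @ w > 0).mean())     # pairwise agreement on training pairs
```

The key design point is that the model is never shown an absolute "anime-ness" score, only which of two clips won, which is exactly the shape of the human data.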

5. Why This Matters

This isn't just about making anime voices. It's about giving AI developers a compass.

  • Before: Developers had to guess if their new voice model was good, then pay humans to listen and give feedback. It was slow and expensive.
  • Now: They can use AnimeScore as an automatic "reward signal." It's like a video game score that instantly tells the AI, "Good job, that voice sounds more like an anime character!" This allows them to train better voices much faster.
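One simple way a learned judge acts as a "reward signal" is best-of-N reranking: generate several takes of the same line and keep the one the judge scores highest. `score_anime` below is a hypothetical stand-in for the real model, which would embed the audio and apply the learned preference head:

```python
# Hypothetical: rerank TTS candidates by an AnimeScore-style reward.
def score_anime(clip):
    # Toy proxy only; the real scorer runs on audio, not these fields.
    return clip["smoothness"] + clip["clarity"]

candidates = [
    {"id": "take1", "smoothness": 0.4, "clarity": 0.9},
    {"id": "take2", "smoothness": 0.8, "clarity": 0.7},
    {"id": "take3", "smoothness": 0.2, "clarity": 0.3},
]
best = max(candidates, key=score_anime)  # the take the judge prefers
```

The same score could also drive preference-based fine-tuning of the voice model itself, replacing slow rounds of human listening tests.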

Summary

The paper says: "Stop trying to measure 'anime-ness' with a ruler. Instead, play a game of 'This or That' to train an AI. We found that anime voices aren't just high-pitched; they are smooth, clear, and emotionally expressive. Our new AI tool can now spot these differences better than any human rulebook ever could."