AnimeScore: A Preference-Based Dataset and Framework for Evaluating Anime-Like Speech Style

This paper introduces AnimeScore, a preference-based framework and dataset that establishes a standardized objective metric for evaluating "anime-like" speech by leveraging pairwise rankings and SSL-based models to overcome the limitations of traditional subjective assessments.

Joonyong Park, Jerry Li

Published Fri, 13 Ma

Imagine you are a director trying to cast a voice actor for a new anime character. You need a voice that sounds "anime-like"—energetic, expressive, and distinct from a normal news anchor.

In the past, checking whether a computer-generated voice hit this mark was like asking 100 people to rate a painting on a scale of 1 to 100: everyone has a different idea of what "anime-like" means, so the results were messy, expensive, and slow.

This paper introduces AnimeScore, a new way to solve this problem. Think of it as turning a confusing math test into a simple game of "This or That."

Here is the breakdown of how they did it, using everyday analogies:

1. The Problem: The "Subjective Score" Trap

Usually, when we rate speech, we ask people: "On a scale of 1 to 5, how natural does this sound?"
But "anime-like" isn't like "natural." It's more like asking, "Is this painting more like Van Gogh or Picasso?" You can't really give it a single number. One person might think high-pitched screams are anime-like, while another might think it's about clear, fast talking. This made it hard for AI developers to know if their voice models were improving.

2. The Solution: The "Blind Taste Test"

Instead of asking people to give a score, the researchers asked them to play a game: "Which of these two voices sounds more like an anime character?"

  • The Data: They gathered 15,000 of these "A vs. B" choices from 187 different people.
  • The Trick: To make sure people weren't just guessing based on the words being spoken (like hearing a word that only appears in anime scripts), they filtered out the text and focused purely on the sound.
  • The Result: They created a massive dataset that acts as a "gold standard" for what humans actually prefer.
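To turn thousands of "A vs. B" votes into a per-clip score, a standard statistical tool is the Bradley-Terry model (the paper's exact aggregation may differ). Here is a minimal pure-Python sketch on toy data, where each pair records a winner and a loser:

```python
from collections import defaultdict

def bradley_terry(pairs, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates."""
    wins = defaultdict(int)
    items = set()
    for w, l in pairs:
        wins[w] += 1
        items.update((w, l))
    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new = {}
        for i in items:
            # Denominator sums 1/(p_i + p_j) over every game item i played.
            denom = 0.0
            for w, l in pairs:
                if i in (w, l):
                    other = l if i == w else w
                    denom += 1.0 / (p[i] + p[other])
            new[i] = wins[i] / denom if denom else p[i]
        s = sum(new.values())           # normalize so strengths sum to 1
        p = {i: v / s for i, v in new.items()}
    return p

# Toy "this or that" outcomes: A beats B twice, B beats C, A beats C.
pairs = [("A", "B"), ("A", "B"), ("B", "C"), ("A", "C")]
scores = bradley_terry(pairs)
```

The fitted strengths rank clips by how often (and against whom) they won, which is exactly what a single rating scale could not capture.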

3. The Discovery: It's Not Just "High Pitch"

There is a common myth that anime voices are just "high-pitched squeaks." The researchers used their new data to prove this wrong. They analyzed the winning voices and found the "secret sauce" is actually a mix of three things:

  • The "Resonance" (The Instrument): It's not about being high-pitched; it's about the shape of the voice. Imagine a violin vs. a flute. Anime voices use a specific "resonance shaping" that makes them sound fuller and more controlled, not just squeaky.
  • The "Flow" (The River): The winning voices had a very smooth, continuous flow of sound. They didn't have many awkward pauses or breaks. It's like a river that keeps moving without hitting rocks.
  • The "Clarity" (The Chef): The speakers were very deliberate with their pronunciation. They spoke quickly (like a fast river) but didn't slur their words. It was a "dense flow" with "delicate enunciation."

The Analogy: If a normal voice is a casual conversation at a coffee shop, an anime voice is like a professional radio host reading a script with perfect energy, clear diction, and zero stumbles.
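The "flow" idea above can be made concrete by measuring how much of a clip is silence. This is an illustrative proxy only, not the paper's actual feature set; a sketch using NumPy on a synthetic clip:

```python
import numpy as np

def pause_ratio(wav, frame=400, hop=160, silence_db=-40.0):
    """Crude 'flow' proxy: fraction of frames quieter than a silence threshold."""
    frames = np.lib.stride_tricks.sliding_window_view(wav, frame)[::hop]
    rms = np.sqrt((frames ** 2).mean(axis=1) + 1e-12)
    db = 20 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)  # dB relative to peak
    return float((db < silence_db).mean())

# Synthetic example: 1 s of a 220 Hz tone followed by 1 s of silence.
sr = 16000
t = np.arange(sr) / sr
wav = np.concatenate([0.5 * np.sin(2 * np.pi * 220 * t), np.zeros(sr)])
ratio = pause_ratio(wav)   # roughly half the clip is silent
```

A "smooth river" voice would score a low pause ratio; a halting one would score high. Resonance and enunciation would need richer features (formants, spectral shape), which is part of why hand-crafted rules fall short, as the next section shows.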

4. The AI Teacher: Learning from the Game

The researchers built an AI model (a "teacher") to learn from these 15,000 games.

  • Old Way (Hand-crafted rules): They tried to teach the AI using simple math rules (like "if pitch is high, give points"). This was like trying to teach someone to cook by giving them a list of ingredients. It worked okay (about 69% accuracy), but it missed the nuance.
  • New Way (The "Deep Learner"): They used a modern AI technique called SSL (Self-Supervised Learning). Think of this as letting the AI listen to thousands of hours of audio and figure out the patterns on its own, without being told the rules.
  • The Result: The new AI became a master judge. It could predict which voice sounded more "anime-like" with 90.8% accuracy. It learned the subtle "vibe" that the simple math rules missed.
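The SSL approach boils down to: embed each clip, then train a small "preference head" so the winner of each game scores higher than the loser. The sketch below uses random vectors in place of real SSL embeddings and a logistic (Bradley-Terry-style) loss; it is a toy illustration, not the authors' architecture:

```python
import numpy as np

# Stand-in for SSL embeddings: in the paper, each clip would be encoded by a
# self-supervised speech model; here random vectors with a hidden
# "anime-ness" direction keep the sketch self-contained.
rng = np.random.default_rng(0)
dim, n_clips, n_pairs = 16, 200, 1000
w_true = rng.normal(size=dim)
X = rng.normal(size=(n_clips, dim))
truth = X @ w_true

# Simulate "this or that" judgments: column 0 holds the winner's index.
idx = rng.integers(0, n_clips, size=(n_pairs, 2))
idx = idx[idx[:, 0] != idx[:, 1]]
wins = np.where((truth[idx[:, 0]] > truth[idx[:, 1]])[:, None], idx, idx[:, ::-1])

# Fit a linear preference head with a logistic (Bradley-Terry) loss.
D = X[wins[:, 0]] - X[wins[:, 1]]        # winner minus loser embeddings
w = np.zeros(dim)
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(D @ w)))   # predicted P(winner beats loser)
    w += 0.1 * D.T @ (1.0 - p) / len(D)  # gradient ascent on log-likelihood

accuracy = float((D @ w > 0).mean())     # pairwise agreement on training pairs
```

The key design point is that the model is never shown an absolute "anime-ness" score, only which of two clips won, which is exactly the shape of the human data.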

5. Why This Matters

This isn't just about making anime voices. It's about giving AI developers a compass.

  • Before: Developers had to guess if their new voice model was good, then pay humans to listen and give feedback. It was slow and expensive.
  • Now: They can use AnimeScore as an automatic "reward signal." It's like a video game score that instantly tells the AI, "Good job, that voice sounds more like an anime character!" This allows them to train better voices much faster.
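One simple way a learned judge acts as a "reward signal" is best-of-N reranking: generate several takes of the same line and keep the one the judge scores highest. `score_anime` below is a hypothetical stand-in for the real model, which would embed the audio and apply the learned preference head:

```python
# Hypothetical: rerank TTS candidates by an AnimeScore-style reward.
def score_anime(clip):
    # Toy proxy only; the real scorer runs on audio, not these fields.
    return clip["smoothness"] + clip["clarity"]

candidates = [
    {"id": "take1", "smoothness": 0.4, "clarity": 0.9},
    {"id": "take2", "smoothness": 0.8, "clarity": 0.7},
    {"id": "take3", "smoothness": 0.2, "clarity": 0.3},
]
best = max(candidates, key=score_anime)  # the take the judge prefers
```

The same score could also drive preference-based fine-tuning of the voice model itself, replacing slow rounds of human listening tests.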

Summary

The paper says: "Stop trying to measure 'anime-ness' with a ruler. Instead, play a game of 'This or That' to train an AI. We found that anime voices aren't just high-pitched; they are smooth, clear, and emotionally expressive. Our new AI tool can now spot these differences better than any human rulebook ever could."