Human-CLAP: Human-perception-based contrastive language-audio pretraining

This paper introduces Human-CLAP, a human-perception-based contrastive language-audio model trained on subjective evaluation scores. By learning directly from human ratings, it substantially improves the correlation between automated CLAP scores and human judgments, addressing the previously low alignment of standard CLAP metrics with human perception.

Taisei Takano, Yuki Okamoto, Yusuke Kanamori, Yuki Saito, Ryotaro Nagase, Hiroshi Saruwatari

Published Wed, 11 Ma

Here is an explanation of the paper "Human-CLAP" using simple language and creative analogies.

The Big Problem: The Robot vs. The Human

Imagine you are a chef trying to invent a new recipe. You have a robot assistant that helps you taste your dishes.

  • The Robot (Old CLAP): This robot has read millions of cookbooks and knows that "soup" usually goes with "hot." If you serve it a cold bowl of soup labeled "hot soup," it might say, "This is a perfect match!" because the words line up with its database, even though the actual dish is wrong.
  • The Human (You): You take a bite and say, "Ew, this is cold. It doesn't match the description."

The problem the researchers found is that the Robot (CLAP) is very good at matching words to sounds based on statistics, but it is terrible at understanding if a human would actually like or agree with that match. In the world of AI music and sound generation, the robot's "score" (called CLAPScore) was often high even when humans thought the sound was garbage.
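At its core, the score the "robot" produces is just a similarity measure between two embedding vectors: one for the audio, one for the text. Here is a minimal sketch of that idea (the toy vectors below are stand-ins, not real CLAP embeddings):

```python
import numpy as np

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """CLAPScore-style metric: cosine similarity between the audio
    embedding and the text embedding produced by the model."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(a @ t)

# Toy vectors standing in for real CLAP embeddings.
audio = np.array([0.2, 0.9, 0.1])
text_match = np.array([0.25, 0.85, 0.05])   # points the same way -> high score
text_mismatch = np.array([-0.9, 0.1, 0.4])  # points elsewhere -> low score

print(clap_score(audio, text_match))
print(clap_score(audio, text_mismatch))
```

The catch, as the analogy above suggests, is that high cosine similarity only means the embeddings agree statistically; it says nothing about whether a human would call the pair a good match.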

The Solution: Teaching the Robot to Listen to Humans

The researchers created a new version of the robot called Human-CLAP.

Instead of just letting the robot guess based on its massive library of text and audio, they gave it a "tasting panel." They took a small group of real humans, played them audio clips with text descriptions, and asked them to rate how well the text matched the sound on a scale of 0 to 10.

  • The Old Way: The robot was trained on a massive dataset where it assumed every text and audio pair was a perfect match. It was like studying a dictionary where every definition is correct, even if the examples are weird.
  • The New Way (Human-CLAP): They took a tiny slice of data (only about 1/320th of the original size) but it was "high-quality" data because it had human ratings. They taught the robot: "Hey, when humans give this a low score, you should give it a low score too. When humans love it, you should love it."

How It Works: The "Weighted" Lesson

Think of the training process like a teacher grading a student's homework.

  1. The Old Teacher: The teacher looks at 1,000 homework assignments. If the student gets one right, the teacher gives a gold star. If they get one wrong, the teacher gives a red X. But the teacher treats every assignment as equally trustworthy, even when some of the "correct answers" in the key are obviously wrong.
  2. The New Teacher (Human-CLAP): This teacher looks at the same homework but has a special "Human Score" in hand.
    • If a student gets a question right and the Human Score says "Great job!", the teacher gives a huge gold star.
    • If a student gets a question "right" (mathematically) but the Human Score says "This is actually terrible," the teacher gives a big red X.
    • The teacher uses a special formula (called wSCE) that weighs the lessons based on how much humans actually liked the result.
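The weighted lesson can be sketched as a contrastive cross-entropy loss whose per-pair terms are scaled by human ratings. This is a simplified illustration of the idea only; the exact wSCE formulation in the paper may differ in its details:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def weighted_contrastive_loss(sim: np.ndarray, ratings: np.ndarray) -> float:
    """Toy weighted softmax cross-entropy.

    `sim` is an N x N audio-text similarity matrix whose diagonal holds
    the matched pairs. Each pair's contrastive loss is scaled by its
    normalized human rating, so pairs humans rated highly pull the model
    harder toward agreement. (Sketch only, not the paper's exact loss.)
    """
    n = sim.shape[0]
    probs = softmax(sim)                                   # audio -> text
    per_pair = -np.log(probs[np.arange(n), np.arange(n)])  # matched-pair CE
    weights = ratings / ratings.sum()                      # ratings -> weights
    return float((weights * per_pair).sum())

# Toy batch: 3 audio-text pairs, human ratings on a 0-10 scale.
sim = np.array([[5.0, 1.0, 0.0],
                [1.0, 4.0, 1.0],
                [0.0, 1.0, 3.0]])
ratings = np.array([9.0, 2.0, 7.0])
loss = weighted_contrastive_loss(sim, ratings)
```

Putting more weight on a pair the model already matches well lowers the total loss, while weighting a poorly matched pair raises it, which is exactly the "huge gold star versus big red X" behavior described above.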

The Results: A Much Better Match

The researchers tested their new robot against the old one.

  • The Old Robot: When comparing its scores to human opinions, the correlation was weak (about 0.28). It was like trying to guess the weather by looking at a random clock; sometimes it's right, but mostly it's guessing.
  • The New Robot (Human-CLAP): The correlation jumped up to over 0.45.

What does this mean?
It means the new robot is much better at predicting what a human will think. If a human says, "This AI-generated sound of a dog barking doesn't sound like a real dog," the new robot will agree and give it a low score. The old robot might have said, "It has the word 'dog' in it, so it's a 10/10!"
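The correlation the paper reports can be illustrated with a few toy numbers (these are invented for the example, not from the paper): a metric that tracks human ratings correlates strongly, while one that hands out roughly the same score to everything does not.

```python
import numpy as np

# Hypothetical scores for five generated clips (illustrative only).
human = np.array([2.0, 8.5, 4.0, 9.0, 5.5])       # human ratings
old_clap = np.array([7.0, 6.9, 7.3, 6.8, 7.2])    # flat, quality-blind scores
human_clap = np.array([2.5, 8.0, 4.5, 8.8, 5.0])  # tracks the human ratings

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson correlation; the paper may also use rank correlation."""
    return float(np.corrcoef(x, y)[0, 1])

print(pearson(human, old_clap))    # weak, near zero
print(pearson(human, human_clap))  # strong, near 1
```

A higher correlation means the metric's ranking of sounds agrees with how humans would rank them, which is the whole point of fine-tuning on subjective scores.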

Why This Matters

This is a huge step forward for Text-to-Audio technology (where you type "a cat playing piano" and an AI makes the sound).

Currently, developers use the old robot's score to decide if their AI is getting better. But since the old robot is out of touch with human ears, they might be optimizing for the wrong things. By using Human-CLAP, developers can finally build AI that creates sounds that actually sound good to us, not just to a computer algorithm.

In short: They taught an AI to stop guessing what humans like and start actually listening to what humans say.