Human-CLAP: Human-perception-based contrastive language-audio pretraining

Here is an explanation of the paper "Human-CLAP" using simple language and creative analogies.

The Big Problem: The Robot vs. The Human

Imagine you are a chef trying to invent a new recipe. You have a robot assistant that helps you taste your dishes.

The Robot (Old CLAP): This robot has read millions of cookbooks and knows that "soup" usually goes with "hot." If you show it a picture of a cold bowl of soup, it might say, "This is a perfect match!" because the words "soup" and "hot" are in its database, even if the actual food tastes wrong.
The Human (You): You take a bite and say, "Ew, this is cold. It doesn't match the description."

The problem the researchers found is that the Robot (CLAP) is very good at matching words to sounds based on statistics, but it is terrible at understanding if a human would actually like or agree with that match. In the world of AI music and sound generation, the robot's "score" (called CLAPScore) was often high even when humans thought the sound was garbage.

The Solution: Teaching the Robot to Listen to Humans

The researchers created a new version of the robot called Human-CLAP.

Instead of just letting the robot guess based on its massive library of text and audio, they gave it a "tasting panel." They took a small group of real humans, played them audio clips with text descriptions, and asked them to rate how well the text matched the sound on a scale of 0 to 10.

The Old Way: The robot was trained on a massive dataset where it assumed every text and audio pair was a perfect match. It was like studying a dictionary where every definition is correct, even if the examples are weird.
The New Way (Human-CLAP): They took a tiny slice of data (only about 1/320th of the original size) but it was "high-quality" data because it had human ratings. They taught the robot: "Hey, when humans give this a low score, you should give it a low score too. When humans love it, you should love it."

How It Works: The "Weighted" Lesson

Think of the training process like a teacher grading a student's homework.

The Old Teacher: The teacher looks at 1,000 homework assignments. If the student gets one right, the teacher gives a gold star. If they get one wrong, the teacher gives a red X. But the teacher treats every question as equally important, even if some questions are obvious mistakes.
The New Teacher (Human-CLAP): This teacher looks at the same homework but has a special "Human Score" in hand.
- If a student gets a question right and the Human Score says "Great job!", the teacher gives a huge gold star.
- If a student gets a question "right" (mathematically) but the Human Score says "This is actually terrible," the teacher gives a big red X.
- The teacher uses a special formula (called wSCE) that weighs the lessons based on how much humans actually liked the result.

The Results: A Much Better Match

The researchers tested their new robot against the old one.

The Old Robot: When comparing its scores to human opinions, the correlation was weak (about 0.28). It was like trying to guess the weather by looking at a random clock; sometimes it's right, but mostly it's guessing.
The New Robot (Human-CLAP): The correlation jumped up to over 0.45.

What does this mean?
It means the new robot is much better at predicting what a human will think. If a human says, "This AI-generated sound of a dog barking doesn't sound like a real dog," the new robot will agree and give it a low score. The old robot might have said, "It has the word 'dog' in it, so it's a 10/10!"

Why This Matters

This is a huge step forward for Text-to-Audio technology (where you type "a cat playing piano" and an AI makes the sound).

Currently, developers use the old robot's score to decide if their AI is getting better. But since the old robot is out of touch with human ears, they might be optimizing for the wrong things. By using Human-CLAP, developers can finally build AI that creates sounds that actually sound good to us, not just to a computer algorithm.

In short: They taught an AI to stop guessing what humans like and start actually listening to what humans say.

Here is a detailed technical summary of the paper "Human-CLAP: Human-perception-based contrastive language–audio pretraining" presented at APSIPA ASC 2025.

1. Problem Statement

Context: Contrastive Language–Audio Pretraining (CLAP) models are foundational for connecting audio and text, widely used in tasks like text-to-audio (TTA) generation, audio captioning, and source separation. A common metric derived from these models is CLAPScore, which calculates the cosine similarity between audio and text embeddings to quantify their relevance.

The Gap: While CLAPScore is used as an objective evaluation metric for TTA, its alignment with human subjective perception remains unclarified.

Low Correlation: The authors hypothesize that CLAPScore does not accurately reflect human judgment because standard CLAP models are trained on massive datasets where audio-text pairs are assumed to be perfect matches, ignoring "noisy" data (e.g., text that misses audio details).
Consequence: This misalignment means CLAPScore may fail to distinguish between high-quality and low-quality generated audio from a human perspective, making it an unreliable metric for evaluating TTA systems.

2. Methodology: Human-CLAP

The authors propose Human-CLAP, a fine-tuned CLAP model designed to align embedding spaces with human subjective evaluations.

A. Dataset and Ground Truth

Source: The RELATE dataset, containing subjective evaluation scores for audio-text pairs.
Data Composition: Includes natural audio (from AudioCaps) and synthesized audio (generated by AudioLDM, AudioLDM2, Tango, and Tango2).
Scoring: Human listeners rated pairs on an 11-point scale (0 = no match, 10 = perfect match).
Quality Control: Listeners were screened using "anchor samples" (known mismatched pairs) to ensure reliability. The final dataset used for training consisted of ~4,788 pairs (training + validation), which is approximately 1/320th the size of the original CLAP training data.

B. Model Architecture & Training Strategy

The core innovation lies in the Loss Function, which combines contrastive learning with regression to incorporate human scores ( $a_i$ ) as target weights.

Regression Loss ( $L_{reg}$ ):
- The model minimizes the error between the predicted cosine similarity ( $y_i$ ) and the rescaled human subjective score ( $a_i \in [0, 1]$ ).
- Uses Mean Squared Error (MSE) or Mean Absolute Error (MAE).
- Formula: $L_{reg} = \frac{1}{N}\sum (a_i - y_i)^p$ (where $p=2$ for MSE, $p=1$ for MAE).
Weighted Contrastive Loss ( $L_{wSCE}$ ):
- Standard CLAP uses Symmetric Cross Entropy (SCE) to maximize similarity for pairs and minimize it for non-pairs.
- Human-CLAP modification: The SCE loss is weighted by the human score ( $a_i$ ). Pairs with higher human relevance contribute more to the loss minimization, while low-relevance pairs contribute less.
- Formula: $L_{wSCE} = -\frac{1}{2N} \sum a_i \left( \log \frac{e^{sim_{i,i}/\tau}}{\sum e^{sim_{i,j}/\tau}} + \dots \right)$ .
Total Loss:
- $L = \lambda_1 L_{wSCE} + \lambda_2 L_{reg}$ .
- The model is fine-tuned from a pre-trained LAION CLAP base using this combined objective.

3. Key Contributions

Empirical Analysis: The paper provides the first rigorous investigation into the correlation between CLAPScore and human subjective ratings, demonstrating that the correlation is surprisingly low.
Human-CLAP Model: Proposes a novel fine-tuning framework that uses a small amount of high-quality human subjective data to re-align the CLAP embedding space with human perception.
Efficiency: Demonstrates that significant improvements can be achieved using a tiny fraction of the data required for pre-training (approx. 1/320th of the original dataset).

4. Experimental Results

The authors evaluated the models using Spearman's Rank Correlation Coefficient (SRCC), Linear Correlation Coefficient (LCC), and Kendall's Tau (KTAU).

Baseline Performance:
- Standard LAION CLAP and MS CLAP showed very low correlation with human scores (SRCC $\approx$ 0.28).
- This confirmed that CLAPScore is not a reliable proxy for human judgment.
Human-CLAP Performance:
- The best-performing variant (wSCE + MAE) achieved an SRCC of 0.457 and LCC of 0.481.
- Improvement: This represents an increase in SRCC of >0.17 compared to the baseline (0.280 $\to$ 0.457).
- Error Reduction: The Mean Squared Error (MSE) between predicted scores and human scores dropped from 0.068 (baseline) to 0.057.
Ablation Studies:
- wSCE + MAE outperformed using MSE or MAE alone, and outperformed wSCE alone.
- Low-Scoring Pairs: Models trained with only MAE struggled to assign low scores to low-relevance pairs. The inclusion of wSCE was crucial for the model to correctly identify and penalize mismatched audio-text pairs.
- Generalization: The improvement held true across both natural audio and various synthesized audio models (AudioLDM, Tango, etc.).

5. Significance and Impact

Better Evaluation Metrics: Human-CLAP offers a more reliable, automated metric for evaluating Text-to-Audio generation, reducing the reliance on expensive and time-consuming human listening tests.
Data Efficiency: It proves that high-level human perception alignment can be achieved through fine-tuning on small, curated datasets rather than re-training foundation models from scratch.
Model Alignment: The work highlights the limitations of current "zero-shot" foundation models in capturing nuanced human preferences and provides a practical pathway to align AI metrics with human subjective reality.

In conclusion, the paper successfully bridges the gap between algorithmic similarity scores and human perception, establishing Human-CLAP as a superior metric for audio-text relevance evaluation.

Human-CLAP: Human-perception-based contrastive language-audio pretraining

The Big Problem: The Robot vs. The Human

The Solution: Teaching the Robot to Listen to Humans

How It Works: The "Weighted" Lesson

The Results: A Much Better Match

Why This Matters

1. Problem Statement

2. Methodology: Human-CLAP

A. Dataset and Ground Truth

B. Model Architecture & Training Strategy

3. Key Contributions

4. Experimental Results

5. Significance and Impact

More like this

Adiabatic Capacitive Neuron: An Energy-Efficient Functional Unit for Artificial Neural Networks

Multi-Domain Supervised Contrastive Learning for UAV Radio-Frequency Open-Set Recognition

ACCOR: Attention-Enhanced Complex-Valued Contrastive Learning for Occluded Object Classification Using mmWave Radar IQ Signals

Continuous-Time Analysis of AFDM: Pulse-Shaping, Fundamental Bounds and Impact of Hardware Impairments

Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge