Imagine you are a music producer working in a high-tech studio. You have a team of AI composers (like Suno, MusicGen, or Mureka) that can create songs based on your instructions. Sometimes you say, "Make a sad jazz song." Other times, you say, "Make a song with these specific lyrics," or even, "Make a song that sounds like this recording I'm playing for you."
The problem? How do you know which AI is doing the best job?
In the past, we had simple ways to judge music: "Does it sound good?" or "Does it match the text?" But now, the instructions are getting complicated (mixing text, lyrics, and reference audio), and our old judging tools are like using a ruler to measure temperature—they just don't fit the job.
This paper introduces a new solution called CMI-RewardBench. Think of it as building a super-judge for the AI music world. Here is how they did it, broken down into simple steps:
1. The Problem: The "Lost in Translation" Gap
Imagine you ask an AI to write a song based on a complex recipe: "Use a piano, but make it sound like a rainy day, and include these specific lyrics about a cat."
- Old Judges: They might only check if the piano sounds good (ignoring the lyrics) or if the lyrics are there (ignoring the piano). They are like a food critic who only tastes the salt, ignoring the spice, the texture, and the presentation.
- The Gap: Current AI models are getting very good at following these complex, mixed instructions, but we don't have a reliable way to grade them on all those things at once.
2. The Solution: Building a "Super-Judge" Ecosystem
The authors built a complete system to train and test a new kind of AI judge. They did this in three main parts:
Part A: The Training Ground (The Datasets)
To teach a judge how to be fair, you need thousands of examples of "Good vs. Bad" music.
- The "Pseudo-Label" Gym (CMI-Pref-Pseudo): They used a very capable AI (Qwen3-Omni) to act as a first-pass judge. They generated 110,000 pairs of songs and had the AI compare them. To guard against position bias (the AI simply favoring whichever song it heard first), each pair was judged twice with the order swapped, and only consistent verdicts were kept. This created a massive, cheap training dataset.
- The "Human Expert" Academy (CMI-Pref): They also hired 31 human music experts to listen to 4,000 pairs of songs and give detailed feedback. These humans didn't just say "A is better"; they explained why (e.g., "The rhythm was off," or "The lyrics didn't match the mood"). This is the "gold standard" data.
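The double-check idea from the pseudo-label step can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual pipeline: `judge` stands in for the pairwise AI judge, and here it is faked with a trivial rule so the sketch runs anywhere.

```python
# Hedged sketch of position-swap consistency filtering for pseudo-labels.
# judge() is a placeholder for a real pairwise AI judge; its name, signature,
# and toy scoring rule are assumptions for illustration only.

def judge(prompt, song_a, song_b):
    """Toy judge: returns 'A' or 'B' for whichever song it prefers.
    Faked with string length so the sketch is runnable."""
    return "A" if len(song_a) >= len(song_b) else "B"

def consistent_pseudo_label(prompt, song_a, song_b):
    """Ask the judge twice with the order swapped; keep the verdict only
    if both passes agree, which filters out position-biased answers."""
    first = judge(prompt, song_a, song_b)    # pass 1: A vs B
    second = judge(prompt, song_b, song_a)   # pass 2: B vs A (swapped)
    # Map the swapped verdict back to the original ordering.
    second_unswapped = "A" if second == "B" else "B"
    if first == second_unswapped:
        return first  # consistent preference -> usable pseudo-label
    return None       # inconsistent -> discard this pair

pairs = [("sad jazz", "long-clip-aaaa", "clip-b"),
         ("pop song", "clip-x", "clip-x")]
labels = [consistent_pseudo_label(*p) for p in pairs]
```

In this toy run the first pair survives the swap test and the second is discarded, which is exactly the filtering behavior described above.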
Part B: The Test Track (The Benchmark)
They created a standardized test called CMI-RewardBench. Imagine a driving test for cars, but instead of cars, it's music models.
- The test includes different "obstacles":
- Text-only: "Write a pop song."
- Lyrics-only: "Here are the words, make the music."
- Audio-only: "Make it sound like this clip."
- The "Combo" Challenge: "Use these lyrics, in this style, and match this reference audio."
- They tested existing judges (including giant AI models like Gemini) on this track. Spoiler: even the smartest general-purpose AI models struggled, agreeing with human experts on fewer than 80% of the comparisons. They were too "general" and missed the musical nuances.
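The scoring behind that "less than 80%" figure is a standard agreement rate: for each pair, did the judge pick the same song as the human experts? A minimal sketch, with made-up labels (the benchmark's real data format may differ):

```python
# Toy illustration of judge-vs-human agreement scoring on pairwise picks.
# The labels below are invented for the example, not benchmark data.

human_picks = ["A", "B", "A", "A", "B"]   # gold picks from human experts
judge_picks = ["A", "B", "B", "A", "B"]   # one model judge's picks

matches = sum(h == j for h, j in zip(human_picks, judge_picks))
agreement = matches / len(human_picks)
print(f"judge-human agreement: {agreement:.0%}")  # prints "judge-human agreement: 80%"
```

A judge that always picked the first song would still score around 50% on a balanced benchmark, which is why the position-bias check in Part A matters.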
Part C: The New Champion (CMI-RM)
The authors trained their own specialized judge, called CMI-RM.
- The Secret Sauce: Instead of being a giant, bloated model, they built a small, efficient "musical brain" (only about 30 million parameters) that is specifically designed to listen to text, lyrics, and audio all at the same time.
- The Result: This small, specialized judge outperformed the giant, general-purpose AIs. It learned to understand that "sad jazz" with "lyrics about a cat" is a specific, complex request, not just a sad song.
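The "listen to everything at once" idea can be illustrated as a tiny multimodal reward head: embed each input (text, lyrics, audio), fuse the embeddings, and map them to a single score. Everything below is a toy stand-in; CMI-RM's real encoders and fusion are learned neural components, and the character-sum "embeddings" here exist only so the sketch runs anywhere.

```python
# Toy multimodal reward head: three inputs in, one scalar score out.
# embed() and the weights are placeholders, not the paper's architecture.

def embed(x, dim=4):
    """Stand-in encoder: deterministic pseudo-embedding of a string."""
    total = sum(ord(c) for c in x)
    return [((total * (i + 1)) % 97) / 97.0 for i in range(dim)]

def reward(text_prompt, lyrics, audio_clip, weights=None):
    """Fuse the three modality embeddings (by concatenation here) and
    map them to one scalar score with a linear head."""
    fused = embed(text_prompt) + embed(lyrics) + embed(audio_clip)
    weights = weights or [1.0 / len(fused)] * len(fused)  # "learned" head
    return sum(w * f for w, f in zip(weights, fused))

score = reward("sad jazz", "lyrics about a cat", "reference.wav")
```

The point of the sketch is the shape of the problem: one model sees all three conditions together, so "sad jazz with cat lyrics" is scored as a single joint request rather than three separate checks.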
3. Why This Matters: The "Top-K" Filter
Here is the coolest practical application: Inference-Time Scaling.
Imagine you ask an AI to generate 10 different versions of a song.
- Without a good judge: You have to listen to all 10, or just pick the first one.
- With CMI-RM: You run the 10 songs through the "Super-Judge" instantly. It scores them all. You then pick the Top 3 (or Top 1) best ones.
- The Analogy: It's like having a talent scout at an audition. Instead of hiring the first singer who walks in, the scout quickly rates 10 singers and tells you, "Only hire #3, they are the only one who actually hit the high note." This raises the quality of AI music generation without retraining or improving the generator: you simply select better from the candidates you already produced.
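The selection loop above is the classic best-of-N pattern, which can be sketched in a few lines. `reward_model` is a placeholder for an actual CMI-RM scoring call, and the candidate names and scores are invented for the example:

```python
# Hedged sketch of inference-time (best-of-N) selection with a reward model.
# reward_model() is a stand-in; in practice it would score each clip against
# the text prompt, lyrics, and reference audio.

def reward_model(song):
    """Toy scorer with hard-coded scores for illustration."""
    return {"take1": 0.41, "take2": 0.87, "take3": 0.55,
            "take4": 0.92, "take5": 0.30}.get(song, 0.0)

def best_of_n(candidates, k=1):
    """Score every generated candidate and keep the top-k."""
    ranked = sorted(candidates, key=reward_model, reverse=True)
    return ranked[:k]

candidates = ["take1", "take2", "take3", "take4", "take5"]
top3 = best_of_n(candidates, k=3)
print(top3)  # prints ['take4', 'take2', 'take3']
```

Note that generation cost scales with N while the judging cost is comparatively tiny, which is why a fast, accurate judge makes this trade worthwhile.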
Summary
- The Issue: AI music is getting complex (mixing text, lyrics, and sound), but our tools to judge it are outdated.
- The Fix: The authors created a massive library of human-verified music comparisons and a new, specialized AI judge trained on them.
- The Win: This new judge is better at understanding complex instructions than even the biggest general AI models. It helps us filter out bad AI music and keep the good stuff, making the future of AI-generated music much more reliable and creative.
In short, they built the ultimate music critic that speaks the language of both humans and complex AI instructions, ensuring that when you ask for a song, you actually get the song you imagined.