Imagine you are a chef trying to create the perfect recipe for a "Quality Taste Test." You want to know exactly how humans judge the flavor of a dish when they are eating it and listening to the sizzle of the pan at the same time. This is what Audio-Visual Quality Assessment (AVQA) is all about: figuring out how people rate videos when both the picture and the sound matter.
For a long time, this research was stuck in a tiny, expensive kitchen (a university lab). Only a few people could come in, sit in perfect silence, and taste the food. The result? The "recipes" (datasets) were too small, too boring, and didn't taste like real life.
This paper is like a bold new chef who decides to open the kitchen to the whole world using a massive online crowd, while still making sure the food tastes good. Here is how they did it, explained simply:
1. The Problem: The "Tiny Lab" Trap
Before this, researchers could only test a few hundred videos in a controlled lab. It was like trying to learn how the whole world eats by only asking your immediate family.
- The Issue: The videos were either artificially distorted in the lab or too pristine. Real life is messy (bad microphones, shaky cameras, loud neighbors).
- The Result: The computer models trained on this data were like students who only studied for a test in a quiet library; they failed when they went out into the noisy cafeteria of the real internet.
2. The Solution: The "Global Taste Test" (Crowdsourcing)
The authors decided to ask thousands of regular people on the internet (via Amazon Mechanical Turk) to rate videos. But they knew that asking random people on the internet is risky. Some might be distracted, some might have bad headphones, and some might just click "5 stars" for everything.
To fix this, they built a Smart Guard System with three layers of protection:
- Layer 1: The "Check-In" (Environment Control): Before you can taste, you have to prove you're in a quiet room, using a real computer (not a phone), and wearing headphones. It's like a bouncer checking your ID and making sure you aren't wearing noise-canceling earplugs that block the music.
- Layer 2: The "Training Wheels" (Qualification): You can't just jump into the main event. You have to take a practice test. If your answers are too random or weird compared to the group, you get kicked out. It's like a cooking contest where you have to chop an onion perfectly before you're allowed to cook the main dish.
- Layer 3: The "Double-Check" (Data Filtering): Even after people rate the videos, the system looks at the results. If someone gave every video a "3" or rated them in a weird pattern, their answers are thrown out. They only keep the answers that make sense (a rough sketch of this kind of screening follows below).
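The paper's exact rejection rules aren't spelled out here, but Layer 3 boils down to standard rater screening: compare each worker's scores to the crowd average and drop workers who barely agree with everyone else, or who give nearly the same score to everything. Below is a minimal Python sketch of that idea; the thresholds and array layout are illustrative assumptions, not the authors' actual parameters.

```python
import numpy as np

# ratings[w, v] = score (1..5) that worker w gave to video v.
# The thresholds are illustrative guesses, not the paper's actual values.
MIN_CORRELATION = 0.5   # must roughly agree with the crowd
MIN_STD = 0.3           # must not give (almost) the same score to everything

def screen_workers(ratings: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking which workers' ratings are kept."""
    crowd_mean = ratings.mean(axis=0)              # average score per video
    keep = np.ones(ratings.shape[0], dtype=bool)
    for w, scores in enumerate(ratings):
        if scores.std() < MIN_STD:                 # "gave every video a 3"
            keep[w] = False
        elif np.corrcoef(scores, crowd_mean)[0, 1] < MIN_CORRELATION:
            keep[w] = False                        # weird pattern vs. the group
    return keep

# Tiny demo: worker 2 rates every clip the same and gets filtered out.
demo = np.array([[4, 2, 5, 1],
                 [5, 2, 4, 1],
                 [3, 3, 3, 3]], dtype=float)
print(screen_workers(demo))                        # [ True  True False]
```

Real subjective-testing pipelines use more formal observer-screening procedures, but the core intuition is the same: keep only raters whose answers make sense.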
3. The Result: The "Giant Library" (YT-NTU-AVQ)
Thanks to this system, they created the YT-NTU-AVQ dataset.
- Size: It has 1,620 videos. That's huge compared to the old datasets, which had maybe 50 or 100.
- Variety: These aren't fake movies. They are real YouTube videos (Creative Commons) featuring everything from dancing cats to cooking shows and music performances.
- Depth: Instead of just asking "How good is this?", they asked four questions (a small sketch of how the answers might be stored follows this list):
  - How good is the whole experience?
  - How good is just the picture?
  - How good is just the sound?
  - The Secret Sauce: "Which one did you care about more?" (Did you focus on the singer's voice or the stage lights?)
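To make those four questions concrete, here is one way a single clip's crowd-averaged answers could be stored. The field names below are my own illustration of the idea, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AVQARating:
    """Crowd-averaged ratings for one clip (illustrative fields, not the real schema)."""
    video_id: str
    overall_mos: float   # "How good is the whole experience?" (1-5)
    video_mos: float     # "How good is just the picture?"
    audio_mos: float     # "How good is just the sound?"
    attention: float     # "Which did you care about more?"  0 = all sound, 1 = all picture

# Example: sharp picture, mediocre sound, viewers mostly watching rather than listening.
clip = AVQARating("yt_example_001", overall_mos=4.1, video_mos=4.3,
                  audio_mos=3.2, attention=0.7)
print(clip)
```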
4. The Big Discovery: "The Eyes Have It"
When they analyzed the data, they found something surprising.
- The Finding: Even though people said they were listening to the sound, their final score was almost entirely driven by how good the video looked (a small numerical sketch after this list shows how you could check this from the scores).
- The Analogy: Imagine you are eating a delicious meal, but the plate is chipped and ugly. You might say, "The food tastes great," but you'll still give the whole experience a lower score because the plate ruined the vibe. In these videos, if the picture was blurry, people hated the video, even if the music was perfect. If the picture was great, they forgave a slightly bad sound.
- The Exception: If the sound was terrible (like a loud screech), people finally noticed. But for normal videos, the eyes are the boss.
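One simple way to put "the eyes are the boss" into numbers is to fit a linear model that predicts the overall score from the picture score and the sound score, then compare the two weights. This is a minimal sketch with made-up demo values that merely mimic the reported pattern; it is not the paper's analysis or its data.

```python
import numpy as np

# Made-up demo MOS values that imitate the reported pattern; not real data.
video_mos   = np.array([4.5, 3.0, 2.0, 4.8, 1.5, 3.7])
audio_mos   = np.array([2.0, 4.5, 3.5, 4.0, 4.2, 1.8])
overall_mos = np.array([4.3, 3.1, 2.1, 4.7, 1.7, 3.5])  # tracks the picture closely

# Least-squares fit: overall ~ w_v * video + w_a * audio + bias
X = np.column_stack([video_mos, audio_mos, np.ones_like(video_mos)])
w_v, w_a, bias = np.linalg.lstsq(X, overall_mos, rcond=None)[0]
print(f"picture weight {w_v:.2f} vs. sound weight {w_a:.2f}")
# A picture weight far above the sound weight means "the eyes have it."
```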
Why This Matters
This paper is like building a giant, realistic simulator for AI.
- Before: AI models were trained on tiny, fake datasets. They were like drivers who only learned to drive in an empty parking lot.
- Now: They have a dataset that looks and sounds like the real, messy internet. This helps build AI that can actually understand how humans feel when watching a video on their phone in a noisy coffee shop.
In short: They figured out how to ask thousands of people to judge video quality without losing their minds, built a massive library of real-world videos, and discovered that when it comes to video, what you see is usually more important than what you hear.