Imagine you are a judge in a cooking competition. Your job is to taste two dishes and decide which one looks and tastes better. Now imagine that instead of tasting dishes, you are a computer program trying to judge the quality of images and videos.
For years, we've had these computer judges (called Quality Metrics) like SSIM, LPIPS, and VMAF. They are supposed to act like human eyes, telling us if a video is blurry, if a photo has bad colors, or if a movie looks too grainy. Usually, we test them by asking real humans, "Which one do you like better?" and seeing if the computer agrees.
But here's the problem: Just because a computer gets the "right answer" on a specific test doesn't mean it understands why humans see things the way they do. It might be guessing correctly by accident, or it might be using a completely different (and wrong) logic.
This paper introduces a new way to test these computer judges. Instead of just asking "Did you get the score right?", the authors ask, "Do you see the world like a human eye actually works?"
They use a set of "vision puzzles" based on how our brains and eyes are biologically wired. Think of it as a medical exam for the computer's "eyes."
Here are the four main puzzles they used, explained with simple analogies:
1. The "Whisper vs. Shout" Test (Contrast Detection)
The Human Reality: Our eyes aren't equally sensitive to everything. We are bad at seeing very fine, tightly packed patterns (like individual grains of sand), and also at very broad, gradual changes in brightness. We are actually best at seeing medium-sized details (like the texture of a brick wall). This curve of sensitivity versus detail size is called the Contrast Sensitivity Function (CSF).
The Test: The researchers showed the computer images with patterns of different sizes and brightness levels, asking, "Can you see this?"
The Result:
- Old judges (like PSNR): They act like a robot counting pixels. They think a tiny speck of dust is just as important as a giant blurry smudge. They don't understand that humans ignore tiny specks.
- The "Good" judges (like ColorVideoVDP): These actually mimic the human eye, knowing that medium-sized details are the most important.
- The "Bad" judges (like SSIM): Surprisingly, SSIM is obsessed with tiny, high-frequency details. It's like a judge who cares more about a single grain of salt on a cake than the fact that the cake is burnt.
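The "pixel-counting" blindness of PSNR can be sketched in a few lines of plain NumPy (an illustrative toy, not the paper's actual test stimuli): we add two sinusoidal distortions of equal energy to a gray image, one coarse and one fine, and PSNR scores them identically, even though the CSF says humans see them very differently.

```python
import numpy as np

def psnr(ref, test, peak=1.0):
    """Peak signal-to-noise ratio in dB: a pure pixel-difference score."""
    mse = np.mean((ref - test) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

size = 256
x = np.arange(size) / size            # horizontal coordinate in [0, 1)
ref = np.full((size, size), 0.5)      # uniform mid-gray reference image

amp = 0.05                            # same amplitude for both distortions
low_freq  = ref + amp * np.sin(2 * np.pi * 4  * x)   # 4 cycles across the image
high_freq = ref + amp * np.sin(2 * np.pi * 96 * x)   # 96 cycles across the image

# Equal energy, equal PSNR -- the metric cannot tell the two apart,
# while human sensitivity to the two patterns differs enormously.
print(round(psnr(ref, low_freq), 2), round(psnr(ref, high_freq), 2))
```

A CSF-aware metric would weight these two distortions very differently; a pixel counter cannot.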
2. The "Noise vs. Signal" Test (Contrast Masking)
The Human Reality: Imagine you are trying to hear a whisper in a quiet library (easy). Now imagine trying to hear that same whisper in a loud rock concert (impossible). The loud music "masks" the whisper. In vision, if an image is already busy and textured (like a forest), a small distortion (like a scratch) is harder to see than if that scratch was on a plain white wall.
The Test: The researchers showed the computer a busy, noisy image and asked, "How much of a new scratch do I need to add before you notice it?"
The Result:
- Most judges: They didn't care about the background noise. They thought a scratch was equally visible on a plain wall and a busy forest.
- The "Deep Learning" judges (like LPIPS): These were the surprise stars! Even though they were never taught about "masking," they learned from looking at millions of photos that "busy backgrounds hide small errors." They got this right, just like humans do.
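The masking setup can be sketched in plain NumPy (illustrative patches, not the paper's stimuli): the same noise "scratch" is added to a flat patch and to a busy textured patch. A pixelwise metric like MSE assigns the same error to both, although the texture hides the scratch from a human observer.

```python
import numpy as np

rng = np.random.default_rng(0)
size = 128

flat    = np.full((size, size), 0.5)                      # plain "white wall" patch
texture = 0.5 + 0.2 * rng.standard_normal((size, size))   # busy "forest" patch

scratch = 0.02 * rng.standard_normal((size, size))        # identical distortion

mse_flat    = np.mean(((flat + scratch) - flat) ** 2)
mse_texture = np.mean(((texture + scratch) - texture) ** 2)

# Identical pixelwise error, very different visibility to a human:
print(np.isclose(mse_flat, mse_texture))  # True -- MSE ignores the background
```

A metric that models masking would penalize the scratch on the flat wall far more heavily.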
3. The "Flicker" Test (Temporal Sensitivity)
The Human Reality: If you wave a flashlight back and forth very fast, it looks like a steady light. If you wave it slowly, you see it flickering. Our eyes have a "sweet spot" for flickering; we are most sensitive to a specific speed (about 8 times a second).
The Test: The researchers showed the computer videos that flickered at different speeds.
The Result:
- Most video judges: They are terrible at this. They usually only look at two or three frames at a time, so they can't tell the difference between a slow flicker and a fast one.
- The "Good" judges: Only a few specialized video metrics could actually tell, "Hey, this is flickering at the speed humans hate the most!"
4. The "Big vs. Small" Test (Contrast Constancy)
The Human Reality: This is the weirdest one. If you look at a faint, low-contrast pattern, how visible it looks depends strongly on how fine or coarse its details are. But once a pattern is bright and high-contrast, it looks equally strong whether its details are fine or coarse. Our brains "flatten" our perception of big, obvious things.
The Test: The researchers asked the computer to match the apparent contrast of patterns with different detail sizes, from faint to very strong.
The Result:
- Everyone failed: None of the computer judges understood this. They kept trying to mathematically calculate the difference, failing to realize that for humans, "big and bright" looks the same regardless of size. This is a major blind spot for current technology.
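Why threshold-style models fail here can be sketched with a toy CSF in plain NumPy (the sensitivity curve below is a made-up illustrative shape, not a fitted model): weighting a strong pattern's contrast by sensitivity predicts that fine and coarse patterns should look very different in strength, whereas human observers judge high-contrast patterns as roughly equal across detail sizes.

```python
import numpy as np

def toy_csf(freq):
    """A rough band-pass sensitivity curve peaking near freq = 4
    (illustrative numbers only, not a real fitted CSF)."""
    return freq * np.exp(-freq / 4.0)

contrast = 0.8                            # clearly suprathreshold contrast
for f in (1.0, 4.0, 16.0):                # coarse, medium, fine patterns
    predicted = contrast * toy_csf(f)     # CSF-weighted "perceived" strength
    print(f, round(predicted, 3))         # varies strongly with detail size...
# ...but humans report roughly the same apparent contrast for all three
# once the pattern is well above threshold. That flattening is what no
# current metric models.
```

A metric with contrast constancy built in would apply the CSF near threshold but let its effect fade as contrast grows.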
The Big Takeaway
The authors built a "Vision Gym" to test these computer judges. They found that:
- Old formulas (like SSIM) are often wrong about what humans care about (they love tiny details too much).
- AI judges (Deep Learning) are surprisingly good at some things (like masking) because they learned from real photos, even without being taught the rules.
- Video judges are generally bad at understanding time (flicker).
- Nobody understands how we perceive "big, bright" things (contrast constancy) yet.
Why does this matter?
If you are streaming a movie on Netflix, you want the video to look good without using too much data. If the computer judge thinks a blurry forest is "bad" (when humans can't even see the blur), the system wastes data trying to fix it. If the judge thinks a flickering light is "fine" (when humans hate it), the movie will look terrible.
By using these "vision puzzles," the authors hope to help engineers build better judges that truly understand how human eyes work, leading to better pictures and videos for everyone.