Position: Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered

This position paper argues that the evaluation of modern visual processing systems must shift from a reliance on single-metric benchmarks toward a human-centered, context-aware paradigm to better align with human perception and foster genuine innovation.

Jinfan Hu, Fanghua Yu, Zhiyuan You, Xiang Yin, Hongyu An, Xinqi Lin, Chao Dong, Jinjin Gu

Published 2026-03-10

Imagine you are a chef trying to create the world's most delicious soup. For years, the only way to judge your soup was by measuring its temperature with a thermometer. If the thermometer said "90°C," you were a genius. If it said "85°C," you were a failure.

But here's the problem: a thermometer measures temperature, not flavor.

This is exactly the situation the paper "Evaluation of Visual Processing Should Be Human-Centered, Not Metric-Centered" is describing for the world of computer vision (specifically, fixing blurry or damaged photos).

Here is the breakdown of their argument using simple analogies:

1. The Old Way: The "Thermometer" of Images

For a long time, researchers improved image restoration (making blurry photos sharp) by trying to beat computer scores called PSNR and SSIM.

  • The Analogy: Think of these scores like a ruler measuring how closely a copy matches the original. If you copy a drawing and the lines are 99% identical to the original, you get a high score.
  • The Problem: In the real world, we don't just want a perfect copy; we want a beautiful picture. Sometimes, to make a photo look amazing to a human eye, you have to change it slightly (add a bit of texture, fix a face, make colors pop). But the "ruler" hates change. It thinks, "Hey, you changed the pixels! That's a bad job!" So, it gives you a low score, even though the photo looks great to us.
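The "ruler hates change" point is easy to demonstrate in a few lines. Below is a minimal sketch using NumPy (my illustration, not code from the paper): a small uniform brightness shift that a viewer would barely notice gets a *worse* PSNR than a copy with visible speckle noise, because PSNR only counts pixel differences.

```python
import numpy as np

def psnr(ref, img, max_val=255.0):
    """Peak signal-to-noise ratio: higher = pixels match the reference more closely."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val**2 / mse)

# A smooth horizontal gradient stands in for a real photo.
ref = np.tile(np.arange(0, 256, 4, dtype=np.uint8), (64, 1))

# Perceptually harmless change: brighten the whole image slightly.
bright = np.clip(ref.astype(int) + 10, 0, 255).astype(np.uint8)

# Perceptually visible change: add zero-mean speckle noise.
rng = np.random.default_rng(0)
noisy = np.clip(ref.astype(int) + rng.integers(-5, 6, ref.shape), 0, 255).astype(np.uint8)

print(f"PSNR, brightened: {psnr(ref, bright):.1f} dB")  # scores worse...
print(f"PSNR, noisy:      {psnr(ref, noisy):.1f} dB")   # ...than visible noise
```

To the "ruler," the clean brightened image is the worse restoration, which is exactly the mismatch with human judgment the paper is pointing at.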

2. The New Problem: The "Fake Sharpness" Trap

Recently, computers got smarter. They started using "Generative AI" (like GANs and Diffusion models) to invent new details that weren't in the original photo.

  • The Analogy: Imagine a painter who doesn't just copy a photo but adds realistic fur to a dog or sharp eyes to a person. The picture looks incredible.
  • The Glitch: Even the newer "perceptual" scores (like LPIPS) got confused. They saw all the new "sharp" details and effectively said, "Wow, so much detail! Good score!" even when those details were invented.
  • The Danger: This created a trap. Researchers started "cheating" the system. They would make images too sharp, or add weird, noisy textures just to trick the computer into giving them a higher score. It's like a student memorizing the answer key instead of actually learning the subject. The score goes up, but the actual quality (the "taste" of the soup) gets worse.
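The "fake sharpness" trap can be reproduced with a toy no-reference sharpness score: the variance of the Laplacian, a common blur heuristic. This sketch is my own illustration (not a metric from the paper); aggressive unsharp masking raises the score even though it introduces halos and ringing a human would dislike.

```python
import numpy as np

def laplacian_variance(img):
    """Naive 'sharpness' score: variance of the 4-neighbour Laplacian (interior only)."""
    lap = (img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
           - 4.0 * img[1:-1, 1:-1])
    return lap.var()

def oversharpen(img, amount):
    """Crude unsharp mask: boost the difference from a 4-neighbour average blur."""
    blurred = 0.25 * (np.roll(img, 1, 0) + np.roll(img, -1, 0)
                      + np.roll(img, 1, 1) + np.roll(img, -1, 1))
    return img + amount * (img - blurred)

# A smooth synthetic "photo" with values around [0.2, 0.8].
y, x = np.mgrid[0:64, 0:64].astype(np.float64)
img = 0.5 + 0.3 * np.sin(x / 5.0) * np.cos(y / 7.0)

gamed = oversharpen(img, amount=8.0)  # far beyond any tasteful setting

print(f"score, original:      {laplacian_variance(img):.5f}")
print(f"score, oversharpened: {laplacian_variance(gamed):.5f}")  # higher!
```

The gamed image "wins" on the metric precisely because the metric rewards raw high-frequency energy, not natural-looking detail.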

3. The "One-Size-Fits-All" Mistake

The paper argues that we are trying to judge every type of photo with the same single number.

  • The Analogy: Imagine a single judge at a talent show who only cares about how loud the singer is.
    • If a heavy metal band plays, the judge says, "Great! 10/10!"
    • If a soft jazz singer performs, the judge says, "Terrible! 1/10!"
    • But in reality, the jazz singer might be more beautiful and moving to the audience.
  • The Reality: A cartoon should be judged differently from a photo of a human face or a forest. A treatment that earns a "perfect" score on a cartoon might look terrible on a human face. We need to judge images based on what they are supposed to be, not on a single universal number.

4. The Proposed Solution: "The Human Taste Test"

The authors are calling for a change in how we evaluate these AI models. Instead of just looking at the computer's score, we need to put the human back in the driver's seat.

  • The Analogy: Instead of just checking the thermometer, we need to invite a panel of food critics (humans) to taste the soup.
  • What they want:
    • Context Matters: Ask, "Does this face look real?" "Does this building look stable?" "Does this cartoon look fun?"
    • No More Cheating: Stop rewarding models just for making things "sharper" if it looks unnatural.
    • Better Tools: The computer tools (metrics) need to get smarter. They need to learn to understand what is in the picture (semantics), not just count pixels. They need to realize that a blurry background (shallow depth of field) is a good thing in a portrait, not a mistake to be fixed.
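One simple way a "taste test" like this is often scored (a hypothetical sketch; the model names and votes below are made up, not results from the paper) is a pairwise preference study: raters see two restorations of the same input, pick the one they prefer, and models are ranked by their win rate across all comparisons.

```python
from collections import Counter

# Hypothetical pairwise votes: each (winner, loser) records which of two
# restoration models a human rater preferred for the same input image.
votes = [
    ("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a"),
    ("model_a", "model_c"), ("model_c", "model_a"),
    ("model_b", "model_c"), ("model_b", "model_c"), ("model_c", "model_b"),
]

wins = Counter(winner for winner, _ in votes)
appearances = Counter()
for winner, loser in votes:
    appearances[winner] += 1
    appearances[loser] += 1

# Fraction of comparisons each model won.
win_rate = {m: wins[m] / appearances[m] for m in appearances}
ranking = sorted(win_rate, key=win_rate.get, reverse=True)
print(ranking)  # models ordered by human preference
```

Win rate is the crudest aggregation; real studies often fit a preference model (e.g. Bradley-Terry style) on the same kind of vote data, but the principle is the same: the humans, not a pixel ruler, produce the ranking.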

The Bottom Line

The paper says: "Stop letting the scoreboard run the game."

Right now, AI researchers are so obsessed with beating the computer score that they are forgetting to make images that humans actually enjoy looking at. The authors want to shift the focus back to human perception. The goal isn't to get the highest number on a chart; it's to create images that look beautiful, natural, and useful to real people.

In short: Don't let the robot judge the art. Let the human judge the art, and use the robot only to help, not to decide.