Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

This paper introduces HarmonicEval, a reference-free, multi-criteria evaluation metric for vision-language models that aggregates criterion-wise scores to better align with human judgments across diverse multi-modal tasks, supported by the newly constructed MMHE benchmark containing 18,000 expert human evaluations.

Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

Published Tue, 10 Ma

Imagine you are a teacher grading essays written by a robot that can "see" pictures and describe them.

In the past, if you wanted to grade these robot essays, you had a very specific, one-size-fits-all rubric.

  • If the robot wrote a caption for a photo of a cat, you used a "Cat Rubric."
  • If the robot answered a question about a map, you used a "Map Rubric."

The problem? The "Cat Rubric" was terrible at grading the "Map" answers, and vice versa. It was like trying to measure the length of a swimming pool with a ruler meant for measuring fabric. The existing tools were too rigid and couldn't adapt to the different types of tasks the robot was doing.

This paper introduces a new, smarter way to grade these robots called HarmonicEval, along with a massive new "exam" called MMHE to test it.

Here is the breakdown in simple terms:

1. The Problem: The "One-Size-Fits-All" Trap

Current tools for grading AI text usually give you a single score, like a final grade of "B+." But they don't tell you why it got a B+.

  • Did it get the facts wrong?
  • Was it too wordy?
  • Was the grammar bad?
  • Did it miss important details?

Different tasks care about different things. A "Visual Question Answering" task (answering questions about a picture) needs brevity (short answers) and correctness. An "Image Captioning" task (describing a whole picture) needs completeness (telling the whole story). Old tools often prioritized the wrong things, giving high scores to answers that were grammatically perfect but factually wrong, or answers that were too long-winded.

2. The Solution: HarmonicEval (The "Balanced Scorecard")

The authors created HarmonicEval, which is like a multi-criteria report card instead of a single grade.

Instead of just asking, "How good is this text?", it asks five specific questions:

  1. Correctness: Is the information true?
  2. Completeness: Did it miss any important details?
  3. Clarity: Is it easy to understand?
  4. Fluency: Does it sound natural?
  5. Conciseness: Is it too wordy?

The Magic Ingredient: The "Harmonic" Weighting
Here is the clever part. The system doesn't just average these five scores. A plain average would let a glaring failure hide behind good marks, blending a "5" and a "1" into an unremarkable "3". Instead, it uses a special math trick called Harmonic Weighting.

Think of it like a jury deliberation:

  • If the AI is very confident about the "Correctness" score (it's sure it's right), that score gets a heavy vote.
  • If the AI is shaky and unsure about the "Fluency" score (maybe the text is weird), that score gets a lighter vote.

The system automatically decides which criteria matter most for that specific answer based on how confident the AI feels. This prevents one shaky score from ruining the whole grade, or one perfect score from hiding a major mistake.
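The confidence-weighted idea above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the paper's exact formula: the scores and confidence values below are made up, and the real HarmonicEval derives its confidences from the judge model's token probabilities.

```python
# Illustrative sketch of confidence-weighted score aggregation.
# Each criterion gets a 1-5 score plus a confidence in [0, 1];
# shaky scores get a lighter vote. All numbers here are invented.

def confidence_weighted_aggregate(scores, confidences):
    """Weight each criterion's score by the judge's confidence in it."""
    total_weight = sum(confidences.values())
    return sum(scores[c] * confidences[c] for c in scores) / total_weight

scores = {"correctness": 5, "completeness": 4, "clarity": 4,
          "fluency": 2, "conciseness": 5}
confidences = {"correctness": 0.95, "completeness": 0.80, "clarity": 0.85,
               "fluency": 0.40, "conciseness": 0.90}  # shaky on fluency

overall = confidence_weighted_aggregate(scores, confidences)
plain_mean = sum(scores.values()) / len(scores)
print(round(overall, 2), plain_mean)
```

Notice that the low fluency score, held with low confidence, drags the weighted result down less than it drags down the plain average; a confidently held low score would do the opposite.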

3. The Test: MMHE (The "Grand Exam")

To prove their new grading system works, the authors couldn't just guess. They needed a massive, real-world test.

They built MMHE (Multi-task Multi-criteria Human Evaluation).

  • The Scale: They gathered 18,000 expert human judgments.
  • The Variety: They tested the AI on four different types of tasks (describing images, answering questions, reading documents, and identifying objects).
  • The Criteria: For every single answer, three human experts graded it on all five criteria mentioned above.

This is the first time anyone has created such a huge, detailed dataset that breaks down AI performance by specific criteria across multiple tasks. It's like having a library of 18,000 essays, each graded by three teachers who wrote down exactly what was good and what was bad.

4. The Results: Why It Matters

When they tested HarmonicEval against this massive dataset:

  • Better Alignment: It matched human expert opinions much better than the old tools.
  • Better Feedback: Because it breaks down the score, it can tell a developer, "Your AI is great at being concise, but it keeps making factual errors."
  • Versatility: It worked well on all four tasks, whereas old tools usually failed when you switched from one task to another.

The Big Picture Analogy

Imagine you are hiring a multitasking assistant.

  • Old Method: You ask them to cook dinner, fix a leak, and write a poem. You give them a single score of "7/10." You have no idea if they burned the food, flooded the kitchen, or wrote a terrible poem.
  • HarmonicEval: You give them a report card. "Cooking: 9/10 (Great taste!), Plumbing: 2/10 (Leak is worse!), Poetry: 8/10 (Rhymes well!)."
  • The "Harmonic" part: If you, the grader, are very sure about the plumbing score, you give it a heavy vote. If you are unsure how to rate the poem, you weigh that score less heavily.

In summary: This paper gives us a smarter, more flexible ruler to measure AI. It stops us from using a "one-size-fits-all" grade and starts giving us a detailed, honest report card that helps us fix AI models faster and better.