Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
This paper introduces HarmonicEval, a reference-free metric that evaluates vision-language model outputs by scoring each criterion individually and aggregating the criterion-wise scores, yielding closer alignment with human judgments across diverse multi-modal tasks. The metric is validated on the newly constructed MMHE benchmark, which contains 18,000 expert human evaluations.