Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach

This paper addresses the limitations of existing visual emotion evaluation methods for Multimodal Large Language Models (MLLMs) by proposing an open-vocabulary, automated Emotion Statement Judgment framework. Evaluation with this framework shows that current models are strong at context-based interpretation but still lag well behind humans at understanding subjective perception.

Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou

Published 2026-03-03

Imagine you have a very smart, well-read robot friend (a Multimodal Large Language Model, or MLLM) who can look at pictures and talk about them. You want to know: Does this robot actually "feel" what the picture is about, or is it just guessing?

For a long time, scientists tried to test this by asking the robot simple questions like, "Is this picture happy or sad?" or "Why is the person in the photo crying?" But the results were confusing. Sometimes the robot seemed great; other times, it failed miserably.

The authors of this paper argue that the problem isn't the robot's brain—it's the test. The old tests were like trying to measure a fish's ability to climb a tree. They were too rigid, too limited, and didn't account for the messy, complicated way humans actually feel emotions.

Here is a simple breakdown of their new solution, using some everyday analogies:

1. The Old Way: The "Multiple Choice" Trap

Imagine you show a picture of a firefighter running into a burning building.

  • The Old Test: Asks, "What emotion is this?" and gives you a list: Happy, Sad, Angry, Scared.
  • The Problem: The robot might say "Brave" or "Urgent." But the test says, "Wrong! Those aren't on the list!"
  • The Flaw: Human emotions aren't just a list of checkboxes. They depend on the story, the context, and who is looking at the picture. The old tests forced the robot into a box that didn't fit.
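To see the trap in code form, here is a tiny, hypothetical sketch of a closed-vocabulary test. The label list and scoring rule are illustrative assumptions, not the actual benchmarks the paper critiques, but they show why any answer outside the list is automatically marked wrong.

```python
# Hypothetical closed-vocabulary emotion test: anything outside the
# fixed label list is scored as incorrect, even if it is reasonable.
FIXED_LABELS = {"happy", "sad", "angry", "scared"}

def score_closed_vocab(model_answer: str, gold_label: str) -> bool:
    answer = model_answer.strip().lower()
    # "brave" or "urgent" can never match: the vocabulary itself rules them out.
    return answer in FIXED_LABELS and answer == gold_label

print(score_closed_vocab("brave", "scared"))   # False, even though "brave" is plausible here
print(score_closed_vocab("scared", "scared"))  # True
```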

2. The New Idea: The "True or False" Game (ESJ)

The authors propose a new game called Emotion Statement Judgment (ESJ). Instead of asking the robot to guess the emotion, they give it a statement and ask, "Is this true or false?"

Think of it like a teacher reading a sentence to a student and asking, "Does this sentence make sense for this picture?"

  • Statement A: "This picture makes people feel brave because a firefighter is saving a family." -> True.
  • Statement B: "This picture makes a 35-year-old firefighter feel only fear, not urgency." -> False. (He might feel both, or perhaps a sense of duty.)

This is better because:

  • It allows for nuance (it's not just "sad" or "happy").
  • It checks context (what's happening in the story?).
  • It checks subjectivity (how would different people feel?).
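Here is a minimal sketch of what an ESJ-style item and scoring loop could look like, assuming a simple image/statement/label layout. The field names, prompt wording, and the `model.judge` call are illustrative assumptions, not the authors' actual format or API.

```python
from dataclasses import dataclass

@dataclass
class ESJItem:
    image_path: str   # the picture being judged
    statement: str    # an open-vocabulary emotion statement about it
    label: bool       # True if the statement fits the image, else False

# Illustrative items mirroring the firefighter example above.
items = [
    ESJItem("firefighter.jpg",
            "This picture makes people feel brave because a firefighter is saving a family.",
            True),
    ESJItem("firefighter.jpg",
            "This picture makes a 35-year-old firefighter feel only fear, not urgency.",
            False),
]

def evaluate(model, items):
    """Ask the model True/False for each statement and report accuracy."""
    correct = 0
    for item in items:
        prompt = f"Statement: {item.statement}\nAnswer True or False."
        verdict = model.judge(item.image_path, prompt)  # hypothetical MLLM call
        correct += (verdict == item.label)
    return correct / len(items)
```

The key design choice is that the statement carries the vocabulary, so the model is judged on whether it agrees with a nuanced claim about mood, context, or a specific viewer, rather than on picking one word from a fixed list.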

3. The Factory: INSETS (The "Emotion Assembly Line")

To make this game possible, they needed millions of these "True/False" statements. Writing them by hand would take a lifetime. So, they built a robot factory called INSETS.

  • Step 1: The Mood Ring. They showed thousands of pictures to many different AI robots and asked, "What words describe the mood?" The robots shouted out thousands of words (like "festive," "anxious," "euphoric").
  • Step 2: The Librarian. They took all those words and organized them into a giant, structured library of emotions (based on a famous psychologist's map of feelings).
  • Step 3: The Storyteller. The system then automatically wrote thousands of "True" and "False" sentences based on those words, covering different angles: the mood, the story, the setting, and how different people might feel.

The result is a massive library called INSETS-462k (462,000 examples) and a final, high-quality test called MVEI.
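For a rough sense of how such an assembly line could be wired together, here is a minimal sketch of the three steps. The function names, the toy taxonomy lookup, the statement templates, and the way false statements are sampled are all assumptions for illustration; the real INSETS pipeline relies on MLLM annotators and a psychology-grounded emotion map rather than these stand-ins.

```python
import random

def collect_mood_words(image, annotator_models):
    """Step 1 (the Mood Ring): ask several models for open-vocabulary mood
    words describing the image. `describe_mood` is a hypothetical MLLM call."""
    words = set()
    for model in annotator_models:
        words.update(model.describe_mood(image))  # e.g. {"festive", "anxious"}
    return words

def organize_words(words, taxonomy):
    """Step 2 (the Librarian): group raw words under a structured emotion
    taxonomy; a plain dict stands in for the psychology-based map here."""
    return {word: taxonomy.get(word, "uncategorized") for word in words}

def write_statements(image_words, full_vocabulary):
    """Step 3 (the Storyteller): template True statements from the image's own
    mood words, and False ones from words the image does not evoke."""
    negatives = [w for w in full_vocabulary if w not in image_words]
    statements = []
    for word in image_words:
        statements.append((f"This image evokes a {word} mood.", True))
        if negatives:
            statements.append(
                (f"This image evokes a {random.choice(negatives)} mood.", False))
    return statements
```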

4. The Results: The Robot is Smart, But Not Human

They ran the test on the world's best AI models (like GPT-4o).

  • The Good News: The robots are getting much better! They are great at explaining why a picture makes them feel a certain way (e.g., "The dark clouds make it feel gloomy").
  • The Bad News: They still struggle with Subjectivity.
    • Analogy: If you show a picture of a scary movie scene, a human knows that a horror fan might feel "excited," while a young child might feel "terrified." The robot often misses this. It tends to think everyone feels the exact same way.
    • Even the smartest robots lag behind humans by a significant margin, especially when it comes to guessing how specific people (like a firefighter or a teacher) would react.

5. Why This Matters

This paper is like building a better driver's license test for AI.

  • Before, the test was just "Can you steer the car?" (Basic emotion recognition).
  • Now, the test is "Can you navigate traffic, read the road signs, and understand how other drivers might react?" (Context, nuance, and subjectivity).

The authors hope that by using this new, fairer test, developers can build AI that doesn't just "see" pictures, but truly understands the emotional stories behind them.

In a nutshell: We stopped asking the robot to pick a color from a tiny box and started asking it to read a sentence and tell us if it makes sense. This revealed that while robots are getting smarter, they still have a lot to learn about the messy, personal, and complex world of human feelings.