Customizing Visual Emotion Evaluation for MLLMs: An Open-vocabulary, Multifaceted, and Scalable Approach

This paper addresses the limitations of existing visual emotion evaluation methods for Multimodal Large Language Models (MLLMs) by proposing an open-vocabulary, automated Emotion Statement Judgment framework. Evaluation with this framework shows that current models are strong at context-based interpretation but still lag well behind humans at understanding subjective perception.

Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou

Published 2026-03-03

Imagine you have a very smart, well-read robot friend (a Multimodal Large Language Model, or MLLM) who can look at pictures and talk about them. You want to know: Does this robot actually "feel" what the picture is about, or is it just guessing?

For a long time, scientists tried to test this by asking the robot simple questions like, "Is this picture happy or sad?" or "Why is the person in the photo crying?" But the results were confusing. Sometimes the robot seemed great; other times, it failed miserably.

The authors of this paper argue that the problem isn't the robot's brain—it's the test. The old tests were like trying to measure a fish's ability to climb a tree. They were too rigid, too limited, and didn't account for the messy, complicated way humans actually feel emotions.

Here is a simple breakdown of their new solution, using some everyday analogies:

1. The Old Way: The "Multiple Choice" Trap

Imagine you show a picture of a firefighter running into a burning building.

  • The Old Test: Asks, "What emotion is this?" and gives you a list: Happy, Sad, Angry, Scared.
  • The Problem: The robot might say "Brave" or "Urgent." But the test says, "Wrong! Those aren't on the list!"
  • The Flaw: Human emotions aren't just a list of checkboxes. They depend on the story, the context, and who is looking at the picture. The old tests forced the robot into a box that didn't fit.
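To see the trap in code form, here is a tiny, hypothetical sketch of a closed-vocabulary test. The label list and scoring rule are illustrative assumptions, not the actual benchmarks the paper critiques, but they show why any answer outside the list is automatically marked wrong.

```python
# Hypothetical closed-vocabulary emotion test: anything outside the
# fixed label list is scored as incorrect, even if it is reasonable.
FIXED_LABELS = {"happy", "sad", "angry", "scared"}

def score_closed_vocab(model_answer: str, gold_label: str) -> bool:
    answer = model_answer.strip().lower()
    # "brave" or "urgent" can never match: the vocabulary itself rules them out.
    return answer in FIXED_LABELS and answer == gold_label

print(score_closed_vocab("brave", "scared"))   # False, even though "brave" is plausible here
print(score_closed_vocab("scared", "scared"))  # True
```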

2. The New Idea: The "True or False" Game (ESJ)

The authors propose a new game called Emotion Statement Judgment (ESJ). Instead of asking the robot to guess the emotion, they give it a statement and ask, "Is this true or false?"

Think of it like a teacher reading a sentence to a student and asking, "Does this sentence make sense for this picture?"

  • Statement A: "This picture makes people feel brave because a firefighter is saving a family." -> True.
  • Statement B: "This picture makes a 35-year-old firefighter feel only fear, not urgency." -> False. (He might feel both, or perhaps a sense of duty.)

This is better because:

  • It allows for nuance (it's not just "sad" or "happy").
  • It checks context (what's happening in the story?).
  • It checks subjectivity (how would different people feel?).
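Here is a minimal sketch of what an ESJ-style item and scoring loop could look like, assuming a simple image/statement/label layout. The field names, prompt wording, and the `model.judge` call are illustrative assumptions, not the authors' actual format or API.

```python
from dataclasses import dataclass

@dataclass
class ESJItem:
    image_path: str   # the picture being judged
    statement: str    # an open-vocabulary emotion statement about it
    label: bool       # True if the statement fits the image, else False

# Illustrative items mirroring the firefighter example above.
items = [
    ESJItem("firefighter.jpg",
            "This picture makes people feel brave because a firefighter is saving a family.",
            True),
    ESJItem("firefighter.jpg",
            "This picture makes a 35-year-old firefighter feel only fear, not urgency.",
            False),
]

def evaluate(model, items):
    """Ask the model True/False for each statement and report accuracy."""
    correct = 0
    for item in items:
        prompt = f"Statement: {item.statement}\nAnswer True or False."
        verdict = model.judge(item.image_path, prompt)  # hypothetical MLLM call
        correct += (verdict == item.label)
    return correct / len(items)
```

The key design choice is that the statement carries the vocabulary, so the model is judged on whether it agrees with a nuanced claim about mood, context, or a specific viewer, rather than on picking one word from a fixed list.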

3. The Factory: INSETS (The "Emotion Assembly Line")

To make this game possible, they needed millions of these "True/False" statements. Writing them by hand would take a lifetime. So, they built a robot factory called INSETS.

  • Step 1: The Mood Ring. They showed thousands of pictures to many different AI robots and asked, "What words describe the mood?" The robots shouted out thousands of words (like "festive," "anxious," "euphoric").
  • Step 2: The Librarian. They took all those words and organized them into a giant, structured library of emotions (based on a famous psychologist's map of feelings).
  • Step 3: The Storyteller. The system then automatically wrote thousands of "True" and "False" sentences based on those words, covering different angles: the mood, the story, the setting, and how different people might feel.

The result is a massive library called INSETS-462k (462,000 examples) and a final, high-quality test called MVEI.
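For a rough sense of how such an assembly line could be wired together, here is a minimal sketch of the three steps. The function names, the toy taxonomy lookup, the statement templates, and the way false statements are sampled are all assumptions for illustration; the real INSETS pipeline relies on MLLM annotators and a psychology-grounded emotion map rather than these stand-ins.

```python
import random

def collect_mood_words(image, annotator_models):
    """Step 1 (the Mood Ring): ask several models for open-vocabulary mood
    words describing the image. `describe_mood` is a hypothetical MLLM call."""
    words = set()
    for model in annotator_models:
        words.update(model.describe_mood(image))  # e.g. {"festive", "anxious"}
    return words

def organize_words(words, taxonomy):
    """Step 2 (the Librarian): group raw words under a structured emotion
    taxonomy; a plain dict stands in for the psychology-based map here."""
    return {word: taxonomy.get(word, "uncategorized") for word in words}

def write_statements(image_words, full_vocabulary):
    """Step 3 (the Storyteller): template True statements from the image's own
    mood words, and False ones from words the image does not evoke."""
    negatives = [w for w in full_vocabulary if w not in image_words]
    statements = []
    for word in image_words:
        statements.append((f"This image evokes a {word} mood.", True))
        if negatives:
            statements.append(
                (f"This image evokes a {random.choice(negatives)} mood.", False))
    return statements
```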

4. The Results: The Robot is Smart, But Not Human

They ran the test on the world's best AI models (like GPT-4o).

  • The Good News: The robots are getting much better! They are great at explaining why a picture makes them feel a certain way (e.g., "The dark clouds make it feel gloomy").
  • The Bad News: They still struggle with Subjectivity.
    • Analogy: If you show a picture of a scary movie scene, a human knows that a horror fan might feel "excited," while a young child might feel "terrified." The robot often misses this. It tends to think everyone feels the exact same way.
    • Even the smartest robots lag behind humans by a significant margin, especially when it comes to guessing how specific people (like a firefighter or a teacher) would react.

5. Why This Matters

This paper is like building a better driver's license test for AI.

  • Before, the test was just "Can you steer the car?" (Basic emotion recognition).
  • Now, the test is "Can you navigate traffic, read the road signs, and understand how other drivers might react?" (Context, nuance, and subjectivity).

The authors hope that by using this new, fairer test, developers can build AI that doesn't just "see" pictures, but truly understands the emotional stories behind them.

In a nutshell: We stopped asking the robot to pick a color from a tiny box and started asking it to read a sentence and tell us if it makes sense. This revealed that while robots are getting smarter, they still have a lot to learn about the messy, personal, and complex world of human feelings.