Imagine you are teaching a robot how to be a good neighbor. You don't just want it to know how to walk or talk; you want it to know when to knock on a door, when to pick up trash, and when to offer its seat to an elderly person. These unwritten rules of society are called social norms.
This paper is like a report card for five different "super-brains" (AI models) to see how good they are at learning these rules, both by reading stories about them and by looking at pictures of them.
Here is the breakdown of their "school day":
1. The Test: Reading vs. Watching
The researchers gave these AI brains two types of homework:
- The Reading Test: They gave the AIs short stories (like little comic scripts) describing social situations.
- The Visual Test: They turned those same stories into four-panel comic strips and asked the AIs to look at the pictures and explain what was happening.
The questions were tricky. They didn't just ask, "Did someone break a rule?" They asked deeper questions like:
- "Did the person get praised for doing the right thing?"
- "Did they get scolded for doing the wrong thing?"
- "Did someone get in trouble for not scolding the rule-breaker?" (This is a very complex rule called a meta-norm—basically, a rule about enforcing rules).
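The "school day" above boils down to a simple two-condition loop: ask the same norm questions once against the story text and once against the comic panels, then count correct answers in each condition. Here is a minimal runnable sketch of that idea. It is not the paper's actual harness: `ask_model` is a hypothetical stand-in for a real vision-language model API (stubbed here so the sketch runs), and the question wording and scenario format are illustrative assumptions.

```python
# Minimal sketch of the two-condition benchmark loop (text vs. image).

QUESTIONS = [
    "Did someone break a social rule?",
    "Was anyone praised for doing the right thing?",
    "Was anyone scolded for doing the wrong thing?",
    "Was anyone sanctioned for failing to scold the rule-breaker?",  # meta-norm
]

def ask_model(prompt, image=None):
    """Hypothetical model call; a real harness would hit a VLM API here."""
    return "yes"  # stub answer so the sketch is self-contained and runnable

def run_benchmark(scenarios):
    """Score each scenario twice: once from the story, once from the panels."""
    results = {"text": 0, "image": 0, "total": 0}
    for story, panels, gold_answers in scenarios:
        for question, gold in zip(QUESTIONS, gold_answers):
            results["total"] += 1
            if ask_model(f"{story}\n{question}") == gold:
                results["text"] += 1
            if ask_model(question, image=panels) == gold:
                results["image"] += 1
    return results

# One illustrative scenario: a littering story with four comic panels.
demo = [("Ana litters; Ben scolds her; no one praises Ben.",
         ["panel1.png", "panel2.png", "panel3.png", "panel4.png"],
         ["yes", "no", "yes", "no"])]
print(run_benchmark(demo))  # → {'text': 2, 'image': 2, 'total': 4}
```

With a real model behind `ask_model`, the per-condition accuracies are just `results["text"] / results["total"]` and `results["image"] / results["total"]`; comparing those two numbers is exactly the "text vs. image gap" discussed below.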
2. The Students (The AI Models)
Five different AI "students" took the test:
- GPT-4o: The top student, known for being very smart and versatile.
- Qwen-2.5VL: A free-to-use model that turned out to be a very strong runner-up.
- Gemini 2.0 Flash: A fast model, but a bit inconsistent.
- Intern-VL3: A solid performer, but not the best.
- Meta LLaMa-4 Maverick: The student who struggled the most, especially with pictures.
3. The Results: The "Text vs. Image" Gap
Here is the big surprise, like finding out a student is a genius at math but terrible at art class:
- Reading was easy: When the AIs had to read the stories, they were almost perfect. GPT-4o got a 98.75% score! It was like they were reading a novel and understanding every nuance of human behavior.
- Pictures were harder: When the AIs had to look at the comic strips, their scores dropped. GPT-4o still did well (92.5%), but others struggled more. It seems the AIs are great at understanding words, but they sometimes get confused by what is happening in a drawing. They might miss a subtle facial expression or a gesture that changes the meaning of the scene.
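The drop looks modest in accuracy points, but a quick back-of-envelope calculation with the two GPT-4o scores quoted above shows it is larger than it first appears once you compare error rates instead of accuracies:

```python
# GPT-4o accuracies reported for the two conditions (percent).
text_acc = 98.75    # reading the story scripts
image_acc = 92.5    # reading the four-panel comic strips

gap = text_acc - image_acc          # 6.25-point drop in accuracy
text_err = 100 - text_acc           # 1.25% of text questions missed
image_err = 100 - image_acc         # 7.5% of image questions missed
error_ratio = image_err / text_err  # errors multiply by 6x on images

print(f"gap: {gap:.2f} points, error ratio: {error_ratio:.1f}x")
# → gap: 6.25 points, error ratio: 6.0x
```

In other words, the same model makes six times as many mistakes when it has to "see" the scene rather than read about it.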
4. The Tricky Questions
The AIs found some specific rules very hard to grasp:
- The "Meta-Norm" Trap: The hardest question was about "punishing the people who didn't punish the rule-breaker." Imagine a teacher scolding a student for not reporting a bully. The AIs got very confused here. It's like explaining why a referee gets criticized for not calling a foul: a rule about enforcing rules adds an extra layer of logic, and the models often lost track of it.
- The "Praise" Problem: In the comic strips, it was hard for the AIs to tell if someone was being praised just by looking at the picture. They could easily spot a scolding (someone looking angry), but a "thumbs up" or a smile was often missed.
5. The Takeaway
What does this mean for the future?
If we want to build robots that can walk into a room and know exactly how to behave without being hand-programmed with a million specific rules, AI models like these will have to learn those rules on their own, from both words and images.
- The Good News: These models are already very good at understanding social rules when they read about them. GPT-4o is the current champion, but Qwen-2.5VL is a fantastic, free alternative that researchers can use right now.
- The Bad News: They still get confused when looking at complex pictures, especially when the rules get layered and complicated.
In a nutshell: These AI models are like brilliant students who can read a textbook on etiquette perfectly but sometimes trip over the actual social dance when they see it in real life. The researchers are now working on helping them get better at "seeing" the rules, not just reading them, so our future robots can be truly polite, safe, and socially aware neighbors.