Imagine you are teaching a robot how to be a good neighbor. You don't just want it to know how to walk or talk; you want it to know when to knock on a door, when to pick up trash, and when to offer its seat to an elderly person. These unwritten rules of society are called social norms.
This paper is like a report card for five different "super-brains" (AI models) to see how good they are at learning these rules, both by reading stories about them and by looking at pictures of them.
Here is the breakdown of their "school day":
1. The Test: Reading vs. Watching
The researchers gave these AI brains two types of homework:
- The Reading Test: They gave the AIs short stories (like little comic scripts) describing social situations.
- The Visual Test: They turned those same stories into four-panel comic strips and asked the AIs to look at the pictures and explain what was happening.
The questions were tricky. They didn't just ask, "Did someone break a rule?" They asked deeper questions like:
- "Did the person get praised for doing the right thing?"
- "Did they get scolded for doing the wrong thing?"
- "Did someone get in trouble for not scolding the rule-breaker?" (This is a very complex rule called a meta-norm—basically, a rule about enforcing rules).
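The "school day" above boils down to a simple two-condition loop: ask the same norm questions once against the story text and once against the comic panels, then count correct answers in each condition. Here is a minimal runnable sketch of that idea. It is not the paper's actual harness: `ask_model` is a hypothetical stand-in for a real vision-language model API (stubbed here so the sketch runs), and the question wording and scenario format are illustrative assumptions.

```python
# Minimal sketch of the two-condition benchmark loop (text vs. image).

QUESTIONS = [
    "Did someone break a social rule?",
    "Was anyone praised for doing the right thing?",
    "Was anyone scolded for doing the wrong thing?",
    "Was anyone sanctioned for failing to scold the rule-breaker?",  # meta-norm
]

def ask_model(prompt, image=None):
    """Hypothetical model call; a real harness would hit a VLM API here."""
    return "yes"  # stub answer so the sketch is self-contained and runnable

def run_benchmark(scenarios):
    """Score each scenario twice: once from the story, once from the panels."""
    results = {"text": 0, "image": 0, "total": 0}
    for story, panels, gold_answers in scenarios:
        for question, gold in zip(QUESTIONS, gold_answers):
            results["total"] += 1
            if ask_model(f"{story}\n{question}") == gold:
                results["text"] += 1
            if ask_model(question, image=panels) == gold:
                results["image"] += 1
    return results

# One illustrative scenario: a littering story with four comic panels.
demo = [("Ana litters; Ben scolds her; no one praises Ben.",
         ["panel1.png", "panel2.png", "panel3.png", "panel4.png"],
         ["yes", "no", "yes", "no"])]
print(run_benchmark(demo))  # → {'text': 2, 'image': 2, 'total': 4}
```

With a real model behind `ask_model`, the per-condition accuracies are just `results["text"] / results["total"]` and `results["image"] / results["total"]`; comparing those two numbers is exactly the "text vs. image gap" discussed below.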
2. The Students (The AI Models)
Five different AI "students" took the test:
- GPT-4o: The top student, known for being very smart and versatile.
- Qwen-2.5VL: A free-to-use model that turned out to be a very strong runner-up.
- Gemini 2.0 Flash: A fast model, but a bit inconsistent.
- Intern-VL3: A solid performer, but not the best.
- Meta LLaMa-4 Maverick: The student who struggled the most, especially with pictures.
3. The Results: The "Text vs. Image" Gap
Here is the big surprise, like finding out a student is a genius at math but terrible at art class:
- Reading was easy: When the AIs had to read the stories, they were almost perfect. GPT-4o got a 98.75% score! It was like they were reading a novel and understanding every nuance of human behavior.
- Pictures were harder: When the AIs had to look at the comic strips, their scores dropped. GPT-4o still did well (92.5%), but others struggled more. It seems the AIs are great at understanding words, but they sometimes get confused by what is happening in a drawing. They might miss a subtle facial expression or a gesture that changes the meaning of the scene.
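The drop looks modest in accuracy points, but a quick back-of-envelope calculation with the two GPT-4o scores quoted above shows it is larger than it first appears once you compare error rates instead of accuracies:

```python
# GPT-4o accuracies reported for the two conditions (percent).
text_acc = 98.75    # reading the story scripts
image_acc = 92.5    # reading the four-panel comic strips

gap = text_acc - image_acc          # 6.25-point drop in accuracy
text_err = 100 - text_acc           # 1.25% of text questions missed
image_err = 100 - image_acc         # 7.5% of image questions missed
error_ratio = image_err / text_err  # errors multiply by 6x on images

print(f"gap: {gap:.2f} points, error ratio: {error_ratio:.1f}x")
# → gap: 6.25 points, error ratio: 6.0x
```

In other words, the same model makes six times as many mistakes when it has to "see" the scene rather than read about it.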
4. The Tricky Questions
The AIs found some specific rules very hard to grasp:
- The "Meta-Norm" Trap: The hardest question was about "punishing the people who didn't punish the rule-breaker." Imagine a teacher scolding a student for not reporting a bully. The AIs got very confused here. It's like explaining why a referee gets criticized for not calling a foul: a rule about enforcing rules adds an extra layer of logic, and the models often lost track of it.
- The "Praise" Problem: In the comic strips, it was hard for the AIs to tell if someone was being praised just by looking at the picture. They could easily spot a scolding (someone looking angry), but a "thumbs up" or a smile was often missed.
5. The Takeaway
What does this mean for the future?
If we want to build robots that can walk into a room and know exactly how to behave without being hand-programmed with a million specific rules, AI models like these will have to learn those rules on their own, from both words and images.
- The Good News: These models are already very good at understanding social rules when they read about them. GPT-4o is the current champion, but Qwen-2.5VL is a fantastic, free alternative that researchers can use right now.
- The Bad News: They still get confused when looking at complex pictures, especially when the rules get layered and complicated.
In a nutshell: These AI models are like brilliant students who can read a textbook on etiquette perfectly but sometimes trip over the actual social dance when they see it in real life. The researchers are now working on helping them get better at "seeing" the rules, not just reading them, so our future robots can be truly polite, safe, and socially aware neighbors.