SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

The paper introduces SocialOmni, a benchmark designed to evaluate the social interactivity of omni-modal large language models. It assesses their ability to handle speaker identification, interruption timing, and natural interruption generation, and reveals a significant gap between perceptual accuracy and conversational competence in current models.

Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji

Published 2026-03-18

Imagine you are at a lively dinner party with friends. You aren't just listening to what people say; you are watching their faces, noticing who is looking at whom, sensing when someone is about to finish a sentence, and deciding exactly when to jump in with a joke or a comment without being rude.

Now, imagine teaching a robot to do the same thing.

This paper introduces SocialOmni, a new "exam" designed to test if AI models (specifically "Omni" models that can see, hear, and speak) are actually good at this social dance, or if they are just really good at taking a test.

Here is the breakdown of what the paper is about, using simple analogies:

1. The Problem: The "Robot Who Answers Too Late"

Current AI models are like brilliant students who are terrible at conversation.

  • If you ask them a question about a video ("Who is speaking?"), they get the answer right 90% of the time.
  • But in a real conversation, they might interrupt you while you're still talking, or wait so long to answer that the topic has already changed.
  • Existing tests only check if the AI knows the facts. They don't check if the AI knows the social rules.

2. The Solution: The "SocialOmni" Exam

The researchers created a new benchmark called SocialOmni. Think of it as a driving test for conversation. Instead of just asking, "Do you know the traffic laws?" (facts), they put the AI behind the wheel in real traffic to see if it can actually drive without crashing.

The exam tests three specific skills, which the authors call Who, When, and How:

🗣️ Who: The "Spot the Speaker" Game

  • The Task: In a video with three people talking, the AI has to identify exactly who is making a sound at a specific second.
  • The Twist: Sometimes the camera shows Person A, but the voice belongs to Person B (who is off-screen).
  • The Result: Many AIs get tricked. They see the face and assume that's who is talking, ignoring the voice. It's like a robot seeing a puppet and thinking the puppet is the one speaking, not the person behind it.
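The scoring for a task like this is conceptually simple: at each annotated timestamp, did the model name the right speaker? Here is a minimal sketch; the field names (`t`, `gold`, `pred`) are illustrative assumptions, not the paper's actual data format.

```python
# Hedged sketch of scoring a "Who"-style task. Each item holds a
# timestamp, the ground-truth speaker (gold), and the model's guess
# (pred). Field names are assumptions for illustration only.

def speaker_id_accuracy(items):
    """Fraction of timestamps where the predicted speaker matches."""
    if not items:
        return 0.0
    correct = sum(1 for it in items if it["pred"] == it["gold"])
    return correct / len(items)

# Tiny example: at t=12.5 the camera shows person "A", but the voice
# belongs to off-screen person "B" -- the model follows the face.
items = [
    {"t": 3.0,  "gold": "A", "pred": "A"},
    {"t": 12.5, "gold": "B", "pred": "A"},  # fooled by the visible face
    {"t": 20.0, "gold": "C", "pred": "C"},
]
print(speaker_id_accuracy(items))  # 2 of 3 correct
```

The "puppet" failure mode shows up as exactly this kind of error: the prediction matches the visible face rather than the audible voice.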

⏱️ When: The "Perfect Timing" Game

  • The Task: The AI has to decide the exact moment to interrupt or take a turn in the conversation.
  • The Challenge: If you interrupt too early, you are rude. If you wait too long, the moment has passed, and the conversation feels awkward.
  • The Result: The paper found that some AIs are aggressive (they interrupt constantly, like a toddler), while others are too shy (they wait so long they miss the chance to speak). Very few models found the "sweet spot."
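One common way to score timing tasks like this is to treat it as event detection: a predicted interruption counts as a hit if it falls within a small tolerance window of a gold turn-taking moment, then combine precision and recall. This is a sketch of that idea, not the paper's actual protocol; the 0.5-second tolerance and the metric are assumptions.

```python
# Hedged sketch: scoring a "When"-style task as event detection.
# A prediction is a hit if it lands within `tol` seconds of a gold
# turn-taking moment; each gold moment can be matched at most once.

def timing_f1(gold, pred, tol=0.5):
    """F1 over greedily matched (prediction, gold-moment) pairs."""
    gold_left = sorted(gold)
    hits = 0
    for p in sorted(pred):
        for g in gold_left:
            if abs(p - g) <= tol:
                gold_left.remove(g)  # consume the matched gold moment
                hits += 1
                break
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# An "aggressive" model fires constantly; a "shy" one barely fires.
gold = [4.0, 11.0, 18.5]                         # good moments to speak
aggressive = [1.0, 3.8, 6.0, 9.0, 11.2, 14.0, 17.0]
shy = [18.4]
print(timing_f1(gold, aggressive))  # many false alarms -> low precision
print(timing_f1(gold, shy))         # few attempts -> low recall
```

Both failure styles score poorly, but for opposite reasons, which is why a combined metric like F1 is a natural fit for the "sweet spot" the paper describes.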

💬 How: The "Natural Response" Game

  • The Task: Once the AI decides to speak, what does it say?
  • The Challenge: It needs to sound natural, fit the mood, and make sense with what was just said.
  • The Result: Even when an AI gets the timing right, what it says is often robotic, generic, or emotionally tone-deaf. It's like a robot saying "That is interesting" to a friend who just told them they lost their job.

3. The Big Discovery: "Smart but Clueless"

The most surprising finding of the paper is a decoupling (a disconnect).

  • The Analogy: Imagine a student who gets an 'A' on a math test (Perception) but fails a group project because they can't talk to their teammates (Interaction).
  • The Finding: The AI models that were best at identifying who was speaking were not necessarily the ones that were best at saying something appropriate.
  • Why it matters: This proves that just making AI "smarter" at recognizing faces and voices isn't enough. We need to teach them the social rhythm of conversation.
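A standard way to quantify this kind of decoupling is to rank models separately by perception score and by interaction score, then check how correlated the two rankings are: a low (or negative) rank correlation means being good at one says little about the other. The sketch below uses Spearman's rho with made-up scores; it illustrates the analysis style, not the paper's actual numbers.

```python
# Hedged sketch of a "decoupling" check: rank correlation between a
# perception leaderboard and an interaction leaderboard. All scores
# below are hypothetical.

def ranks(xs):
    """0-based ranks of xs in ascending order (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical leaderboard: the best perceiver is the worst interactor.
perception  = [0.91, 0.85, 0.78, 0.70]  # "Who"-style accuracy per model
interaction = [0.40, 0.62, 0.55, 0.58]  # timing/naturalness per model
print(round(spearman(perception, interaction), 2))  # weak, negative
```

A rho near +1 would mean the two skills rise and fall together; a value near zero or below, as in this toy example, is the signature of the "smart but clueless" gap.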

4. The "Conflict" Test

The researchers also created tricky scenarios where the video and audio didn't match (e.g., the video shows a person smiling, but the audio is a sad voice).

  • The Result: Most AIs crumbled. They got confused and couldn't figure out what was real. This shows that current AI is fragile when the world gets messy, unlike humans who can easily tell, "Oh, that's a movie scene, not a real person."

Summary: What Does This Mean for the Future?

The paper concludes that we are currently building AI that is factually correct but socially awkward.

To make AI that feels like a real human friend (or a helpful assistant), we can't just focus on accuracy. We need to train them on the rhythm, timing, and emotional nuance of human interaction. SocialOmni is the first step in measuring how close we are to achieving that, showing us that we still have a long way to go before our AI can truly "hang out" with us.
