SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

The paper introduces SocialOmni, a benchmark designed to evaluate the social interactivity of omni-modal large language models. It assesses their ability to handle speaker identification, interruption timing, and natural interruption generation, and reveals a significant gap between perceptual accuracy and conversational competence in current models.

Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji

Published 2026-03-18

Imagine you are at a lively dinner party with friends. You aren't just listening to what people say; you are watching their faces, noticing who is looking at whom, sensing when someone is about to finish a sentence, and deciding exactly when to jump in with a joke or a comment without being rude.

Now, imagine teaching a robot to do the same thing.

This paper introduces SocialOmni, a new "exam" designed to test if AI models (specifically "Omni" models that can see, hear, and speak) are actually good at this social dance, or if they are just really good at taking a test.

Here is the breakdown of what the paper is about, using simple analogies:

1. The Problem: The "Robot Who Answers Too Late"

Current AI models are like brilliant students who are terrible at conversation.

  • If you ask them a question about a video ("Who is speaking?"), they get the answer right 90% of the time.
  • But in a real conversation, they might interrupt you while you're still talking, or wait so long to answer that the topic has already changed.
  • Existing tests only check if the AI knows the facts. They don't check if the AI knows the social rules.

2. The Solution: The "SocialOmni" Exam

The researchers created a new benchmark called SocialOmni. Think of it as a driving test for conversation. Instead of just asking, "Do you know the traffic laws?" (facts), they put the AI behind the wheel in real traffic to see if it can actually drive without crashing.

The exam tests three specific skills, which the authors call Who, When, and How:

🗣️ Who: The "Spot the Speaker" Game

  • The Task: In a video with three people talking, the AI has to identify exactly who is making a sound at a specific second.
  • The Twist: Sometimes the camera shows Person A, but the voice belongs to Person B (who is off-screen).
  • The Result: Many AIs get tricked. They see the face and assume that's who is talking, ignoring the voice. It's like a robot seeing a puppet and thinking the puppet is the one speaking, not the person behind it.
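The scoring for a task like this is conceptually simple: at each annotated timestamp, did the model name the right speaker? Here is a minimal sketch; the field names (`t`, `gold`, `pred`) are illustrative assumptions, not the paper's actual data format.

```python
# Hedged sketch of scoring a "Who"-style task. Each item holds a
# timestamp, the ground-truth speaker (gold), and the model's guess
# (pred). Field names are assumptions for illustration only.

def speaker_id_accuracy(items):
    """Fraction of timestamps where the predicted speaker matches."""
    if not items:
        return 0.0
    correct = sum(1 for it in items if it["pred"] == it["gold"])
    return correct / len(items)

# Tiny example: at t=12.5 the camera shows person "A", but the voice
# belongs to off-screen person "B" -- the model follows the face.
items = [
    {"t": 3.0,  "gold": "A", "pred": "A"},
    {"t": 12.5, "gold": "B", "pred": "A"},  # fooled by the visible face
    {"t": 20.0, "gold": "C", "pred": "C"},
]
print(speaker_id_accuracy(items))  # 2 of 3 correct
```

The "puppet" failure mode shows up as exactly this kind of error: the prediction matches the visible face rather than the audible voice.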

⏱️ When: The "Perfect Timing" Game

  • The Task: The AI has to decide the exact moment to interrupt or take a turn in the conversation.
  • The Challenge: If you interrupt too early, you are rude. If you wait too long, the moment has passed, and the conversation feels awkward.
  • The Result: The paper found that some AIs are aggressive (they interrupt constantly, like a toddler), while others are too shy (they wait so long they miss the chance to speak). Very few models found the "sweet spot."
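One common way to score timing tasks like this is to treat it as event detection: a predicted interruption counts as a hit if it falls within a small tolerance window of a gold turn-taking moment, then combine precision and recall. This is a sketch of that idea, not the paper's actual protocol; the 0.5-second tolerance and the metric are assumptions.

```python
# Hedged sketch: scoring a "When"-style task as event detection.
# A prediction is a hit if it lands within `tol` seconds of a gold
# turn-taking moment; each gold moment can be matched at most once.

def timing_f1(gold, pred, tol=0.5):
    """F1 over greedily matched (prediction, gold-moment) pairs."""
    gold_left = sorted(gold)
    hits = 0
    for p in sorted(pred):
        for g in gold_left:
            if abs(p - g) <= tol:
                gold_left.remove(g)  # consume the matched gold moment
                hits += 1
                break
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# An "aggressive" model fires constantly; a "shy" one barely fires.
gold = [4.0, 11.0, 18.5]                         # good moments to speak
aggressive = [1.0, 3.8, 6.0, 9.0, 11.2, 14.0, 17.0]
shy = [18.4]
print(timing_f1(gold, aggressive))  # many false alarms -> low precision
print(timing_f1(gold, shy))         # few attempts -> low recall
```

Both failure styles score poorly, but for opposite reasons, which is why a combined metric like F1 is a natural fit for the "sweet spot" the paper describes.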

💬 How: The "Natural Response" Game

  • The Task: Once the AI decides to speak, what does it say?
  • The Challenge: It needs to sound natural, fit the mood, and make sense with what was just said.
  • The Result: Even when an AI gets the timing right, what it says is often robotic, generic, or emotionally tone-deaf. It's like a robot saying "That is interesting" to a friend who just told them they lost their job.

3. The Big Discovery: "Smart but Clueless"

The most surprising finding of the paper is a decoupling (a disconnect).

  • The Analogy: Imagine a student who gets an 'A' on a math test (Perception) but fails a group project because they can't talk to their teammates (Interaction).
  • The Finding: The AI models that were best at identifying who was speaking were not necessarily the ones that were best at saying something appropriate.
  • Why it matters: This proves that just making AI "smarter" at recognizing faces and voices isn't enough. We need to teach them the social rhythm of conversation.
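A standard way to quantify this kind of decoupling is to rank models separately by perception score and by interaction score, then check how correlated the two rankings are: a low (or negative) rank correlation means being good at one says little about the other. The sketch below uses Spearman's rho with made-up scores; it illustrates the analysis style, not the paper's actual numbers.

```python
# Hedged sketch of a "decoupling" check: rank correlation between a
# perception leaderboard and an interaction leaderboard. All scores
# below are hypothetical.

def ranks(xs):
    """0-based ranks of xs in ascending order (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical leaderboard: the best perceiver is the worst interactor.
perception  = [0.91, 0.85, 0.78, 0.70]  # "Who"-style accuracy per model
interaction = [0.40, 0.62, 0.55, 0.58]  # timing/naturalness per model
print(round(spearman(perception, interaction), 2))  # weak, negative
```

A rho near +1 would mean the two skills rise and fall together; a value near zero or below, as in this toy example, is the signature of the "smart but clueless" gap.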

4. The "Conflict" Test

The researchers also created tricky scenarios where the video and audio didn't match (e.g., the video shows a person smiling, but the audio is a sad voice).

  • The Result: Most AIs crumbled. They got confused and couldn't figure out what was real. This shows that current AI is fragile when the world gets messy, unlike humans who can easily tell, "Oh, that's a movie scene, not a real person."

Summary: What Does This Mean for the Future?

The paper concludes that we are currently building AI that is factually correct but socially awkward.

To make AI that feels like a real human friend (or a helpful assistant), we can't just focus on accuracy. We need to train them on the rhythm, timing, and emotional nuance of human interaction. SocialOmni is the first step in measuring how close we are to achieving that, showing us that we still have a long way to go before our AI can truly "hang out" with us.
