Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

The paper introduces Daily-Omni, a comprehensive audio-visual benchmark with 1,197 questions designed to evaluate cross-modal temporal reasoning, revealing that current multimodal large language models still struggle with alignment-critical tasks despite strong unimodal performance.

Ziwei Zhou, Rui Wang, Zuxuan Wu, Yu-Gang Jiang

Published Wed, 11 Ma

Imagine you are watching a movie with the sound turned off. You see a character slam a door and look angry. You might guess they are mad. Now, imagine you have the sound on but the screen is black. You hear a loud CRASH and a shout. You might guess something broke.

But what if you need to know exactly when the shout happened relative to the crash? Did the shout come before the crash to warn them? Did it happen after to express frustration? Or did they happen at the exact same split second?

This is the specific problem the paper Daily-Omni is trying to solve. It's like a "driver's license test" for AI, but instead of driving a car, the AI has to drive a conversation between its eyes (video) and its ears (audio) at the exact same time.

Here is the breakdown of the paper in simple terms:

1. The Problem: The AI is "Deaf-Blind" in Time

Current super-smart AI models (called Multimodal Large Language Models) are great at looking at pictures or listening to music separately. But when you give them a video with sound, they often struggle to sync the two.

Think of it like a dancer who learned the choreography from a silent video and learned the song from an audio recording, but has never rehearsed the two together. They know the steps (video) and the beat (audio), yet they can't feel how the two fit together in the moment. The paper argues that most AI today is bad at exactly this "temporal alignment"—figuring out which sound matches which visual action at the exact same time.

2. The Solution: A New "Gym" for AI (Daily-Omni)

The researchers built a new testing ground called Daily-Omni.

  • The Workout: They collected 684 real-world videos (like people cooking, fixing cars, or playing music) and created nearly 1,200 questions about them.
  • The Challenge: The questions aren't just "What color is the car?" They are tricky, like: "Did the person drop the glass before or after the dog barked?" or "Why did the crowd cheer right when the singer hit that high note?"
  • The Goal: To force the AI to stop guessing and actually connect the dots between what it sees and what it hears in real-time.
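To make the "alignment-critical" idea concrete, here is a toy sketch of what such a question might look like. The record structure and field names are invented for illustration; the summary above doesn't specify the benchmark's actual schema.

```python
# Hypothetical shape of a Daily-Omni-style question (illustrative only;
# the benchmark's real schema is not given in this summary).
example_question = {
    "video_id": "cooking_0042",  # made-up ID
    "question": "Did the person drop the glass before or after the dog barked?",
    "options": ["Before", "After", "At the same time", "The dog never barked"],
    "answer": "After",
    "requires": ["visual event", "audio event", "temporal order"],
}

def is_alignment_critical(q):
    """A question is alignment-critical if answering it needs both
    modalities plus their relative timing, not either one alone."""
    needed = {"visual event", "audio event", "temporal order"}
    return needed.issubset(q["requires"])

print(is_alignment_critical(example_question))  # True
```

The point of the `requires` field is the benchmark's core design choice: a question only tests cross-modal reasoning if dropping either modality (or the timing between them) makes it unanswerable.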

3. How They Built It: The "Smart Assembly Line"

Making these questions by hand would take forever. So, the team built a semi-automatic factory line:

  1. The Scribes: They used powerful AI to write down what was happening in the video (visuals) and what was happening in the sound (audio) separately.
  2. The Editors: Another AI checked to make sure the story made sense (e.g., if the visual says "a man is running," the audio shouldn't say "a cat is sleeping").
  3. The Time-Keepers: This is the magic step. They taught the AI to link specific moments: "The sound of the door slamming happened at the exact same time the visual of the door closing appeared."
  4. The Filter: They made sure the questions couldn't be answered just by reading the text. If a human could guess the answer without watching the video or listening to the sound, the question was thrown out.
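The four steps above can be sketched as a single pipeline. Every helper here is a simplified stand-in for the LLM calls the authors actually make—only the control flow mirrors the paper's description:

```python
# Toy, runnable sketch of the semi-automatic QA "assembly line".
# All helpers are stand-ins for LLM calls; data is hand-made.

def caption_video(video):                # 1. Scribes (visual side)
    return video["visual_events"]        # e.g. [("door closes", 3.2)]

def caption_audio(video):                # 1. Scribes (audio side)
    return video["audio_events"]         # e.g. [("door slam", 3.3)]

def stories_agree(visual, audio):        # 2. Editors
    # crude consistency check: both modalities report something
    return bool(visual) and bool(audio)

def align_in_time(visual, audio, tol=0.5):   # 3. Time-Keepers
    # pair visual and audio events that occur within `tol` seconds
    return [(v, a) for v, tv in visual
                   for a, ta in audio if abs(tv - ta) <= tol]

def answerable_from_text_alone(question):    # 4. Filter
    # stand-in for the text-only LLM check; here nothing is guessable
    return False

def generate_qa(video):
    visual, audio = caption_video(video), caption_audio(video)
    if not stories_agree(visual, audio):
        return []
    pairs = align_in_time(visual, audio)
    questions = [f"What sound occurred when '{v}' happened?" for v, _ in pairs]
    return [q for q in questions if not answerable_from_text_alone(q)]

clip = {"visual_events": [("door closes", 3.2)],
        "audio_events": [("door slam", 3.3)]}
print(generate_qa(clip))
```

The filter in step 4 is what makes the benchmark honest: if a text-only model can guess the answer, the question measures language priors, not audio-visual understanding.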

4. The Results: The AI is Still Stumbling

They put 24 different top-tier AI models through this new gym. The results were a bit of a reality check:

  • The "Text-Only" Trap: Surprisingly, some AI models scored well above chance just by reading the question text, without looking at the video or listening to the audio. Even with the filtering step, language priors still let models guess some answers.
  • The "Sync" Struggle: Even the best AI models (like the ones from Google and Alibaba) struggled with the "time-sync" questions. They often got the facts right but the timing wrong.
  • The Simple Baseline Wins: The researchers built a simple, "training-free" tool called the Daily-Omni Agent. It's like a human taking notes: "First I see X, then I hear Y." Even though this tool is simple and not a super-complex neural network, it actually beat several of the massive, expensive AI models.
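The "take notes, then reason" idea behind the agent can be sketched in a few lines. The real Daily-Omni Agent calls separate vision, audio, and reasoning models; here each is a stub with hand-made timestamps so the flow is visible:

```python
# Minimal sketch of a note-taking agent (stubs only, not the paper's code).

def visual_notes(video):
    # stand-in for a video-captioning model that emits (time, event) notes
    return [(2.0, "glass slips from hand"), (2.4, "glass hits floor")]

def audio_notes(video):
    # stand-in for an audio-captioning model that emits (time, event) notes
    return [(2.5, "crash"), (3.1, "dog barks")]

def answer(video, question):
    # merge both note streams into one time-ordered transcript;
    # a real agent would then prompt a text-only LLM with it + the question
    timeline = sorted(visual_notes(video) + audio_notes(video))
    return "; ".join(f"[{t:.1f}s] {e}" for t, e in timeline)

print(answer(None, "Did the crash come before the bark?"))
```

The trick is that sorting by timestamp does the temporal alignment *explicitly*, as plain text, so the downstream reasoner never has to sync raw audio and video itself—which is exactly the skill the end-to-end models lack.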

The Big Takeaway

The paper concludes that while AI is getting very good at seeing and hearing separately, it is still terrible at listening and watching at the same time.

The Metaphor:
Imagine a symphony orchestra. The current AI models are like musicians who can play their instruments perfectly on their own. But when they try to play together, they are out of sync. One is rushing, the other is dragging. Daily-Omni is the conductor trying to teach them how to play in perfect harmony.

The paper suggests that for AI to truly understand the real world (like a self-driving car hearing a siren while seeing a police car, or a robot understanding a human's tone of voice while watching their facial expression), we need to fix this "temporal alignment" problem first. Until then, our AI is still a bit out of step.