Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments

This paper introduces a new dataset, derived from football highlight reels, to evaluate whether foundation models can identify contextually important video moments. It finds that current state-of-the-art models perform near chance level, because they rely on a single dominant modality and fail to synthesize cross-modal information effectively.

Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle

Published 2026-03-06

Imagine you are watching a 90-minute soccer match. It's a long, winding story with lots of boring parts (players jogging, passing the ball back and forth) and a few thrilling parts (a goal, a near-miss, a dramatic save).

Now, imagine you want to build a robot that can watch this entire game and write a short, exciting summary for you. Before the robot can write the story, it has to do one crucial thing first: It needs to know which moments actually matter.

This paper is about testing how good today's smartest AI robots are at that specific job.

The Big Question: Can AI Spot the "Highlight Reel"?

The researchers asked: If we show an AI a random 15-second clip from a soccer game, can it tell us if it's a "big deal" (like a goal) or just "background noise" (like a corner kick that goes nowhere)?

To test this, they built a new dataset called MOMENTS.

  • How they built it: They didn't ask humans to watch hours of video and label every second. Instead, they used a clever trick. They took official "Highlight Reels" (the short clips TV stations show after a game) and matched them up with the full 90-minute games.
  • The Logic: If a moment is in the highlight reel, it's "Important." If it's in the full game but not in the highlight reel, it's "Non-Important." (This labeling rule is sketched in code after the list.)
  • The Result: They created thousands of examples to test the AI.
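
To make the labeling rule concrete, here is a minimal Python sketch of that logic. The function names, the temporal-overlap threshold, and the timestamps are hypothetical illustrations, not the authors' actual pipeline:

```python
# Minimal sketch of the MOMENTS-style labeling logic described above.
# All names and thresholds here are hypothetical illustrations.

def overlaps(clip, segment, min_iou=0.5):
    """Temporal intersection-over-union between a candidate clip and a
    highlight segment, both given as (start_sec, end_sec) tuples."""
    inter = max(0.0, min(clip[1], segment[1]) - max(clip[0], segment[0]))
    union = (clip[1] - clip[0]) + (segment[1] - segment[0]) - inter
    return inter / union >= min_iou if union > 0 else False

def label_clips(candidate_clips, highlight_segments):
    """Label a full-game clip 'important' (True) if it overlaps any
    segment that made it into the official highlight reel."""
    return {clip: any(overlaps(clip, seg) for seg in highlight_segments)
            for clip in candidate_clips}

# Example: two 15-second clips from the full game; only one is in the reel.
clips = [(600.0, 615.0), (1200.0, 1215.0)]
reel = [(598.0, 616.0)]  # one highlight segment, in full-game time
print(label_clips(clips, reel))  # {(600.0, 615.0): True, (1200.0, 1215.0): False}
```

The design choice worth noting: no human ever labels a clip directly. The broadcaster's editorial decision about what made the reel is the label.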

The Test: The AI vs. The Human Eye

They took several of the most advanced AI models available (the "Foundation Models" that can see video, hear audio, and read text) and gave them a simple task: "Is this clip important? Yes or No?"

They tested the AI in three ways (sketched in code after the list):

  1. Just the Video: Showing the AI the visual action.
  2. Just the Commentary: Showing the AI what the announcer said (transcribed as text).
  3. Everything: Showing the video, the audio, and the text together.
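
As a rough sketch of how such an evaluation might look, the snippet below assembles the model inputs for each condition. The `model.generate(...)` interface, the prompt wording, and the clip fields (`frames`, `transcript`, `audio`) are all hypothetical stand-ins; each real foundation model exposes its own API:

```python
# Hypothetical sketch of the three test conditions. The model interface,
# prompt, and clip fields are illustrative assumptions, not the paper's code.

QUESTION = "Is this clip an important moment in the match? Answer Yes or No."

def build_inputs(clip, condition):
    """Assemble the inputs the model sees under each condition."""
    inputs = {"prompt": QUESTION}
    if condition in ("video_only", "everything"):
        inputs["frames"] = clip.frames            # sampled video frames
    if condition in ("commentary_only", "everything"):
        # Prepend the transcribed commentary to the question.
        inputs["prompt"] = clip.transcript + "\n\n" + QUESTION
    if condition == "everything":
        inputs["audio"] = clip.audio              # raw audio track
    return inputs

def classify(model, clip, condition):
    """Ask the Yes/No question and parse the model's answer."""
    answer = model.generate(**build_inputs(clip, condition))
    return answer.strip().lower().startswith("yes")  # True = "important"
```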

The Shocking Results

Here is the bad news: The AI struggled mightily.

  • The Score: The AI's performance was barely better than flipping a coin. It was essentially guessing. (The quick check after this list shows why ~50% is the coin-flip bar.)
  • The "Superpower" Myth: You might think that if you give the AI more information (video + audio + text), it would get smarter. But the researchers found that giving the AI all three didn't help much. In fact, the AI often ignored the extra information and relied on just one thing.

The "One-Track Mind" Problem

The researchers discovered something fascinating about how the AI failed. It had a "one-track mind" depending on what it was looking at (quantified in the sketch after the list):

  1. When looking for a Goal (Important): The AI relied almost entirely on sight. It saw the ball go into the net and said, "Yes, this is important!" It ignored the announcer screaming about it.
  2. When looking for a Boring Moment (Non-Important): The AI relied almost entirely on words. If the announcer said, "And now, a corner kick," the AI said, "No, this isn't important," even if the video showed a chaotic scramble.
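
One way to quantify this one-track behavior, a hedged sketch rather than the authors' exact analysis, is to check how often the all-modalities prediction simply agrees with each single-modality prediction, separately for the two classes:

```python
# If the "everything" predictions almost always match the video-only run on
# important clips, and the text-only run on non-important clips, the extra
# modalities are effectively being ignored. Hypothetical analysis sketch.

def agreement(preds_a, preds_b):
    """Fraction of clips on which two prediction lists agree."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def reliance_report(preds_all, preds_video, preds_text, labels):
    """Compare the multimodal run against each single-modality run,
    separately for important (True) and non-important (False) clips."""
    report = {}
    for cls in (True, False):
        idx = [i for i, y in enumerate(labels) if y == cls]
        report[cls] = {
            "matches_video_only": agreement(
                [preds_all[i] for i in idx], [preds_video[i] for i in idx]),
            "matches_text_only": agreement(
                [preds_all[i] for i in idx], [preds_text[i] for i in idx]),
        }
    return report
```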

The Analogy: Imagine a student taking a test.

  • When the question is about a picture, the student only looks at the picture and ignores the instructions.
  • When the question is about a story, the student only reads the story and ignores the picture.
  • The student never learns to combine the two to get the full picture.

The "Context" Trap

The hardest part for the AI was Context.

  • Example: A "Shot on Target" (a shot the goalkeeper has to stop) can be a huge, exciting moment, or it can be a boring, routine save.
  • The AI's Failure: The AI couldn't tell the difference. It didn't understand that the announcer's tone or the score of the game (the context) changed the importance of the moment. It just saw a ball reaching the goalkeeper and didn't know if it was a "big save" or a "boring save."

The Takeaway: We Aren't There Yet

The paper concludes that while AI is getting better at describing what it sees, it is not yet ready to be a sports commentator or a video summarizer.

  • The Problem: Current AI models are like a person who sees a car crash but can't tell why it happened or how serious it was without reading the news report. They can't blend the visual and the narrative together seamlessly.
  • The Future: We need to build AI that is more like a human editor—one who can dynamically switch between looking at the screen and listening to the story, understanding that sometimes the sound tells you the story is exciting, and sometimes the sight does.

In short: The AI can see the ball, and it can read the words, but it still hasn't learned to watch the game the way a human fan does. It is still learning to tell a "highlight" from something that merely "happened."