Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments

This paper introduces a new dataset, derived from football highlight reels, to evaluate whether foundation models can identify contextually important video moments. It finds that current state-of-the-art models perform near chance level, because they rely on a single dominant modality and fail to synthesize cross-modal information effectively.

Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle

Published 2026-03-06

Imagine you are watching a 90-minute soccer match. It's a long, winding story with lots of boring parts (players jogging, passing the ball back and forth) and a few thrilling parts (a goal, a near-miss, a dramatic save).

Now, imagine you want to build a robot that can watch this entire game and write a short, exciting summary for you. Before the robot can write the story, it has to do one crucial thing first: It needs to know which moments actually matter.

This paper is about testing how good today's smartest AI robots are at that specific job.

The Big Question: Can AI Spot the "Highlight Reel"?

The researchers asked: If we show an AI a random 15-second clip from a soccer game, can it tell us if it's a "big deal" (like a goal) or just "background noise" (like a corner kick that goes nowhere)?

To test this, they built a new dataset called MOMENTS.

  • How they built it: They didn't ask humans to watch hours of video and label every second. Instead, they used a clever trick. They took official "Highlight Reels" (the short clips TV stations show after a game) and matched them up with the full 90-minute games.
  • The Logic: If a moment is in the highlight reel, it's "Important." If it's in the full game but not in the highlight reel, it's "Non-Important." (This labeling rule is sketched in code after the list.)
  • The Result: They created thousands of examples to test the AI.
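
To make the labeling rule concrete, here is a minimal Python sketch of that logic. The function names, the temporal-overlap threshold, and the timestamps are hypothetical illustrations, not the authors' actual pipeline:

```python
# Minimal sketch of the MOMENTS-style labeling logic described above.
# All names and thresholds here are hypothetical illustrations.

def overlaps(clip, segment, min_iou=0.5):
    """Temporal intersection-over-union between a candidate clip and a
    highlight segment, both given as (start_sec, end_sec) tuples."""
    inter = max(0.0, min(clip[1], segment[1]) - max(clip[0], segment[0]))
    union = (clip[1] - clip[0]) + (segment[1] - segment[0]) - inter
    return inter / union >= min_iou if union > 0 else False

def label_clips(candidate_clips, highlight_segments):
    """Label a full-game clip 'important' (True) if it overlaps any
    segment that made it into the official highlight reel."""
    return {clip: any(overlaps(clip, seg) for seg in highlight_segments)
            for clip in candidate_clips}

# Example: two 15-second clips from the full game; only one is in the reel.
clips = [(600.0, 615.0), (1200.0, 1215.0)]
reel = [(598.0, 616.0)]  # one highlight segment, in full-game time
print(label_clips(clips, reel))  # {(600.0, 615.0): True, (1200.0, 1215.0): False}
```

The design choice worth noting: no human ever labels a clip directly. The broadcaster's editorial decision about what made the reel is the label.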

The Test: The AI vs. The Human Eye

They took several of the most advanced AI models available (the "Foundation Models" that can see video, hear audio, and read text) and gave them a simple task: "Is this clip important? Yes or No?"

They tested the AI in three ways (sketched in code after the list):

  1. Just the Video: Showing the AI the visual action.
  2. Just the Commentary: Showing the AI what the announcer said (transcribed as text).
  3. Everything: Showing the video, the audio, and the text together.
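
As a rough sketch of how such an evaluation might look, the snippet below assembles the model inputs for each condition. The `model.generate(...)` interface, the prompt wording, and the clip fields (`frames`, `transcript`, `audio`) are all hypothetical stand-ins; each real foundation model exposes its own API:

```python
# Hypothetical sketch of the three test conditions. The model interface,
# prompt, and clip fields are illustrative assumptions, not the paper's code.

QUESTION = "Is this clip an important moment in the match? Answer Yes or No."

def build_inputs(clip, condition):
    """Assemble the inputs the model sees under each condition."""
    inputs = {"prompt": QUESTION}
    if condition in ("video_only", "everything"):
        inputs["frames"] = clip.frames            # sampled video frames
    if condition in ("commentary_only", "everything"):
        # Prepend the transcribed commentary to the question.
        inputs["prompt"] = clip.transcript + "\n\n" + QUESTION
    if condition == "everything":
        inputs["audio"] = clip.audio              # raw audio track
    return inputs

def classify(model, clip, condition):
    """Ask the Yes/No question and parse the model's answer."""
    answer = model.generate(**build_inputs(clip, condition))
    return answer.strip().lower().startswith("yes")  # True = "important"
```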

The Shocking Results

Here is the bad news: The AI struggled mightily.

  • The Score: The AI's performance was barely better than flipping a coin. It was essentially guessing. (The quick check after this list shows why ~50% is the coin-flip bar.)
  • The "Superpower" Myth: You might think that if you give the AI more information (video + audio + text), it would get smarter. But the researchers found that giving the AI all three didn't help much. In fact, the AI often ignored the extra information and relied on just one thing.

The "One-Track Mind" Problem

The researchers discovered something fascinating about how the AI failed. It had a "one-track mind" depending on what it was looking at (quantified in the sketch after the list):

  1. When looking for a Goal (Important): The AI relied almost entirely on sight. It saw the ball go into the net and said, "Yes, this is important!" It ignored the announcer screaming about it.
  2. When looking for a Boring Moment (Non-Important): The AI relied almost entirely on words. If the announcer said, "And now, a corner kick," the AI said, "No, this isn't important," even if the video showed a chaotic scramble.
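
One way to quantify this one-track behavior, a hedged sketch rather than the authors' exact analysis, is to check how often the all-modalities prediction simply agrees with each single-modality prediction, separately for the two classes:

```python
# If the "everything" predictions almost always match the video-only run on
# important clips, and the text-only run on non-important clips, the extra
# modalities are effectively being ignored. Hypothetical analysis sketch.

def agreement(preds_a, preds_b):
    """Fraction of clips on which two prediction lists agree."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def reliance_report(preds_all, preds_video, preds_text, labels):
    """Compare the multimodal run against each single-modality run,
    separately for important (True) and non-important (False) clips."""
    report = {}
    for cls in (True, False):
        idx = [i for i, y in enumerate(labels) if y == cls]
        report[cls] = {
            "matches_video_only": agreement(
                [preds_all[i] for i in idx], [preds_video[i] for i in idx]),
            "matches_text_only": agreement(
                [preds_all[i] for i in idx], [preds_text[i] for i in idx]),
        }
    return report
```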

The Analogy: Imagine a student taking a test.

  • When the question is about a picture, the student only looks at the picture and ignores the instructions.
  • When the question is about a story, the student only reads the story and ignores the picture.
  • The student never learns to combine the two to get the full picture.

The "Context" Trap

The hardest part for the AI was Context.

  • Example: A "Shot on Target" (a shot the goalkeeper has to stop) can be a huge, exciting moment, or it can be a boring, routine save.
  • The AI's Failure: The AI couldn't tell the difference. It didn't understand that the announcer's tone or the score of the game (the context) changed the importance of the moment. It just saw a ball reaching the goalkeeper and didn't know if it was a "big save" or a "boring save."

The Takeaway: We Aren't There Yet

The paper concludes that while AI is getting better at describing what it sees, it is not yet ready to be a sports commentator or a video summarizer.

  • The Problem: Current AI models are like a person who sees a car crash but can't tell why it happened or how serious it was without reading the news report. They can't blend the visual and the narrative together seamlessly.
  • The Future: We need to build AI that is more like a human editor—one who can dynamically switch between looking at the screen and listening to the story, understanding that sometimes the sound tells you the story is exciting, and sometimes the sight does.

In short: The AI can see the ball, and it can read the words, but it still hasn't learned to watch the game the way a human fan does. It is still learning to tell a "highlight" from something that merely "happened."