SUMMIR: A Hallucination-Aware Framework for Ranking Sports Insights from LLMs

Imagine you are a huge sports fan. Every day, thousands of news articles pop up about your favorite teams: pre-game predictions, post-game analysis, player interviews, and stats. It's like trying to drink from a firehose. You want the best, most exciting, and most accurate stories, but finding them in that massive pile of text is exhausting.

This paper introduces SUMMIR, a smart system designed to act as your ultimate "Sports News Butler." Its job is to read all those articles, pull out the juicy highlights, check if they are true, and then arrange them in the perfect order for you to read.

Here is how it works, broken down into simple steps:

1. The Great Filter (Cleaning the Mess)

First, the system goes out and grabs thousands of articles about specific matches (like Cricket, Soccer, Basketball, and Baseball). But the internet is messy. Sometimes it grabs an article about a game that happened last year or a different team entirely.

The Analogy: Imagine you are sorting a giant bag of mixed-up puzzle pieces. You need to throw away the pieces that don't belong to this specific puzzle.
The Solution: The team used a "Two-Step Validation Pipeline." Think of it as a Junior Editor (a smaller, faster AI) who does a quick sweep to remove obvious junk, followed by a Senior Editor (a super-smart, expensive AI like GPT-4o) who double-checks the remaining pieces to make sure they are actually about the right match. In the end, only the 7,900 articles that truly matter are kept out of the roughly 32,600 that were collected.

2. The Storyteller (Generating Insights)

Once they have the right articles, the system needs to turn boring paragraphs into exciting "insights."

The Analogy: Imagine a chef taking a whole cow and turning it into a delicious, plated meal. The system takes the raw text and serves up specific dishes: "New Records Broken," "Key Moments," "Player Quotes," and "Post-Game Emotions."
The Magic: They used four different "Chef AIs" (GPT-4o, Qwen, Llama, Mixtral) to generate over 280,000 of these insights.

3. The Lie Detector (Hallucination Check)

Here is the tricky part. AI models sometimes "hallucinate"—they make things up that sound real but are completely false.

The Analogy: Imagine a student writing a history essay. Sometimes they confidently say, "Napoleon won at Waterloo," when he actually lost. You need a strict teacher to catch that lie.
The Solution: The team set up a "Fact-Check Squad" that relies on two methods:
1. FactScore: Checks if every specific fact (like a score or a name) matches the original article.
2. SummaC: Checks if the story logically follows from the source text.
The Result: Some AIs were better at telling the truth than others. GPT-4o was the most honest chef, while the others, Mixtral in particular, occasionally served up "fake news" dishes.

4. The Ranking System (SUMMIR)

Now they have thousands of true, interesting insights. But which one should you read first?

The Analogy: Imagine a music festival with 1,000 bands playing at once. You can't listen to all of them. You need a DJ to pick the top 5 songs that will make the crowd go wild right now.
The Solution: They built SUMMIR (Sentence Unified Multimetric Model for Importance Ranking). This is the DJ. It doesn't just pick the longest story; it looks at:
- Emotion: Is the player crying or screaming in joy? (High emotion = higher rank).
- Buzzwords: Did a famous player like Virat Kohli or Lionel Messi do something? (Famous names = higher rank).
- Sarcasm: Is the writer being funny? (It detects this so it doesn't get confused).
- Relevance: Does this actually matter to the game's outcome?

5. The Training (Teaching the DJ)

How did they teach SUMMIR to be a good DJ?

The Analogy: They didn't just tell the DJ "play good music." They had a human judge taste-test the playlists. When the AI picked a good song, the human gave a "thumbs up" (reward). When it picked a bad one, a "thumbs down."
The Method: They used a technique called PPO (Proximal Policy Optimization). Think of this as a video game where the AI gets points for making the right choices. Over time, it learned to prioritize the insights that humans found most interesting and relevant.

The Big Takeaway

The paper shows that by combining smart filtering, strict fact-checking, and a human-like ranking system, we can automatically turn a chaotic ocean of sports news into a clean, reliable, and exciting highlight reel.

It's like having a personal sports journalist who never sleeps, never lies, and always knows exactly which story you want to hear first.

1. Problem Statement

The rapid proliferation of online sports journalism generates vast amounts of unstructured textual data. While valuable, extracting meaningful, factual, and contextually relevant pre-game and post-game insights from these articles remains challenging due to:

Information Overload: Difficulty in retrieving specific match-relevant articles amidst noise.
LLM Hallucinations: Large Language Models (LLMs) often generate plausible-sounding but factually incorrect information (hallucinations) when summarizing or extracting insights.
Lack of Ranking: Existing methods often focus on broad sentiment analysis or event extraction but lack a mechanism to rank insights based on user interest, relevance, and factual accuracy.

The authors propose a framework to automatically curate, validate, generate, and rank sports insights while rigorously detecting hallucinations.

2. Methodology

The proposed framework follows a multi-stage pipeline: Data Collection $\rightarrow$ Validation $\rightarrow$ Insight Generation $\rightarrow$ Hallucination Detection $\rightarrow$ Ranking.

A. Dataset Curation and Validation

Data Collection: The authors curated a dataset of 7,900 news articles covering 800 matches across four major sports: Cricket, Soccer, Basketball, and Baseball. Articles were collected via Google Search API with a three-day time window around match dates.
Two-Step Validation Pipeline: To ensure articles are relevant to specific matches (avoiding confusion between similar matchups or historical data):
1. First Tier (Open-Source): Used Qwen 2.5-32B Instruct to filter articles. This model achieved the best balance of precision (88.5%) and recall (89.1%) among tested open-source models.
2. Second Tier (Proprietary/High-Capacity): Validated the remaining articles using GPT-4o, Qwen 2.5-72B, Llama-3.3-70B, and Mixtral-8x7B.
Result: This process filtered 32,630 initial articles down to 7,900 highly relevant articles.
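The two-tier cascade described above can be sketched as follows. This is a minimal illustration, assuming each tier can be wrapped as a yes/no relevance function. The names `Article`, `two_tier_filter`, `cheap_check`, and `strong_check` are hypothetical, and the two checks are keyword stubs standing in for the open-source and high-capacity LLM calls, not the authors' prompts or code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Article:
    title: str
    body: str


# A relevance check takes an article and a match description, returns True/False.
RelevanceCheck = Callable[[Article, str], bool]


def two_tier_filter(
    articles: Iterable[Article],
    match: str,
    cheap_check: RelevanceCheck,
    strong_check: RelevanceCheck,
) -> List[Article]:
    """Keep an article only if both tiers judge it relevant to `match`.

    The cheap tier runs on every article; the strong tier runs only on
    the survivors, which limits the number of expensive calls.
    """
    survivors = [a for a in articles if cheap_check(a, match)]
    return [a for a in survivors if strong_check(a, match)]


# Hypothetical stubs standing in for LLM calls (assumption, illustration only).
def cheap_check(article: Article, match: str) -> bool:
    teams = match.lower().split(" vs ")
    return any(t in article.body.lower() for t in teams)


def strong_check(article: Article, match: str) -> bool:
    teams = match.lower().split(" vs ")
    return all(t in article.body.lower() for t in teams)


articles = [
    Article("Preview", "India face Australia in the final on Sunday."),
    Article("Old news", "India beat England last year."),
    Article("Unrelated", "Transfer rumours from the Premier League."),
]
kept = two_tier_filter(articles, "India vs Australia", cheap_check, strong_check)
print([a.title for a in kept])  # ['Preview']
```

Running the cheap tier first means the expensive tier sees only a fraction of the raw crawl, which is the usual motivation for a cascade of this kind.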

B. Insight Generation

Prompt Engineering: Sport-specific prompts were designed to extract structured insights into categories: New Records, Key Match Events, Pre-Game Insights, Post-Match Reflections, Miscellaneous Highlights, and Others.
Generation Models: Four advanced LLMs were used to generate insights from the validated articles, producing a total of 281,163 insights:
- GPT-4o
- Qwen 2.5-72B-Instruct
- Llama-3.3-70B-Instruct
- Mixtral-8x7B-Instruct-v0.1
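To make the idea of a sport-specific prompt concrete, here is a small sketch of a prompt builder. The category names come from the summary above. The prompt wording and the function `build_insight_prompt` are hypothetical, since the actual prompts are not reproduced here.

```python
CATEGORIES = [
    "New Records",
    "Key Match Events",
    "Pre-Game Insights",
    "Post-Match Reflections",
    "Miscellaneous Highlights",
    "Others",
]


def build_insight_prompt(sport: str, match: str, article_text: str) -> str:
    """Assemble a sport-specific extraction prompt (hypothetical wording)."""
    category_lines = "\n".join(f"- {name}" for name in CATEGORIES)
    return (
        f"You are analysing a {sport} news article about the match: {match}.\n"
        "Extract insights that are stated in the article and assign each one "
        "to exactly one of these categories:\n"
        f"{category_lines}\n"
        "Do not add facts that are not in the article.\n\n"
        f"Article:\n{article_text}"
    )


prompt = build_insight_prompt(
    "Cricket", "India vs Australia", "India won by 6 wickets."
)
print(prompt)
```

The same template would be sent unchanged to each generation model, so that differences in output reflect the model rather than the prompt.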

C. Hallucination Detection

To ensure factual integrity, a dual-evaluation strategy was employed:

FactScore: A metric that breaks down generated text into atomic facts and verifies them against the source document.
SummaC (Summary Consistency): Uses Natural Language Inference (NLI) to determine if generated sentences are logically entailed by the source article.

Findings: GPT-4o demonstrated the highest factual consistency (FactScore: 95–97%; SummaC: 60–72%), while Mixtral-8x7B showed higher hallucination rates, particularly in Baseball and Soccer.
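The FactScore idea (split the output into atomic facts, then count how many the source supports) can be sketched as below. This is a simplified illustration, not the FactScore implementation: the atomic facts are given by hand, and `naive_verifier` is a toy substring check standing in for the LLM or NLI verifier a real pipeline would use. Both function names are hypothetical.

```python
from typing import Callable, List


def fact_precision(
    atomic_facts: List[str],
    source: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of atomic facts that the verifier judges supported by `source`."""
    if not atomic_facts:
        return 0.0
    supported = sum(1 for fact in atomic_facts if is_supported(fact, source))
    return supported / len(atomic_facts)


def naive_verifier(fact: str, source: str) -> bool:
    # Toy stand-in (assumption): a fact counts as supported only if it appears
    # verbatim in the source. A real verifier would use an LLM or NLI model.
    return fact.lower() in source.lower()


source = "Kohli scored 82 runs. India won by 4 wickets."
facts = ["Kohli scored 82 runs", "India won by 4 wickets", "Kohli hit a century"]
print(round(fact_precision(facts, source, naive_verifier), 3))  # 0.667
```

Here two of the three facts are found in the source, so the unsupported "century" claim lowers the score to 2/3. A SummaC-style check differs in that it scores whole sentences for entailment rather than counting individual facts.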

D. SUMMIR: The Ranking Framework

The core contribution is SUMMIR (Sentence Unified Multimetric Model for Importance Ranking), a novel architecture designed to rank insights based on user-specific interests.

Feature Extraction: The system extracts six distinct features for each insight sentence:
1. Semantic Relevance: Using sentence-transformers and sports lexicons.
2. Emotional Intensity: Using a fine-tuned RoBERTa model (GoEmotions).
3. Sarcasm Detection: Using a T5-base model to adjust emotional scores.
4. TF-IDF Weighting: For term importance.
5. Buzzword Identification: Scoring high-impact sports terms.
6. Named Entity Recognition (NER): Ranking based on public figure popularity (Pantheon dataset).
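One simple way to picture how these six signals could feed a single importance score is a weighted sum in which sarcasm discounts the emotion signal. The sketch below is an assumption for illustration only: the weights, the [0, 1] feature ranges, and the names `InsightFeatures` and `importance_score` are hypothetical, and the actual combination used by SUMMIR is learned rather than hand-set.

```python
from dataclasses import dataclass


@dataclass
class InsightFeatures:
    semantic_relevance: float   # in [0, 1]
    emotion_intensity: float    # in [0, 1]
    sarcasm_probability: float  # in [0, 1]
    tfidf_weight: float         # in [0, 1]
    buzzword_score: float       # in [0, 1]
    entity_popularity: float    # in [0, 1]


# Hypothetical weights (assumption); they sum to 1 so the score stays in [0, 1].
WEIGHTS = {
    "semantic_relevance": 0.30,
    "emotion": 0.20,
    "tfidf_weight": 0.15,
    "buzzword_score": 0.15,
    "entity_popularity": 0.20,
}


def importance_score(f: InsightFeatures) -> float:
    # Sarcasm is not scored on its own; it discounts the emotion signal.
    adjusted_emotion = f.emotion_intensity * (1.0 - f.sarcasm_probability)
    return (
        WEIGHTS["semantic_relevance"] * f.semantic_relevance
        + WEIGHTS["emotion"] * adjusted_emotion
        + WEIGHTS["tfidf_weight"] * f.tfidf_weight
        + WEIGHTS["buzzword_score"] * f.buzzword_score
        + WEIGHTS["entity_popularity"] * f.entity_popularity
    )


sincere = InsightFeatures(0.8, 0.9, 0.0, 0.5, 0.6, 0.7)
sarcastic = InsightFeatures(0.8, 0.9, 0.9, 0.5, 0.6, 0.7)
ranked = sorted(
    [("sarcastic", sarcastic), ("sincere", sincere)],
    key=lambda item: importance_score(item[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # ['sincere', 'sarcastic']
```

The two example insights differ only in sarcasm probability, so the sincere one ranks first because its emotion signal is not discounted.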
Training Mechanism (PPO):
- A lightweight 1B-parameter LLaMA model is fine-tuned using Proximal Policy Optimization (PPO).
- Reward Signal: A novel ScoreNet (a differentiable scoring function) generates relevance priors. The reward is a convex combination of Gold Ranking (human/LLM annotated) and ScoreNet Ranking.
- Objective: Maximize NDCG (Normalized Discounted Cumulative Gain) and Recall while maintaining alignment with human preferences.
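Written out, a convex combination of two reward terms has the form below. The symbols $R_{\text{gold}}$, $R_{\text{ScoreNet}}$, and the mixing weight $\lambda$ are notation introduced here for illustration; this summary does not state the weight's value or how each term is computed.

$$
R = \lambda \, R_{\text{gold}} + (1 - \lambda) \, R_{\text{ScoreNet}}, \qquad 0 \le \lambda \le 1
$$

With $\lambda = 1$ the policy is rewarded only for agreeing with the gold ranking, and with $\lambda = 0$ only for agreeing with the ScoreNet ranking; intermediate values trade one signal off against the other.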

3. Key Contributions

Novel Problem Definition: Formulated the specific task of discovering and ranking pre- and post-game sports insights.
Large-Scale Dataset: Curated a high-quality dataset of 7,900 articles across 800 matches with a rigorous two-step LLM validation pipeline.
Structured Insight Generation: Generated over 280,000 structured insights using sport-specific prompts and four state-of-the-art LLMs.
Hallucination-Aware Evaluation: Applied a dual-metric evaluation (FactScore + SummaC) to reveal significant reliability differences between LLMs.
SUMMIR Architecture: Introduced a reinforcement learning-based ranking system that combines semantic, emotional, and contextual features to prioritize insights effectively.

4. Results

Factual Accuracy: GPT-4o emerged as the most reliable model for insight generation, achieving FactScores of 95–97% and SummaC scores up to 72%.
Ranking Performance:
- The SUMMIR model (fine-tuned LLaMA 3.2 1B) achieved an NDCG@10 of 0.943 and Recall@10 of 0.960.
- SUMMIR outperformed models trained solely on NDCG or Recall metrics, demonstrating that combining ScoreNet priors with human gold rankings yields more stable and human-aligned results.
- On NDCG@3 the model scored 0.724 against 0.649 for the human baseline, though it lagged slightly in Recall@3.
Feature Importance: Ablation studies showed that Emotional Intensity and Named Entity Popularity were the most significant drivers for improving ranking quality.
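For readers unfamiliar with the ranking metrics reported in this section, the sketch below shows one common way to compute NDCG@k and Recall@k. It assumes the linear-gain form of DCG with a log2 position discount; which variant the authors used is not stated here, and the function names and example data are illustrative.

```python
import math
from typing import Dict, List, Set


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int) -> float:
    """DCG of the predicted order divided by DCG of the ideal order."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Share of relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


relevance = {"a": 3.0, "b": 2.0, "c": 0.0, "d": 1.0}
perfect = ["a", "b", "d", "c"]   # already in ideal order
swapped = ["c", "a", "b", "d"]   # an irrelevant item ranked first

print(ndcg_at_k(perfect, relevance, k=3))  # 1.0
print(ndcg_at_k(swapped, relevance, k=3))
print(recall_at_k(swapped, {"a", "b", "d"}, k=3))
```

NDCG rewards placing the most relevant insights near the top, while Recall@k only asks whether relevant insights made the cut at all, which is why a system can do well on one and lag on the other.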

5. Significance and Future Work

Significance: This work provides a robust, scalable framework for transforming raw sports news into actionable, factual, and engaging insights. It addresses the critical issue of LLM hallucinations in domain-specific summarization and offers a method to personalize content ranking.
Limitations & Future Directions:
- Error Analysis: The system was over-sensitive to famous players, ranking their mentions highly regardless of context, and it struggled with culturally specific sarcasm.
- Future Work: Extending the framework to non-sports domains (news, education), implementing adaptive reward balancing, incorporating user interaction signals for personalization, and automating prompt tuning via RLHF.

In conclusion, SUMMIR represents a significant step forward in Sports Analytics, bridging the gap between raw data retrieval and high-quality, user-centric insight generation while maintaining strict factual integrity.