Automatic Funny Scene Extraction from Long-form Cinematic Videos

This paper presents an end-to-end system that automatically extracts and ranks high-quality humorous scenes from long-form cinematic videos by integrating advanced shot detection, multimodal scene localization, and humor tagging. The system achieves significant performance improvements over state-of-the-art methods and shows strong potential for enhancing user engagement through automated, snackable content generation.

Sibendu Paul, Haotian Jiang, Caren Chen

Published 2026-02-18

Imagine you have a massive library of movies and TV shows, each one a long, winding story that takes hours to watch. Now, imagine you want to create a "highlight reel" of just the funniest moments to show a friend, or to play automatically when someone hovers their mouse over a title on a streaming app.

Doing this manually is like trying to find a specific needle in a haystack by looking at every single piece of hay one by one. It takes forever, and you might miss the good stuff.

This paper describes a smart, automated robot built by Amazon Prime Video that does this job for us. It's like a super-powered movie editor that watches the whole film, understands the story, and cuts out the jokes perfectly.

Here is how this "Robot Editor" works, broken down into three simple steps:

1. The "Shot Detective" (Finding the Pieces)

First, the robot has to figure out where one scene ends and another begins. Movies are made of thousands of tiny clips called "shots" (like individual photos in a flipbook).

  • The Problem: Sometimes the camera doesn't cut; it just zooms or pans. A human might miss the boundary, but the robot needs to know exactly where the story shifts.
  • The Solution: The robot uses a "detective" tool (called TransNetV2) that looks at the visual changes. But to get really good at it, the robot plays a game of "Find the Similar Twins."
    • Analogy: Imagine you have a bag of mixed-up puzzle pieces. The robot learns to group pieces that belong to the same picture (the same scene) and separate them from pieces that belong to different pictures. It does this by comparing "triplets" of clips: "These two look alike (same scene), but this third one looks totally different (different scene)." By practicing this game millions of times, it becomes an expert at spotting scene boundaries.
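The "Find the Similar Twins" game above is a triplet loss in disguise: the model is rewarded when two clips from the same scene sit closer together in feature space than a clip from a different scene. The paper's exact loss and margin aren't spelled out in this summary, so here is a minimal sketch of the idea with illustrative numbers:

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: pull same-scene clips together and
    push different-scene clips at least `margin` farther away.
    Inputs are clip embeddings (plain lists of floats here)."""
    d_pos = math.dist(anchor, positive)   # anchor vs. clip from the SAME scene
    d_neg = math.dist(anchor, negative)   # anchor vs. clip from a DIFFERENT scene
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: two look-alike clips and one outlier
same_a = [1.0, 0.0]
same_b = [0.9, 0.1]
other  = [0.0, 1.0]

print(triplet_loss(same_a, same_b, other))  # → 0.0 (already well separated)
```

When the negative clip drifts too close to the anchor, the loss turns positive, and millions of such corrections teach the model where one scene ends and the next begins.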

2. The "Context Reader" (Understanding the Story)

Once the robot has the scenes, it needs to know what's happening in them.

  • The Problem: Just looking at the pictures isn't enough. A joke often depends on what someone says or the tone of their voice. Also, movies are long, so the robot needs to remember what happened 30 seconds ago to understand a joke happening now.
  • The Solution: The robot is multimodal, meaning it uses multiple senses:
    • Eyes: It analyzes the video frames.
    • Ears: It listens for laughter or specific tones of voice.
  • Brain (Reading): It reads the subtitles (captions) that accompany the video.
    • Analogy: Think of the robot as a bilingual translator who can also read body language. If a character says, "I love this soup," but their face looks disgusted, a simple robot might take the words at face value. But our robot reads the text, sees the face, and hears the tone to realize, "Ah, this is sarcasm. That's a joke!"
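A common way to combine "eyes, ears, and reading" is late fusion: score each modality separately, then blend the scores. The paper's actual fusion mechanism isn't given in this summary, so the field names and weights below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class SceneSignals:
    visual_score: float   # e.g., exaggerated expressions spotted in the frames
    audio_score: float    # e.g., laughter or a sarcastic tone of voice
    text_score: float     # e.g., humor cues found in the subtitle text

def fuse_humor_score(s: SceneSignals, weights=(0.3, 0.4, 0.3)) -> float:
    """Late fusion: a weighted sum of per-modality humor scores in [0, 1].
    The weights here are made-up placeholders, not values from the paper."""
    wv, wa, wt = weights
    return wv * s.visual_score + wa * s.audio_score + wt * s.text_score

scene = SceneSignals(visual_score=0.6, audio_score=0.9, text_score=0.7)
print(round(fuse_humor_score(scene), 2))  # → 0.75
```

The sarcasm example works precisely because no single score is decisive: the text alone reads as praise, but the disgusted face and the tone pull the fused score toward "joke."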

3. The "Safety & Scoring Judge" (Filtering and Ranking)

Just because something makes people laugh doesn't mean it's a good joke for a public ad.

  • The Problem: Sometimes people laugh at bullying, mean-spirited mockery, or scary situations. You don't want to show those to a family. Also, you don't want to show every funny moment; you want the best ones.
  • The Solution:
    • The Safety Guard: The robot has a "bouncer" at the door. It listens for sounds like crying, screaming, or distress. If it hears those, it blocks the clip, even if people are laughing. It ensures the humor is "safe" and not mean.
    • The Scorer: The robot gives every funny scene a score based on a recipe:
      • How much laughter was there?
      • How long did the laughter last?
      • How clever was the dialogue?
      • How short and punchy is the clip?
    • Analogy: It's like a talent show judge. They don't just clap; they have a scorecard. They give points for a big laugh, points for a witty punchline, and deduct points if the clip is too long or boring.
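The judge's scorecard can be sketched as a gate followed by a weighted recipe: first the "bouncer" blocks any clip with distress sounds, then laughter, wit, and brevity are combined into one number. The weights and thresholds below are illustrative stand-ins, not the paper's actual formula:

```python
def score_clip(laughter_intensity: float,
               laughter_duration_s: float,
               dialogue_wit: float,
               clip_length_s: float,
               distress_detected: bool) -> float:
    """Safety gate + scoring recipe (all constants are made up for illustration).
    Scores laughter amount/duration and dialogue wit, penalizes long clips,
    and returns 0.0 outright if distress sounds were detected."""
    if distress_detected:                               # the "bouncer" at the door
        return 0.0
    raw = (0.4 * laughter_intensity                     # how much laughter
           + 0.3 * min(laughter_duration_s / 5.0, 1.0)  # how long it lasted
           + 0.3 * dialogue_wit)                        # how clever the dialogue
    length_penalty = min(clip_length_s / 30.0, 1.0)     # shorter = punchier
    return raw * (1.0 - 0.5 * length_penalty)

print(score_clip(0.9, 4.0, 0.8, 12.0, distress_detected=False))  # → 0.672
print(score_clip(0.9, 4.0, 0.8, 12.0, distress_detected=True))   # → 0.0
```

Ranking all candidate scenes by this kind of score is what turns "every funny moment" into "the best few" for the highlight reel.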

The Results: Did it Work?

The team tested this robot on five different movies and eleven trailers.

  • Accuracy: It found the right scenes 98% of the time.
  • Quality: When human experts reviewed the clips the robot picked, 87% of them were actually intended to be funny by the movie makers.
  • Improvement: It was significantly better than previous methods at finding where scenes start and stop (an 18.3% improvement!).

Why Does This Matter?

In the past, a human editor had to watch hours of a movie to find a 10-second funny clip. This system does it automatically, instantly, and for thousands of movies at once.

The Big Picture:
This technology is like giving every streaming platform a magic wand. Instead of waiting for a human to find the funny moments, the wand instantly creates a "snackable" preview that makes you smile and want to watch the show. It makes finding entertainment faster, easier, and more fun for everyone.
