Automatic Funny Scene Extraction from Long-form Cinematic Videos

This paper presents an end-to-end system that automatically extracts and ranks high-quality humorous scenes from long-form cinematic videos by integrating advanced shot detection, multimodal scene localization, and humor tagging. The system achieves significant performance improvements over state-of-the-art methods and shows strong potential for enhancing user engagement through automated, snackable content generation.

Sibendu Paul, Haotian Jiang, Caren Chen

Published 2026-02-18

Imagine you have a massive library of movies and TV shows, each one a long, winding story that takes hours to watch. Now, imagine you want to create a "highlight reel" of just the funniest moments to show a friend, or to play automatically when someone hovers their mouse over a title on a streaming app.

Doing this manually is like trying to find a specific needle in a haystack by looking at every single piece of hay one by one. It takes forever, and you might miss the good stuff.

This paper describes a smart, automated robot built by Amazon Prime Video that does this job for us. It's like a super-powered movie editor that watches the whole film, understands the story, and cuts out the jokes perfectly.

Here is how this "Robot Editor" works, broken down into three simple steps:

1. The "Shot Detective" (Finding the Pieces)

First, the robot has to figure out where one scene ends and another begins. Movies are made of thousands of tiny clips called "shots" (like individual photos in a flipbook).

  • The Problem: Sometimes the camera doesn't cut; it just zooms or pans. A human might miss the boundary, but the robot needs to know exactly where the story shifts.
  • The Solution: The robot uses a "detective" tool (called TransNetV2) that looks at the visual changes. But to get really good at it, the robot plays a game of "Find the Similar Twins."
    • Analogy: Imagine you have a bag of mixed-up puzzle pieces. The robot learns to group pieces that belong to the same picture (the same scene) and separate them from pieces that belong to different pictures. It does this by comparing "triplets" of clips: "These two look alike (same scene), but this third one looks totally different (different scene)." By practicing this game millions of times, it becomes an expert at spotting scene boundaries.
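The "Find the Similar Twins" game above is a triplet loss in disguise: the model is rewarded when two clips from the same scene sit closer together in feature space than a clip from a different scene. The paper's exact loss and margin aren't spelled out in this summary, so here is a minimal sketch of the idea with illustrative numbers:

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hinge-style triplet loss: pull same-scene clips together and
    push different-scene clips at least `margin` farther away.
    Inputs are clip embeddings (plain lists of floats here)."""
    d_pos = math.dist(anchor, positive)   # anchor vs. clip from the SAME scene
    d_neg = math.dist(anchor, negative)   # anchor vs. clip from a DIFFERENT scene
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: two look-alike clips and one outlier
same_a = [1.0, 0.0]
same_b = [0.9, 0.1]
other  = [0.0, 1.0]

print(triplet_loss(same_a, same_b, other))  # → 0.0 (already well separated)
```

When the negative clip drifts too close to the anchor, the loss turns positive, and millions of such corrections teach the model where one scene ends and the next begins.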

2. The "Context Reader" (Understanding the Story)

Once the robot has the scenes, it needs to know what's happening in them.

  • The Problem: Just looking at the pictures isn't enough. A joke often depends on what someone says or the tone of their voice. Also, movies are long, so the robot needs to remember what happened 30 seconds ago to understand a joke happening now.
  • The Solution: The robot is multimodal, meaning it uses multiple senses:
    • Eyes: It analyzes the video frames.
    • Ears: It listens for laughter or specific tones of voice.
  • Brain (Reading): It reads the subtitles (captions) that accompany the video.
    • Analogy: Think of the robot as a bilingual translator who can also read body language. If a character says, "I love this soup," but their face looks disgusted, a simple robot might take the words at face value. But our robot reads the text, sees the face, and hears the tone to realize, "Ah, this is sarcasm. That's a joke!"
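A common way to combine "eyes, ears, and reading" is late fusion: score each modality separately, then blend the scores. The paper's actual fusion mechanism isn't given in this summary, so the field names and weights below are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class SceneSignals:
    visual_score: float   # e.g., exaggerated expressions spotted in the frames
    audio_score: float    # e.g., laughter or a sarcastic tone of voice
    text_score: float     # e.g., humor cues found in the subtitle text

def fuse_humor_score(s: SceneSignals, weights=(0.3, 0.4, 0.3)) -> float:
    """Late fusion: a weighted sum of per-modality humor scores in [0, 1].
    The weights here are made-up placeholders, not values from the paper."""
    wv, wa, wt = weights
    return wv * s.visual_score + wa * s.audio_score + wt * s.text_score

scene = SceneSignals(visual_score=0.6, audio_score=0.9, text_score=0.7)
print(round(fuse_humor_score(scene), 2))  # → 0.75
```

The sarcasm example works precisely because no single score is decisive: the text alone reads as praise, but the disgusted face and the tone pull the fused score toward "joke."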

3. The "Safety & Scoring Judge" (Filtering and Ranking)

Just because something makes people laugh doesn't mean it's a good joke for a public ad.

  • The Problem: Sometimes people laugh at bullying, mean-spirited mockery, or scary situations. You don't want to show those to a family. Also, you don't want to show every funny moment; you want the best ones.
  • The Solution:
    • The Safety Guard: The robot has a "bouncer" at the door. It listens for sounds like crying, screaming, or distress. If it hears those, it blocks the clip, even if people are laughing. It ensures the humor is "safe" and not mean.
    • The Scorer: The robot gives every funny scene a score based on a recipe:
      • How much laughter was there?
      • How long did the laughter last?
      • How clever was the dialogue?
      • How short and punchy is the clip?
    • Analogy: It's like a talent show judge. They don't just clap; they have a scorecard. They give points for a big laugh, points for a witty punchline, and deduct points if the clip is too long or boring.
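The judge's scorecard can be sketched as a gate followed by a weighted recipe: first the "bouncer" blocks any clip with distress sounds, then laughter, wit, and brevity are combined into one number. The weights and thresholds below are illustrative stand-ins, not the paper's actual formula:

```python
def score_clip(laughter_intensity: float,
               laughter_duration_s: float,
               dialogue_wit: float,
               clip_length_s: float,
               distress_detected: bool) -> float:
    """Safety gate + scoring recipe (all constants are made up for illustration).
    Scores laughter amount/duration and dialogue wit, penalizes long clips,
    and returns 0.0 outright if distress sounds were detected."""
    if distress_detected:                               # the "bouncer" at the door
        return 0.0
    raw = (0.4 * laughter_intensity                     # how much laughter
           + 0.3 * min(laughter_duration_s / 5.0, 1.0)  # how long it lasted
           + 0.3 * dialogue_wit)                        # how clever the dialogue
    length_penalty = min(clip_length_s / 30.0, 1.0)     # shorter = punchier
    return raw * (1.0 - 0.5 * length_penalty)

print(score_clip(0.9, 4.0, 0.8, 12.0, distress_detected=False))  # → 0.672
print(score_clip(0.9, 4.0, 0.8, 12.0, distress_detected=True))   # → 0.0
```

Ranking all candidate scenes by this kind of score is what turns "every funny moment" into "the best few" for the highlight reel.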

The Results: Did it Work?

The team tested this robot on five different movies and eleven trailers.

  • Accuracy: It found the right scenes 98% of the time.
  • Quality: When human experts reviewed the clips the robot picked, 87% of them were actually intended to be funny by the movie makers.
  • Improvement: It was significantly better than previous methods at finding where scenes start and stop (an 18.3% improvement!).

Why Does This Matter?

In the past, a human editor had to watch hours of a movie to find a 10-second funny clip. This system does it automatically, instantly, and for thousands of movies at once.

The Big Picture:
This technology is like giving every streaming platform a magic wand. Instead of waiting for a human to find the funny moments, the wand instantly creates a "snackable" preview that makes you smile and want to watch the show. It makes finding entertainment faster, easier, and more fun for everyone.
