Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation

Imagine you are a movie director tasked with creating a 2-minute trailer for a 2-hour blockbuster. Your job is to watch the whole movie, pick out the most exciting scenes, and arrange them in a perfect order that tells a story, builds suspense, and makes people want to buy a ticket.

Doing this manually is hard. Doing it automatically with a computer is even harder. This paper introduces a new AI method called SSMP (Self-paced and Self-corrective Masked Prediction) that tackles this problem by teaching the computer to work more like a human editor.

Here is the breakdown of how it works, using simple analogies:

1. The Problem: The "One-Way Street" Mistake

Most previous AI methods tried to make trailers in two separate steps:

Pick the shots: "Okay, I'll grab the 20 best scenes."
Order them: "Now, I'll arrange those 20 scenes."

The Analogy: Imagine you are building a puzzle, but you are only allowed to look at one piece at a time. You pick a piece, glue it down, and then move to the next. If you pick the wrong piece for the first spot, you are stuck with it. You can't go back and fix it later. This leads to a messy puzzle (a bad trailer) because the AI can't see the "big picture" while making early decisions. This is called error propagation.

2. The Solution: The "Fill-in-the-Blanks" Game

The authors propose a new way called SSMP. Instead of picking and ordering one by one, the AI plays a game of "Fill-in-the-Blanks."

The Analogy: Imagine you have a blank movie trailer script with 20 empty slots. The AI sees the whole movie (the source material) and tries to guess what goes in all 20 slots at once.

It doesn't just guess one; it guesses all of them simultaneously.
Then, it looks at its own guesses. "Hmm, I'm 90% sure about slot #5, but I'm only 40% sure about slot #12."
The Magic Step: It keeps the confident guesses (slot #5) but erases the unsure ones (slot #12) and tries to guess them again, this time using the information from the confident ones it just kept.

It repeats this process, slowly filling in the blanks, getting more confident with every round, until the whole trailer is complete. This is the Self-Corrective part. It mimics how a human editor works: "I think this scene goes here... wait, no, that doesn't fit with the next scene. Let me swap it."

3. The Training: The "Video Game Level" System

To teach the AI how to do this, the researchers used a clever training method called Self-Paced Learning.

The Analogy: Imagine teaching a child to ride a bike.

Old Method: You put them on a bike on a steep hill immediately. They crash, get scared, and quit.
SSMP Method: You start them on a flat, grassy field (easy level). Once they get good, you move them to a slight slope. As they master that, you move them to a bigger hill.

The AI starts by trying to fill in the trailer with only a few "blanks" (easy task). As it gets better at the job, the system automatically adds more blanks (harder task). This ensures the AI learns steadily without getting overwhelmed or bored.

4. The Result: A Better Trailer

Because this method allows the AI to look at the whole trailer at once and fix its mistakes along the way, the results are much better.

Better Story: The scenes flow logically because the AI can see how Scene A affects Scene Z.
Better Timing: The scenes are arranged in a rhythm that feels natural, not random.
Human-Like: It doesn't just follow a rigid rule; it iterates and refines, just like a real human editor.

Summary

In short, previous AIs were like students taking a test who had to write answers from left to right without erasing. If they made a mistake on question 1, they failed the whole test.

SSMP is like a student who can write all the answers, check their work, erase the ones they aren't sure about, and rewrite them with better context. It learns at its own speed, starting easy and getting harder, resulting in a movie trailer that actually feels like it was made by a human.

1. Problem Statement

Movie trailer generation is a complex video editing task requiring the selection and reorganization of movie shots to create a coherent, engaging narrative. Existing automatic methods generally fall into two paradigms, both of which suffer from inevitable error propagation:

Selection-then-Ranking: These methods first select key shots and then rank them. This decouples the two steps, preventing the model from jointly reasoning about semantic relevance and temporal continuity.
Auto-Regressive (AR) Generation: These methods predict shots sequentially (one by one). While they model context, they lack a self-correction mechanism. Once an early shot is predicted incorrectly, the error propagates to all subsequent predictions. Human editors, by contrast, iteratively refine and replace shots.

The core challenge is to develop a generation framework that allows for bi-directional contextual modeling and progressive self-correction to mimic the iterative refinement process of human editors.

2. Methodology: SSMP

The authors propose SSMP (Self-paced and Self-corrective Masked Prediction), a novel framework that formulates trailer generation as a masked prediction problem using a Transformer encoder.

A. Core Architecture

Input: The model takes the full movie shot sequence ( $M$ ) as a prompt and the trailer shot sequence ( $V$ ) as the target.
Masked Prediction: Instead of generating tokens sequentially, the model reconstructs masked trailer shots based on the unmasked movie shots and the partially masked trailer context.
Bi-directional Modeling: Unlike AR models, SSMP attends to all available context simultaneously, allowing it to understand global dependencies between shots.
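
The masked-prediction setup described above can be illustrated with a minimal sketch. It only shows how a training example might be built from a trailer by hiding some of its positions; the `MASK` placeholder, the function name `mask_trailer`, and the use of integer shot indices are assumptions made for illustration, not details taken from the paper.

```python
import random

MASK = None  # hypothetical placeholder for a hidden trailer slot


def mask_trailer(trailer_shot_ids, mask_ratio, rng):
    """Hide a fraction of trailer positions.

    Returns the partially masked trailer and a dict mapping each hidden
    position to the movie-shot index the model would have to recover.
    """
    n = len(trailer_shot_ids)
    # Hide at least one position, and never more than the trailer holds.
    n_mask = min(n, max(1, round(mask_ratio * n)))
    hidden = set(rng.sample(range(n), n_mask))
    masked = [MASK if i in hidden else s for i, s in enumerate(trailer_shot_ids)]
    targets = {i: trailer_shot_ids[i] for i in sorted(hidden)}
    return masked, targets


# Usage: a 4-shot trailer whose entries index into the movie's shot list.
rng = random.Random(0)
trailer = [12, 40, 7, 93]
masked, targets = mask_trailer(trailer, 0.5, rng)
```

In the full model, the movie shots would be passed alongside `masked` as the prompt, so every hidden slot can attend to the whole movie and to the visible trailer shots on both sides.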

B. Training: Self-Paced Mask Ratio Scheduler

To optimize training efficiency and stability, the authors introduce a self-paced learning strategy for the mask ratio ( $t$ ):

Dynamic Difficulty: The mask ratio is not fixed. It starts low (easy tasks) and increases as the model improves.
Momentum-Based Scheduler: The scheduler adjusts $t$ based on the current training accuracy ( $a_n$ ) and historical momentum ( $b_n$ ).
- If accuracy is high, the mask ratio increases (task becomes harder).
- If accuracy is low, the ratio is maintained.
- Monotonicity Constraint: The mask ratio is constrained to be non-decreasing, ensuring the model does not retreat to easier tasks once it has mastered a certain difficulty level.
Loss Function: The model is trained using Cross-Entropy (CE) loss to maximize the conditional likelihood of predicting the correct ground-truth movie shot for each masked trailer position.
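
The scheduler behaviour listed above (raise the ratio when accuracy is high, hold it otherwise, never decrease) can be sketched as follows. The summary does not give the exact update rule, so everything specific here is an assumption: the exponential moving average used for the momentum $b_n$, the fixed `threshold` and `step`, and the class name itself.

```python
class SelfPacedMaskScheduler:
    """Illustrative monotone mask-ratio scheduler (not the paper's exact rule)."""

    def __init__(self, t_init=0.1, t_max=1.0, step=0.05, beta=0.9, threshold=0.8):
        self.t = t_init              # current mask ratio t
        self.t_max = t_max           # upper bound on the mask ratio
        self.step = step             # assumed fixed increment
        self.beta = beta             # assumed smoothing factor for the momentum
        self.threshold = threshold   # assumed accuracy level that counts as "high"
        self.b = 0.0                 # momentum b_n: smoothed accuracy history

    def update(self, accuracy):
        """Consume the current training accuracy a_n and return the new ratio."""
        self.b = self.beta * self.b + (1 - self.beta) * accuracy
        if self.b >= self.threshold:
            # Accuracy is high: make the task harder, up to t_max.
            self.t = min(self.t_max, self.t + self.step)
        # Otherwise the ratio is held, so t never decreases.
        return self.t


# Usage: one call per evaluation interval during training.
scheduler = SelfPacedMaskScheduler(t_init=0.1, step=0.1, beta=0.5, threshold=0.6)
ratios = [scheduler.update(a) for a in (0.9, 0.9, 0.2, 0.9)]
```

Because the ratio is only ever increased or held, a temporary drop in accuracy after the task gets harder does not send the model back to an easier setting.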

C. Generation: Progressive Self-Correction

During inference, the model generates the trailer through an iterative process:

Initialization: Start with a fully masked trailer sequence.
Prediction: The model predicts a movie shot for every masked position simultaneously.
Confidence Scoring: A confidence vector ( $q$ ) is updated for each position based on the predicted probability.
Re-masking (Self-Correction):
- High-confidence shots are "filled" (locked in).
- Low-confidence shots are re-masked and re-predicted in the next iteration.
Convergence: The process repeats until all positions are filled. This allows the model to correct early mistakes by reconsidering uncertain positions in the context of newly confirmed shots.
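
The generation loop above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `predict` is a hypothetical stand-in for the trained model that returns a `(shot_id, confidence)` pair for every position, and the rule for how many positions to lock in per round (an even split over the remaining iterations) is an assumption, since the summary does not specify the schedule.

```python
import math

MASK = None  # hypothetical placeholder for an unfilled trailer slot


def iterative_fill(predict, length, num_iters):
    """Fill a fully masked trailer over several rounds, keeping confident guesses."""
    seq = [MASK] * length
    for it in range(num_iters):
        masked = [i for i, s in enumerate(seq) if s is MASK]
        if not masked:
            break
        preds = predict(seq)  # one (shot_id, confidence) pair per position
        # Assumed schedule: spread the remaining slots evenly over remaining rounds.
        n_keep = math.ceil(len(masked) / (num_iters - it))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_keep]
        for i in most_confident:
            seq[i] = preds[i][0]  # lock in; the other slots stay masked for the next round
    return seq


def toy_predict(seq):
    """Toy model: confidence grows when a neighbouring slot is already filled."""
    out = []
    for i in range(len(seq)):
        filled_neighbours = sum(
            1 for j in (i - 1, i + 1) if 0 <= j < len(seq) and seq[j] is not MASK
        )
        out.append((100 + i, 0.5 + 0.2 * filled_neighbours - 0.01 * i))
    return out


trailer = iterative_fill(toy_predict, length=5, num_iters=3)
```

The toy model makes the key point visible: slots that were uncertain in the first round become more confident once their neighbours are fixed, which is what lets later rounds revise early low-confidence guesses.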

3. Key Contributions

Novel Paradigm: First attempt to formulate movie trailer generation as a masked prediction problem with bi-directional context, moving away from selection-rank or auto-regressive paradigms.
Self-Correction Mechanism: Introduced a progressive re-masking strategy that mimics human editors' iterative refinement, effectively mitigating error propagation.
Self-Paced Learning: Developed a dynamic mask ratio scheduler that adapts task difficulty to the model's learning curve, improving convergence speed and final performance.
State-of-the-Art Performance: Demonstrated superior results across multiple datasets and evaluation metrics compared to existing methods.

4. Experimental Results

The method was evaluated on the CMTD dataset (Test-8, Test-74) and on Test-2024, a new set of recently released movies.

Quantitative Metrics:
- Shot Selection: SSMP achieved the highest F1-scores (0.1618 on Test-8, 0.2373 on Test-74), outperforming the best selection-then-ranking method (MMSC) by ~2-4%.
- Shot Ranking: SSMP significantly improved Pairwise Agreement Accuracy (AA) by 10-17% over baselines, indicating superior temporal coherence.
- Levenshtein Distance (LD): Achieved the lowest LD, meaning the generated shot sequences are closest to the official trailers.
Qualitative & User Study:
- A user study with 25 participants rated SSMP higher than baselines in Theme, Rhythm, Attractiveness, and Appropriateness.
- Visualizations showed that the iterative generation process successfully corrected initial prediction errors, leading to more accurate shot ordering.
Ablation Studies:
- Mask Ratio: The self-paced scheduler outperformed random, linearly increasing, and linearly decreasing strategies.
- Self-Correction: Removing the self-correction mechanism (using a greedy strategy) resulted in lower F1 and AA scores.
- Loss Function: Cross-Entropy loss proved superior to Mean Squared Error (MSE) for this task.
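
Two of the metrics reported above, the shot-selection F1-score and the Levenshtein distance, can be sketched as follows. The code assumes that a trailer is represented as a list of movie-shot indices and that F1 is computed on the sets of selected shots; the paper's exact evaluation protocol may differ, and the function names are illustrative.

```python
def selection_f1(predicted, reference):
    """F1 between the set of predicted shots and the set of official trailer shots."""
    pred, ref = set(predicted), set(reference)
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


# Usage: compare a generated trailer against the official one.
generated = [5, 12, 40, 7]
official = [12, 40, 93, 7]
f1 = selection_f1(generated, official)
distance = levenshtein(generated, official)
```

F1 ignores order and only rewards picking the right shots, while the Levenshtein distance is sensitive to both the choice and the order of shots, which is why the two are reported side by side.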

5. Significance

This paper represents a significant shift in video summarization and editing tasks. By treating trailer generation as a masked prediction problem rather than a sequential generation task, SSMP addresses the fundamental limitation of error propagation inherent in auto-regressive models. The integration of self-paced learning and iterative self-correction provides a robust framework that not only improves quantitative metrics but also aligns more closely with the cognitive process of human editors. The method sets a new state-of-the-art benchmark for automatic movie trailer generation and offers a potential blueprint for other video editing and summarization tasks requiring global context and iterative refinement.