StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles

This paper introduces StoryMovie, a dataset of 1,757 visual stories aligned with movie scripts and subtitles via LCS matching. The dataset is used to fine-tune the Qwen Storyteller3 model, significantly improving semantic alignment and dialogue-attribution accuracy over models that rely solely on visual grounding.

Daniel Oliveira, David Martins de Matos

Published 2026-02-26

Imagine you are trying to tell a story about a series of photos you took at a family reunion. You have the pictures, but you don't remember who everyone is, what they were saying, or why they were laughing.

If you ask a standard AI to write a story based only on those photos, it might get the visual details right ("a man in a blue shirt is holding a cake"), but it will likely hallucinate the rest. It might guess that the man is a stranger named "Bob" when he's actually your Uncle Joe. It might invent a romantic scene between two cousins who are just arguing about the weather. It sees the faces, but it doesn't know the script.

This paper introduces a solution called StoryMovie, which acts like a "truth-telling assistant" for AI storytellers. Here is how it works, broken down into simple concepts:

1. The Problem: The AI is a "Visual Guessing Game" Player

Previous AI models were great at describing what they saw (visual grounding). They could point to a dog and say, "That's a dog." But when it came to the story, they were like a person watching a movie with the sound off and the subtitles covered up. They had to guess the plot.

  • The Mistake: They might say, "The hero and the villain are hugging," when the script actually says they are fighting.
  • The Result: The story looks real but feels fake because the relationships and dialogue are made up.

2. The Solution: The "Script + Subtitle" Sandwich

The researchers realized that to tell the real story, the AI needs to read both the movie script (the blueprint) and the subtitles (the timing).

  • The Script: Tells you who is speaking, what they are saying, and how they feel (e.g., "John says this angrily"). But it doesn't tell you exactly when it happens in the video.
  • The Subtitles: Tell you exactly when a line is spoken, but they often don't say who is speaking.
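The complementarity above is easy to see as data. Here is a toy sketch (the field names and values are illustrative assumptions, not the paper's actual schema): a script entry knows *who* and *how*, a subtitle entry knows *when*, and only the merged record is a fully grounded line of dialogue.

```python
# Toy illustration of the two sources (field names are assumptions):
# the script entry knows WHO speaks and HOW; the subtitle knows WHEN.
script_entry = {"speaker": "JOHN", "emotion": "angrily", "text": "I never wanted this!"}
subtitle_entry = {"start": "00:01:02", "end": "00:01:04", "text": "I never wanted this."}

# Once the two are matched, merging them yields a timed, attributed line.
aligned = {
    **subtitle_entry,
    "speaker": script_entry["speaker"],
    "emotion": script_entry["emotion"],
}
print(aligned["start"], aligned["speaker"], aligned["emotion"])
```

Neither record alone is enough; the matching step in the next section is what licenses the merge.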

The Magic Trick (LCS Matching):
The team built a pipeline that acts like a puzzle solver. It uses a classic algorithm called the Longest Common Subsequence (think of it as finding the longest run of words that appears, in the same order, in both texts) to lock each script line to the exact moment it appears in the subtitles.

  • Analogy: Imagine you have a recipe (the script) and a timer (the subtitles). The AI matches the recipe step "Add salt" to the exact second the timer beeps, so it knows exactly when the salt was added.
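The matching idea can be sketched in a few lines. This is a minimal toy version, not the paper's pipeline (their tokenization, normalization, and matching thresholds are not described here): it computes a word-level LCS between a script line and each subtitle, then picks the subtitle with the highest overlap, which carries the timestamp.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def best_subtitle(script_text, subtitles):
    """Return the subtitle whose words overlap most, in order, with the script line."""
    words = script_text.lower().split()
    return max(
        subtitles,
        key=lambda sub: lcs_len(words, sub["text"].lower().split()) / max(len(words), 1),
    )

# Hypothetical data for illustration:
subtitles = [
    {"start": "00:01:02", "text": "I never wanted this."},
    {"start": "00:01:05", "text": "You always say that, John."},
]
script_line = {"speaker": "JOHN", "emotion": "angrily", "text": "I never wanted this!"}

match = best_subtitle(script_line["text"], subtitles)
print(match["start"], "→", script_line["speaker"])  # the line gets its timestamp
```

Because LCS tolerates small differences (here the punctuation "this!" vs. "this."), the script line still locks onto the right subtitle, and the speaker label travels with the timestamp.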

3. The New Dataset: "StoryMovie"

Using this puzzle-solving method, they created a new dataset called StoryMovie. It contains 1,757 stories where the AI didn't just guess the plot; it was fed the actual movie script and subtitles.

  • Now, instead of guessing the man in the blue shirt is "Bob," the AI knows he is "Uncle Joe."
  • Instead of inventing a fake argument, the AI knows exactly what Uncle Joe said and that he was saying it angrily.

4. The New Model: "Qwen Storyteller3"

They took a smart AI model (Qwen) and gave it a special training course using this new dataset.

  • Stage 1: It learned to point at objects (Visual Grounding).
  • Stage 2: It learned to recognize the same person across different photos (Entity Re-identification).
  • Stage 3 (The New Stuff): It learned to read the script. It learned that "John" is the guy in the red hat and that he is supposed to be sad, not happy.

5. The Results: From "Guessing" to "Knowing"

When they tested the new model against the old ones, the difference was like night and day:

  • Dialogue: The old model got the speaker wrong about 90% of the time; the new model attributed dialogue correctly nearly 90% of the time.
  • Relationships: The old model often thought strangers were lovers. The new model knew they were just neighbors.
  • Facts: In a quiz about the story, the new model got 94% of the answers right, while the old model only got 63%.

The Big Takeaway

Think of visual storytelling like a blindfolded detective.

  • Old AI: The detective looks at a crime scene photo and guesses, "The butler did it because he looks suspicious." (It's a guess based on looks).
  • New AI (StoryMovie): The detective puts on the blindfold, but someone hands them the police transcript and the witness timeline. Now, the detective knows, "The butler didn't do it; the gardener confessed at 3:00 PM."

By combining what the AI sees with what the movie actually says, the new system stops making up fake stories and starts telling the true story, complete with the right names, the right words, and the right emotions.
