StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles

This paper introduces StoryMovie, a dataset of 1,757 visual stories aligned with movie scripts and subtitles via LCS matching. The dataset is used to fine-tune the Qwen Storyteller3 model, significantly improving semantic alignment and dialogue-attribution accuracy over models that rely solely on visual grounding.

Daniel Oliveira, David Martins de Matos

Published 2026-02-26

Imagine you are trying to tell a story about a series of photos you took at a family reunion. You have the pictures, but you don't remember who everyone is, what they were saying, or why they were laughing.

If you ask a standard AI to write a story based only on those photos, it might get the visual details right ("a man in a blue shirt is holding a cake"), but it will likely hallucinate the rest. It might guess that the man is a stranger named "Bob" when he's actually your Uncle Joe. It might invent a romantic scene between two cousins who are just arguing about the weather. It sees the faces, but it doesn't know the script.

This paper introduces a solution called StoryMovie, which acts like a "truth-telling assistant" for AI storytellers. Here is how it works, broken down into simple concepts:

1. The Problem: The AI is a "Visual Guessing Game" Player

Previous AI models were great at describing what they saw (visual grounding). They could point to a dog and say, "That's a dog." But when it came to the story, they were like a person watching a movie with the sound off and the subtitles covered up. They had to guess the plot.

  • The Mistake: They might say, "The hero and the villain are hugging," when the script actually says they are fighting.
  • The Result: The story looks real but feels fake because the relationships and dialogue are made up.

2. The Solution: The "Script + Subtitle" Sandwich

The researchers realized that to tell the real story, the AI needs to read both the movie script (the blueprint) and the subtitles (the timing).

  • The Script: Tells you who is speaking, what they are saying, and how they feel (e.g., "John says this angrily"). But it doesn't tell you exactly when it happens in the video.
  • The Subtitles: Tell you exactly when a line is spoken, but they often don't say who is speaking.
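The complementarity above is easy to see as data. Here is a toy sketch (the field names and values are illustrative assumptions, not the paper's actual schema): a script entry knows *who* and *how*, a subtitle entry knows *when*, and only the merged record is a fully grounded line of dialogue.

```python
# Toy illustration of the two sources (field names are assumptions):
# the script entry knows WHO speaks and HOW; the subtitle knows WHEN.
script_entry = {"speaker": "JOHN", "emotion": "angrily", "text": "I never wanted this!"}
subtitle_entry = {"start": "00:01:02", "end": "00:01:04", "text": "I never wanted this."}

# Once the two are matched, merging them yields a timed, attributed line.
aligned = {
    **subtitle_entry,
    "speaker": script_entry["speaker"],
    "emotion": script_entry["emotion"],
}
print(aligned["start"], aligned["speaker"], aligned["emotion"])
```

Neither record alone is enough; the matching step in the next section is what licenses the merge.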

The Magic Trick (LCS Matching):
The team built a pipeline that acts like a puzzle solver. It uses a classic algorithm called the Longest Common Subsequence (think of it as finding the longest run of words that appears, in the same order, in both texts) to lock each script line to the exact moment it appears in the subtitles.

  • Analogy: Imagine you have a recipe (the script) and a timer (the subtitles). The AI matches the recipe step "Add salt" to the exact second the timer beeps, so it knows exactly when the salt was added.
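The matching idea can be sketched in a few lines. This is a minimal toy version, not the paper's pipeline (their tokenization, normalization, and matching thresholds are not described here): it computes a word-level LCS between a script line and each subtitle, then picks the subtitle with the highest overlap, which carries the timestamp.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists (classic DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def best_subtitle(script_text, subtitles):
    """Return the subtitle whose words overlap most, in order, with the script line."""
    words = script_text.lower().split()
    return max(
        subtitles,
        key=lambda sub: lcs_len(words, sub["text"].lower().split()) / max(len(words), 1),
    )

# Hypothetical data for illustration:
subtitles = [
    {"start": "00:01:02", "text": "I never wanted this."},
    {"start": "00:01:05", "text": "You always say that, John."},
]
script_line = {"speaker": "JOHN", "emotion": "angrily", "text": "I never wanted this!"}

match = best_subtitle(script_line["text"], subtitles)
print(match["start"], "→", script_line["speaker"])  # the line gets its timestamp
```

Because LCS tolerates small differences (here the punctuation "this!" vs. "this."), the script line still locks onto the right subtitle, and the speaker label travels with the timestamp.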

3. The New Dataset: "StoryMovie"

Using this puzzle-solving method, they created a new dataset called StoryMovie. It contains 1,757 stories where the AI didn't just guess the plot; it was fed the actual movie script and subtitles.

  • Now, instead of guessing the man in the blue shirt is "Bob," the AI knows he is "Uncle Joe."
  • Instead of inventing a fake argument, the AI knows exactly what Uncle Joe said and that he was saying it angrily.

4. The New Model: "Qwen Storyteller3"

They took a smart AI model (Qwen) and gave it a special training course using this new dataset.

  • Stage 1: It learned to point at objects (Visual Grounding).
  • Stage 2: It learned to recognize the same person across different photos (Entity Re-identification).
  • Stage 3 (The New Stuff): It learned to read the script. It learned that "John" is the guy in the red hat and that he is supposed to be sad, not happy.

5. The Results: From "Guessing" to "Knowing"

When they tested the new model against the old ones, the difference was like night and day:

  • Dialogue: The old model got the speaker wrong about 90% of the time; the new model attributed dialogue correctly nearly 90% of the time.
  • Relationships: The old model often thought strangers were lovers. The new model knew they were just neighbors.
  • Facts: In a quiz about the story, the new model got 94% of the answers right, while the old model only got 63%.

The Big Takeaway

Think of visual storytelling like a blindfolded detective.

  • Old AI: The detective looks at a crime scene photo and guesses, "The butler did it because he looks suspicious." (It's a guess based on looks).
  • New AI (StoryMovie): The detective puts on the blindfold, but someone hands them the police transcript and the witness timeline. Now, the detective knows, "The butler didn't do it; the gardener confessed at 3:00 PM."

By combining what the AI sees with what the movie actually says, the new system stops making up fake stories and starts telling the true story, complete with the right names, the right words, and the right emotions.
