Imagine you are watching a long, unedited home video of a birthday party. You want to create a highlight reel with short clips and captions like "The boy blows out the candles" or "The cake falls on the floor."
Doing this manually is a nightmare. You have to scrub through the video, find the exact second the candle is lit, the exact second it's blown out, and write a sentence for each. This task is what researchers call Dense Video Captioning.
For a long time, computers could only do this if humans gave them a massive, expensive manual with exact timestamps for every single event. But that's too much work. So, researchers tried Weakly-Supervised learning: teaching the computer using only the sentences (the captions) without the timestamps, hoping the computer could figure out when things happened just by reading the text.
The problem? The computers were bad at it. They were like a clumsy editor who just sliced the video into equal-sized chunks (e.g., "First 10 seconds," "Next 10 seconds") and guessed what happened. They didn't understand that "blowing out candles" is a 2-second event, while "eating cake" might be 30 seconds. They just sliced blindly.
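To make the blind slicing concrete, here is a toy sketch (not any real baseline's code) of what uniform segmentation looks like: every proposed event gets the same fixed length, no matter what is actually happening on screen.

```python
def uniform_segments(video_length_s, chunk_s=10):
    """Slice a video into equal-sized chunks, ignoring content.

    This mimics the 'clumsy editor' baseline: a 2-second
    candle-blowing moment and a 30-second cake-eating scene
    both get forced into identical 10-second windows.
    """
    segments = []
    start = 0
    while start < video_length_s:
        end = min(start + chunk_s, video_length_s)
        segments.append((start, end))
        start = end
    return segments

print(uniform_segments(35))  # [(0, 10), (10, 20), (20, 30), (30, 35)]
```

Notice that the 35-second video is cut at 10, 20, and 30 seconds regardless of whether anything interesting happens at those moments.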
Enter SAIL, a new method from researchers at Hanyang University. Think of SAIL as a smart, intuitive film editor that uses two superpowers to fix this mess.
1. The "Magnet" Strategy (Similarity-Aware Guidance)
The Old Way: Imagine trying to find a specific book in a library by picking up a book from every tenth shelf. You might grab a book about cooking when you're looking for a book about cars. The old computer methods did the same: they sliced out video segments at fixed, arbitrary intervals and hoped the captions matched.
The SAIL Way: SAIL uses a "magnet." It knows that the sentence "The boy blows out the candles" is magnetically attracted to the visual pixels of fire and a boy's face.
- Instead of guessing, SAIL looks at the video and the caption simultaneously.
- It asks: "Which part of this video feels most like this sentence?"
- It then creates a "mask" (a spotlight) that shines brightly on the candle-blowing moment and dims everything else.
- The Result: The computer learns to highlight the right moments because it's constantly checking, "Does this video clip match the meaning of this sentence?"
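The "spotlight mask" idea can be sketched in a few lines. This is a toy illustration of the general principle, not SAIL's actual implementation: given per-frame video features and a sentence embedding (here just random vectors, with a "matching" moment injected by hand), compute a cosine similarity per frame and sharpen it into a soft mask that peaks where the video most resembles the sentence.

```python
import numpy as np

def similarity_mask(frame_feats, sent_feat, temperature=0.1):
    """Toy version of a similarity-aware mask.

    frame_feats: (T, D) array, one feature vector per frame
    sent_feat:   (D,) sentence embedding
    Returns a (T,) soft mask that sums to 1 and is largest
    on the frames most similar to the sentence.
    """
    # Cosine similarity between each frame and the sentence.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    s = sent_feat / np.linalg.norm(sent_feat)
    sims = f @ s                        # (T,) one score per frame
    # Sharpen and normalize into a "spotlight".
    w = np.exp(sims / temperature)
    return w / w.sum()

# Toy example: frames 3-5 are built to resemble the sentence.
rng = np.random.default_rng(0)
T, D = 10, 16
sent = rng.normal(size=D)
frames = rng.normal(size=(T, D))
frames[3:6] += 3.0 * sent               # inject the "matching" moment
mask = similarity_mask(frames, sent)
print(mask.argmax())                    # lands somewhere in frames 3-5
```

The low temperature is what makes the mask behave like a spotlight rather than a dim, even glow: it exaggerates the gap between matching and non-matching frames.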
2. The "Imaginative Assistant" (LLM-Based Augmentation)
The Problem: Even with the magnet, the computer sometimes gets stuck. Why? Because the training data is "sparse."
Imagine a 10-minute video of a cooking show, but the human only wrote down two sentences: "He chops the onions" and "He stirs the pot."
The computer sees a huge gap between those two sentences. It doesn't know what happened in between. Did he taste the soup? Did he cry from the onions? Did he drop the knife? Without knowing, it guesses poorly.
The SAIL Solution: SAIL brings in a creative AI assistant (a Large Language Model, or LLM) to fill in the blanks.
- SAIL shows the AI assistant the two existing sentences: "He chops the onions" and "He stirs the pot."
- It asks the assistant: "What is a logical, plausible thing that happens between these two?"
- The AI assistant invents a new sentence: "He wipes the tears from his eyes."
- SAIL then treats this invented sentence as a "ghost clue." It tells the computer: "Hey, there's probably a 'wiping tears' moment in the video between the chopping and the stirring. Go find it!"
This turns a sparse, 2-sentence manual into a dense, detailed guide, helping the computer learn to spot tiny, specific events it would have otherwise missed.
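The gap-filling loop can be sketched as follows. Everything here is an illustrative assumption rather than SAIL's actual pipeline: the prompt wording, the `call_llm` stand-in, and the choice to place each invented "ghost clue" at the midpoint between its two anchors are all made up for the example.

```python
def build_gap_prompt(caption_a, caption_b):
    """Build a prompt asking an LLM to invent a plausible in-between event."""
    return (
        "Two consecutive events in a video are described as:\n"
        f"1. {caption_a}\n"
        f"2. {caption_b}\n"
        "Write one short sentence describing a plausible event "
        "that happens between them."
    )

def augment_captions(captions, call_llm):
    """Insert one LLM-invented 'ghost clue' into each gap.

    captions: list of (time_hint_s, sentence) pairs, sorted by time.
    call_llm: any function mapping a prompt string to a sentence;
              in practice this would query a real LLM.
    """
    dense = []
    for (t_a, cap_a), (t_b, cap_b) in zip(captions, captions[1:]):
        dense.append((t_a, cap_a))
        ghost = call_llm(build_gap_prompt(cap_a, cap_b))
        # Park the invented event between its two anchors; training
        # then has to figure out where it really occurs.
        dense.append(((t_a + t_b) / 2, ghost))
    dense.append(captions[-1])
    return dense

# Usage with a canned stand-in for the LLM:
fake_llm = lambda prompt: "He wipes the tears from his eyes."
sparse = [(30, "He chops the onions."), (300, "He stirs the pot.")]
print(augment_captions(sparse, fake_llm))
```

Running this turns the two-sentence manual into a three-sentence one, with the invented event slotted between chopping and stirring.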
The Grand Finale
By combining the Magnet (making sure the video parts match the meaning of the words) and the Imaginative Assistant (filling in the missing gaps with smart guesses), SAIL becomes a master editor.
- It doesn't just slice the video evenly. It slices it exactly where the action happens.
- It doesn't just guess. It uses the "magnet" to pull the right visual features to the right words.
- It doesn't get lost in the gaps. It uses the AI assistant to imagine what happened in the silence.
In tests, this approach beat all previous methods, creating video summaries that are not only more accurate about when events happen but also much better at describing what is happening. It's like upgrading from a robot that blindly cuts film to a human director who actually understands the story.