BLM-Guard: Explainable Multimodal Ad Moderation with Chain-of-Thought and Policy-Aligned Rewards

Imagine you are the head of a massive, 24/7 digital carnival. Thousands of vendors (advertisers) are setting up booths every second, trying to sell everything from magic beans to miracle cures. Some are honest, but many are trying to trick the crowd with loud music, flashy lights, and lies hidden in the fine print.

Your job is to stop the scams before they hurt anyone. But here's the problem: the scammers are getting smarter. They don't just shout "I'm a liar!" anymore. They might show a picture of a healthy person while the audio says, "Drink this tea and you'll live forever," or they might use a cute puppy to distract you from a shady deal.

BLM-Guard is the new, super-smart security team you hired to solve this. Here is how it works, broken down into simple parts:

1. The Problem: Why Old Guards Fail

Previous security guards were like bouncers with a simple checklist: "Is there violence? Is there nudity?" If the answer was no, they let the ad in.
But for ads, that's not enough. An ad can be perfectly safe-looking but still be a scam.

The Trap: A video shows a phone (visual) but the voiceover says, "It's free!" (audio). The guard sees a phone and thinks, "Safe!" but misses the lie in the voice.
The Issue: Old systems couldn't connect the dots between what you see and what you hear. They also couldn't understand the specific, complicated rules of the carnival (the platform's policies).

2. The Solution: BLM-Guard's "Three-Step Brain"

BLM-Guard isn't just a filter; it's a detective that thinks before it acts. It uses three main tricks:

Trick A: The "Detective's Notebook" (Chain-of-Thought)

Instead of just guessing "Safe" or "Unsafe," BLM-Guard is forced to write down its thoughts, like a detective solving a mystery.

Old Guard: "See phone. See text. Result: Safe."
BLM-Guard: "Wait, the video shows a phone, but the voice says 'free'. That doesn't match. Also, the text says 'guaranteed profit,' which is against the rules. Therefore, this is a scam."
This "Chain-of-Thought" (CoT) forces the AI to explain why it made a decision, making it much harder to trick.

Trick B: The "Rulebook Tutor" (Rule-Guided Training)

Before the AI starts working, we don't just throw it into the deep end. We give it a crash course using a "Rulebook Tutor."

We take thousands of ads and have the AI practice spotting specific violations (like "exaggerated income claims" or "feudal superstition").
We teach it to look for Key Frames (the three most telling frames of a video) and Key Regions (the specific part of the screen where the lie is happening), ignoring the boring parts.
This is like giving the security guard a map of exactly where the scammers usually hide their tricks.

Trick C: The "Strict Coach" (Reinforcement Learning)

Once the AI has learned the basics, it goes into a training camp with a "Strict Coach" (a reward system).

The Game: The AI tries to moderate an ad.
The Score: If it gets the verdict right and explains it clearly according to the rules, it gets a high score. If it guesses wrong or gives a vague excuse, it gets a low score.
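The Twist: The coach is "Adaptive." Instead of grading against one fixed checklist, a second "guide" model builds its scoring criteria for each case (for example, how clearly the explanation links cause to risk), so the grading can follow the carnival rules when they change (e.g., "No more claims about weight loss"). The paper calls this the Self-Consistency and Adaptive Reward (SCA-R): the AI is rewarded for reasoning that stays consistent with the current rules, not just old ones.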

3. The Result: A Smarter, Fairer Carnival

The paper tested BLM-Guard against other top AI models.

Accuracy: It caught far more scams than the others (91.4% strict accuracy versus roughly 70% for the next-best model on the paper's benchmark), especially the tricky ones where the video and audio didn't match.
Explainability: Because it writes down its reasoning, humans can look at its "Detective's Notebook" and say, "Ah, I see why you flagged that."
Generalization: It didn't just memorize the training ads; it learned the logic of scams, so it could spot new types of tricks it had never seen before.

The Big Picture Analogy

Think of previous ad moderators as metal detectors at an airport. They beep if they see metal (violence/nudity), but they can't tell if a metal spoon is being used to steal a wallet (a subtle scam).

BLM-Guard is like a highly trained security agent who:

Reads the manual (Policy Rules).
Watches the whole scene (Visuals + Audio + Text).
Talks through their logic ("The person is smiling, but the text says 'pay now'... that's suspicious").
Learns from every mistake (Reinforcement Learning).

By combining these skills, BLM-Guard ensures that the digital carnival stays fun and safe, catching the clever scammers that others miss.

1. Problem Definition

The rapid proliferation of short-video platforms (e.g., TikTok, Instagram Reels) has led to an explosion of multimodal advertising content. Unlike general community safety moderation (which focuses on coarse risks like violence or nudity), commercial ad moderation requires:

Fine-grained, policy-driven compliance: Detecting subtle violations such as exaggerated claims, misleading cues, or rule evasion.
Cross-modal consistency: Identifying mismatches between modalities (e.g., benign text with provocative visuals, or deceptive audio contradicting visual truth).
Explainability: Providing structured reasoning traces to justify decisions, which is crucial for regulatory compliance and user trust.

Existing approaches, such as static rule-based filters or general-purpose Vision-Language Models (VLMs), struggle due to limited cross-modal causal reasoning, poor adaptability to shifting policies, and a lack of task-specific reasoning for nuanced commercial risks.

2. Methodology: BLM-Guard Framework

BLM-Guard is a two-stage training pipeline designed to fuse rule-based principles with deep learning capabilities. It consists of a Rule-Guided Cold Start (SFT) and Self-Adaptive Reinforcement Learning (RL).

A. Data Construction & Benchmark

The authors introduced the BLM-Guard Benchmark, a real-world dataset of short-video ads annotated with a three-level hierarchical taxonomy:

Severity: High, Medium, Low.
Scenario: e.g., Illegal content, False marketing, Misleading operations.
Violation Type: e.g., Income exaggeration, Privacy leak, Feudal superstition.
The dataset includes structured reasoning traces to support supervised learning and reward modeling.
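
To make the three-level taxonomy concrete, here is a minimal sketch of how one annotation might be represented. The class and field names (`Severity`, `AdLabel`, `path`) are hypothetical and not taken from the benchmark, and pairing "High" with "False marketing" is only an illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    """Top level of the taxonomy (levels named in the summary above)."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass(frozen=True)
class AdLabel:
    """One annotation: severity -> scenario -> violation type.

    Hypothetical container; the benchmark's real schema may differ.
    """
    severity: Severity
    scenario: str         # e.g. "False marketing"
    violation_type: str   # e.g. "Income exaggeration"
    reasoning: str = ""   # structured reasoning trace, kept as free text here

    def path(self) -> str:
        """Return the label as a slash-separated hierarchy path."""
        return f"{self.severity.value}/{self.scenario}/{self.violation_type}"


# Illustrative pairing only; not a claim about how the dataset assigns severity.
label = AdLabel(Severity.HIGH, "False marketing", "Income exaggeration")
print(label.path())
```

Keeping the three levels as separate fields lets an evaluator score partial matches (right scenario, wrong violation type) instead of treating the label as a single string.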

B. Stage 1: Rule-Guided Cold Start (SFT)

To initialize the model with policy priors and reasoning skills, the authors propose an Interleaved-modal Chain-of-Thought (ICoT) pipeline:

Keyframe & Region Selection: Uses a hybrid strategy (BIN+TOP) with CLIP-based prompt similarity to select 3 keyframes and salient visual regions representing high-risk cues.
ICoT Generation: A frozen VLM (InternVL-3-78B) generates structured reasoning traces in four steps:
1. Observation: Describe visuals and ASR transcripts; check modality consistency.
2. Risk Screening: Identify potential policy violations.
3. Causal Analysis: Analyze underlying causes.
4. Final Verdict: Integrate reasoning to reach a compliance decision.
Training Objective: Supervised Fine-Tuning (SFT) using a composite loss:
$L = L_{CE}(\text{answer}) + \lambda \cdot KL(p_{think} \parallel p_{rule})$
The KL divergence term aligns the model's reasoning distribution with a rule-guided prior (constructed from keyword sets of violation types), ensuring the reasoning process is anchored to policy rules.
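
The composite loss above can be sketched numerically as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `p_think` and `p_rule` are treated as small discrete distributions over violation types, and the weight `lam=0.1` is an arbitrary placeholder for $\lambda$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(answer_logits, target_index):
    """L_CE(answer): negative log-probability of the correct answer class."""
    return -math.log(softmax(answer_logits)[target_index])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log(max(pi, eps) / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def sft_loss(answer_logits, target_index, p_think, p_rule, lam=0.1):
    """L = L_CE(answer) + lam * KL(p_think || p_rule).

    Assumption: p_think and p_rule are distributions over the same set of
    violation types; lam is a hypothetical value for the weight lambda.
    """
    return cross_entropy(answer_logits, target_index) + lam * kl_divergence(p_think, p_rule)


# Toy example: three violation types, the correct answer is class 0.
loss = sft_loss(
    answer_logits=[2.0, 0.5, -1.0],
    target_index=0,
    p_think=[0.7, 0.2, 0.1],
    p_rule=[0.6, 0.3, 0.1],
)
print(round(loss, 4))
```

When the reasoning distribution already matches the rule prior, the KL term vanishes and the loss reduces to plain cross-entropy on the answer.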

C. Stage 2: Self-Adaptive Reinforcement Learning (RL)

To refine the model's behavior against evolving risk patterns, the authors employ Group Relative Policy Optimization (GRPO) with a hybrid reward mechanism:

Data Curation: Uses rejection sampling to identify "hard" samples and safety-aware concatenation to simulate complex decision contexts.
Hybrid Reward Design ($r = r_{rule} + r_{format} + r_{scaR}$):
1. Rule-Based Reward: Discrete score based on matching ground-truth violation scenes and types.
2. Format-Aware Reward: Binary score ensuring the output follows the required <answer> and <thought> structure.
3. SCA-R (Self-Consistency and Adaptive Reward): A guide model acts as a critic, dynamically constructing scoring principles (e.g., causal clarity, risk attribution) to evaluate the reasoning trace. This addresses policy drift and ensures alignment with moderation principles.
Optimization: The GRPO algorithm is modified with token-level normalization (to mitigate length bias) and dynamic sampling (skipping batches with zero reward variance) to ensure stable training.
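
The reward and optimization steps above can be sketched as follows. Everything here is a simplified reading rather than the paper's code: the 0.5/0.5 split in `rule_reward`, the tag check in `format_reward`, and all function names are assumptions, and the SCA-R score is passed in as a plain number because the guide model itself is not reproduced.

```python
import re
import statistics


def format_reward(output: str) -> float:
    """Binary reward: exactly one <thought> block and one <answer> block."""
    thoughts = re.findall(r"<thought>.+?</thought>", output, re.DOTALL)
    answers = re.findall(r"<answer>.+?</answer>", output, re.DOTALL)
    return 1.0 if len(thoughts) == 1 and len(answers) == 1 else 0.0


def rule_reward(pred_scenario, pred_type, gold_scenario, gold_type) -> float:
    """Discrete reward for matching the ground truth.

    The equal 0.5 weights are a hypothetical choice for illustration.
    """
    return 0.5 * (pred_scenario == gold_scenario) + 0.5 * (pred_type == gold_type)


def hybrid_reward(r_rule: float, r_format: float, r_scar: float) -> float:
    """r = r_rule + r_format + r_scaR, as in the formula above."""
    return r_rule + r_format + r_scar


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / std within one sampled group.

    Returns None when every reward in the group is identical, which models
    dynamic sampling: such a group carries no learning signal and is skipped.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < eps:
        return None
    return [(r - mean) / std for r in rewards]


def token_level_mean(per_token_losses):
    """Average over all tokens in the group instead of per response,
    so long responses are not down-weighted (one reading of
    token-level normalization)."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


sample = "<thought>Audio claims 'free'; visuals show a price.</thought><answer>False marketing</answer>"
r = hybrid_reward(
    rule_reward("False marketing", "Price fraud", "False marketing", "Income exaggeration"),
    format_reward(sample),
    r_scar=0.8,  # placeholder for the guide model's score
)
print(r)
print(group_advantages([0.0, 2.0]))
print(group_advantages([1.0, 1.0, 1.0]))
```

Skipping zero-variance groups matters because their advantages would all be zero (or undefined after dividing by the standard deviation), so they would contribute no gradient while still consuming a training step.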

3. Key Contributions

BLM-Guard Benchmark: A real-world dataset for ad moderation featuring a hierarchical risk taxonomy (Severity, Scenario, Violation Type) and structured reasoning traces, enabling policy-grounded evaluation.
BLM-Guard Framework: A novel multimodal moderation architecture combining:
- ICoT: Interleaved-modal Chain-of-Thought for explainable, rule-anchored reasoning.
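- SCA-R: A Self-Consistency and Adaptive Reward, in which a guide model scores reasoning traces against dynamically constructed principles so the reward can follow policy shifts.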
- Multi-task Modeling: Simultaneously models intra-modal manipulations (e.g., exaggerated imagery) and cross-modal mismatches (e.g., subtitle-audio drift).
Two-Stage Training Pipeline: A robust strategy moving from rule-guided SFT to adaptive GRPO-based RL, significantly improving generalization and policy alignment.

4. Experimental Results

The model was evaluated on the BLM-Guard Benchmark and five public datasets (LSPD, XD-Violence, UCF-Crime, FakeSV, FVC).

Performance on Benchmark: BLM-Guard significantly outperformed state-of-the-art baselines (including Qwen2.5-VL, InternVL3, LLaVA-Next, and specialized guard models like LlavaGuard).
- Strict Accuracy: Achieved 91.4% (vs. ~70% for the next best baseline).
- Reasoning Consistency: Achieved 0.845 (vs. ~0.66 for baselines), indicating highly interpretable and policy-aligned reasoning.
- Severity Classification: Showed superior performance across High, Medium, and Low risk levels.
Generalization: The model demonstrated strong transferability to public datasets, particularly excelling in misinformation detection (FakeSV/FVC) where prior models struggled.
Ablation Studies:
- Rule-SFT proved superior to single-stage SFT, improving both accuracy and interpretability.
- RL with SCA-R provided the final boost, achieving the best balance of factual precision and policy alignment.

5. Significance

BLM-Guard addresses a critical gap in the AI safety landscape: the lack of fine-grained, explainable, and policy-adaptive moderation for commercial content.

Regulatory Compliance: By providing structured reasoning traces, it offers transparency required for regulatory audits.
Adaptability: The self-adaptive reward mechanism allows the system to handle evolving platform policies without retraining from scratch.
Cost Efficiency: The rule-driven ICoT data synthesis pipeline reduces the cost of manual annotation while generating high-quality training data.
Robustness: The multi-task architecture effectively handles complex, deceptive advertising tactics that combine visual, textual, and audio modalities.

In conclusion, BLM-Guard sets a new standard for commercial ad moderation by successfully integrating structured reasoning, rule-based supervision, and adaptive reinforcement learning.