Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios

This paper addresses the limitations of current risk-oriented methods for constructing multimodal safety datasets. It proposes an image-oriented, self-adaptive pipeline that automatically generates a 35K-sample real-world safety dataset, together with a standardized evaluation metric that validates the dataset's effectiveness across a variety of tasks.

Jingen Qu, Lijun Li, Bo Zhang, Yichen Yan, Jing Shao

Published 2026-02-27

Imagine you are teaching a robot how to be a good guardian. You want it to spot danger before it happens. But here's the problem: the robot has been trained mostly on cartoons of danger.

If you show the robot a picture of a cartoon fire and ask, "Is this safe?" it says, "No, fire is bad!" It's easy to spot danger when it's screaming at you from a comic book.

But in the real world, danger is quiet. It's a safe-looking picture of a library paired with a safe-looking sentence like, "I want to start a fire to keep warm." Alone, the library is fine. Alone, the sentence is fine. But put them together? Disaster. The robot, trained only on obvious cartoons, misses the trap completely.

This paper introduces a new way to build a "training gym" for these AI guardians, called RMS (Real-World Multimodal Safety). Here is how they did it, explained simply:

1. The Problem: The "Cartoon" Trap

Current safety tests are like a driving test where the instructor only puts a giant red "STOP" sign on the road. The AI learns to stop for red signs. But in real life, the danger is a slippery patch of ice that looks like normal pavement, or a car that looks like a toy but is actually speeding.

  • Old Way: Create fake, obvious dangers (synthetic images) and tell the AI, "This is bad."
  • The Flaw: The AI gets good at spotting the fake signs but fails when the danger is hidden in a normal, everyday scene.

2. The Solution: The "Lego" Approach (Image-Oriented)

Instead of starting with a scary story and trying to find a picture to match it, the authors started with real-world photos (like a picture of a cliff, a library, or a kitchen) and asked a smart AI assistant: "What if someone said something harmless here that would actually be dangerous?"

Think of it like a Lego set:

  • Piece A (The Image): A picture of a high cliff. (Safe on its own).
  • Piece B (The Text): A sentence saying, "I want to jump." (Safe on its own).
  • The Combination: When you snap them together, you get a suicide risk.

The authors built an automated pipeline that snaps these "safe" pieces together at scale to find the hidden "dangerous combinations." They call this Information Complementarity. It's like realizing that a "knife" is safe, and "cooking" is safe, but "stabbing" is not. The danger comes from the mix. A rough sketch of one such generation step follows below.
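To make the "Lego" idea a bit more concrete, here is a minimal sketch of what one step of such an image-oriented pipeline could look like in Python. The prompt wording, the `build_trap_sample` function, and the `query_vlm` helper are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of one image-oriented generation step.
# `query_vlm` stands in for any vision-language model call you have available.

GENERATION_PROMPT = (
    "Look at this real-world photo. Write a short user sentence that is "
    "harmless on its own but becomes unsafe when combined with the scene "
    "(information complementarity). Also name the hidden risk."
)

def build_trap_sample(image_path: str, query_vlm) -> dict:
    """Turn one safe real-world photo into a safe-text / hidden-risk pair."""
    reply = query_vlm(image=image_path, prompt=GENERATION_PROMPT)
    return {
        "image": image_path,        # safe on its own (e.g., a high cliff)
        "text": reply["sentence"],  # safe on its own (e.g., "I want to jump.")
        "risk": reply["risk"],      # emerges only from the combination
    }
```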

3. The Result: A Massive "Trap" Dataset

They used this method to build a dataset of 35,000 scenarios.

  • The Images: All taken from the real world (no cartoons).
  • The Text: All harmless sentences.
  • The Trap: The combination creates a hidden risk (like falling, fire, or self-harm).

They also created two types of "answers" for the AI to learn from (a rough record layout follows this list):

  1. The "Bad" Answer: An AI that ignores the trap and says, "Great idea! Go jump!" (This teaches the AI what not to do).
  2. The "Good" Answer: An AI that spots the trap and says, "Wait, that cliff is dangerous! Let's find a safer way." (This teaches the AI what to do).

4. The New Scorecard

The authors realized we didn't have a good way to measure whether a safety dataset was actually good, so they invented a new test (a code sketch follows the list below):

  • The "Teacher" Test: Take a safety dataset, use it to train a "Safety Judge" AI, and then see how well that Judge performs on other safety tests.
  • The Result: The AI trained on their new "Real-World" dataset became a much better judge than those trained on old "Cartoon" datasets. It learned to see the ice on the road, not just the red stop signs.
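In code, the "Teacher" test boils down to a short loop: fine-tune a judge on a candidate safety dataset, then measure how well that judge performs on other benchmarks. The sketch below only shows the shape of that loop; `finetune_judge` and `evaluate` are hypothetical placeholders, and the paper's actual metric may aggregate scores differently.

```python
# Minimal sketch of the "Teacher" test: a dataset is rated by how good a
# Safety Judge it produces on *other* benchmarks.

def teacher_test(base_model, safety_dataset, benchmarks,
                 finetune_judge, evaluate):
    """Rate a safety dataset by how well the judge it trains generalizes."""
    judge = finetune_judge(base_model, safety_dataset)   # train a Safety Judge
    scores = {name: evaluate(judge, bench)               # test on unseen suites
              for name, bench in benchmarks.items()}
    return sum(scores.values()) / len(scores), scores    # average + breakdown
```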

5. The Big Discovery

When the authors tested today's well-known AI models (like GPT-4o and Gemini) on this new dataset, the models failed badly.

  • Most models couldn't see the danger. They saw a picture of a cliff and a text about "jumping" and thought, "Oh, a fun park!" or "A great adventure!"
  • They only realized it was dangerous when the text was explicitly screaming "I want to kill myself." They missed the subtle, real-world risks.

Summary

This paper is like building a driving simulator that doesn't just show you red lights and stop signs. Instead, it shows you a rainy day, a slippery road, and a distracted driver, and asks, "What happens if you don't slow down?"

By training AI on these subtle, real-world combinations of safe images and safe text, the authors hope to build guardians that can actually protect us in the messy, complex real world, not just in a cartoon.
