MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations

This paper introduces MuSaG, the first German multimodal sarcasm dataset, featuring aligned text, audio, and video annotations from television shows. Benchmarking shows that current models struggle to match human reliance on audio cues, highlighting a critical gap for future multimodal sarcasm detection research.

Aaron Scott, Maike Züfle, Jan Niehues

Published 2026-03-05

Imagine you are at a dinner party. Someone says, "Oh, great weather we're having," while standing in a torrential downpour, looking miserable. You know immediately they are being sarcastic. You didn't just hear the words; you saw their rain-soaked clothes, heard the sigh in their voice, and saw the eye-roll. Your brain instantly combined all those clues to understand the real meaning.

Now, imagine trying to teach a robot to do the same thing. That is exactly what this paper, MuSaG, is about.

Here is the story of the paper, broken down into simple concepts:

1. The Problem: Robots Are "Text-Blind"

For a long time, computers have been great at reading text. If you type "I love this movie" into a computer, it thinks you are happy. But if a human says it with a groan and a face-palm, a human knows you hate it.

Current AI models are like people who can only read the menu but can't taste the food or see the chef's expression. They are terrible at catching sarcasm because they rely too much on the literal words and ignore the tone of voice (audio) and facial expressions (video).

2. The Solution: The "German Sarcasm Library" (MuSaG)

The researchers at the Karlsruhe Institute of Technology decided to build a special training library for AI. They call it MuSaG.

  • What is it? It's a collection of 33 minutes of clips from German TV shows (like talk shows and comedy).
  • Why German? Most AI training data is in English. The researchers wanted to see if this "sarcasm blindness" happens in other languages too.
  • The Magic Ingredient: They didn't just grab random clips. They manually picked them, then had humans label them four different ways:
    1. Just the text (transcript).
    2. Just the audio (sound only, no video).
    3. Just the video (muted video, no sound).
    4. And finally, all three together.

Think of this like a cooking class where the teacher gives students the recipe (text), the sound of the sizzling pan (audio), and a video of the chef's knife skills (video) separately, so they can learn which clue is most important.

3. The Experiment: Humans vs. Robots

The researchers took this new library and tested nine different AI models (some open-source, some proprietary models from big tech companies) against it, asking each the same question: "Is this sarcastic?"

Here is what they found, using a simple analogy:

  • How Humans Do It: When humans listen to a sarcastic comment, they rely heavily on the voice (the tone, the pitch, the pause). It's like listening to a song; the melody tells you if it's sad or happy, even if the lyrics are neutral.
    • Result: Humans were best at detecting sarcasm using audio.
  • How Robots Do It: The AI models were terrible at listening to the voice or watching the face. Instead, they acted like a robot reading a script. They ignored the tone and the eye-rolls and just looked at the words.
    • Result: The robots were best at detecting sarcasm using text.

The Big Gap: The robots are like a person trying to guess your mood by only reading your text messages, while you are standing right in front of them crying. They miss the most important clues.

4. The "Context" Trap

The researchers also tried to help the robots by giving them more background information (like the 15 seconds of conversation before the sarcastic comment).

You might think, "More context = smarter robot!"
Wrong.

It was like giving a detective a whole novel to read when they only needed to solve one specific clue. The extra information actually confused the robots. They got distracted by the surrounding chatter and started guessing wrong. It turns out, adding more noise made the task harder for them, not easier.

5. The Takeaway

The paper concludes with a clear message for the future of AI:

"We are building robots that are too good at reading and too bad at listening and watching."

To make AI that can truly understand human conversation (like a real friend), we need to stop teaching them just to read text. We need to teach them to listen to the tone and watch the face.

MuSaG is now being released to the public so other scientists can use it to build better, more "human-like" AI that doesn't miss the joke when someone says, "Oh, great," while it's raining.

Summary in One Sentence

The researchers built a German TV dataset to prove that while humans detect sarcasm by listening to voices and watching faces, current AI is still stuck reading the script, and we need to teach them to pay attention to the whole picture.