Here is an explanation of the SarcasmMiner paper, translated into simple, everyday language using analogies.
The Big Problem: AI is Bad at "Reading the Room"
Imagine you are at a party. Someone says, "Oh, great, another meeting," but they are rolling their eyes, slumping their shoulders, and speaking in a bored, flat tone. You immediately know they are being sarcastic.
Now, imagine an AI trying to figure that out.
- The Text: "Oh, great, another meeting." (Sounds positive!)
- The Audio: A bored, flat tone. (Sounds negative!)
- The Video: Eye rolls and slumped shoulders. (Looks negative!)
Current AI models are like a student who only reads the text. They see the word "Great" and think, "This is a happy sentence!" They miss the eye rolls and the tone. Even worse, if you force them to explain why they think it's sarcasm, they might lie. They might say, "The person is smiling," even though the video clearly shows them frowning. This is called hallucination—making up evidence to fit a guess.
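The mismatch above can be made concrete: sarcasm often shows up as a conflict between what the words say and what the delivery shows. Here is a toy sketch of that idea (the sentiment scores and function name are invented for illustration, not part of the paper):

```python
# Toy illustration of the text-vs-delivery conflict described above.
# Sentiment scores run from -1 (negative) to +1 (positive); the values
# below are made up, not the output of any real model.

def conflict_signal(text_sentiment, audio_sentiment, video_sentiment):
    """Flag a clip as a sarcasm candidate when the words say one thing
    and the voice/face say the opposite."""
    delivery = (audio_sentiment + video_sentiment) / 2
    # Opposite signs between text and delivery = the party-goer's eye roll.
    return text_sentiment * delivery < 0

# "Great meeting" reads positive, but the tone and face are negative.
conflict_signal(text_sentiment=0.8, audio_sentiment=-0.6, video_sentiment=-0.7)  # True
```

A text-only model only ever sees the first argument, which is exactly why it misses the joke.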
The Solution: SarcasmMiner
The researchers built a new training system called SarcasmMiner. Think of it as a rigorous "boot camp" for an AI to teach it how to be a detective, not just a guesser.
Here is how it works, step-by-step:
1. The "Teacher" and the "Student" (Dual-Track Distillation)
Imagine a master detective (the Teacher) and a rookie cop (the Student).
- The Teacher looks at thousands of video clips and writes out detailed reports on why something is sarcastic.
- The Problem: Sometimes the Teacher makes mistakes or writes confusing reports.
- The SarcasmMiner Strategy:
- Track A (The Good Stuff): The Student only copies the perfect reports from the Teacher to learn the basics.
- Track B (The Bad Stuff): The Student also studies the bad reports (where the Teacher made mistakes or lied about the evidence). But instead of copying them, the Student uses them to train a Referee (a "Reward Model").
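The two-track split above can be sketched in a few lines. This is a minimal illustration with hypothetical field names (the paper's actual data format and filtering criteria are not shown here):

```python
# Sketch of the dual-track split described above. Field names like
# "evidence_verified" and "verified_counterpart" are assumptions made
# for this illustration.

def split_tracks(teacher_reports):
    """Route each Teacher report to one of two training tracks.

    Track A: verified reports -> the Student imitates them directly.
    Track B: flawed reports, paired with a verified counterpart ->
             preference pairs for training the Referee (reward model).
    """
    sft_examples = []      # Track A: the good stuff
    preference_pairs = []  # Track B: (good, bad) pairs for the Referee

    for report in teacher_reports:
        if report.get("evidence_verified") and report.get("label_correct"):
            sft_examples.append(report)
        else:
            # Keep a bad report only if a verified counterpart exists,
            # so the Referee can learn to tell them apart.
            good = report.get("verified_counterpart")
            if good is not None:
                preference_pairs.append({"chosen": good, "rejected": report})

    return sft_examples, preference_pairs
```

The key design choice: bad reports are never thrown away or imitated; they become negative examples that teach the Referee what a lie looks like.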
2. The "Referee" (Generative Reward Model)
This is the secret sauce. During normal training, the AI only gets a grade on its final answer: "Right" or "Wrong."
SarcasmMiner adds a Referee that checks the logic.
- If the AI says, "This is sarcasm because the person is rolling their eyes," the Referee checks the video. If the eyes are actually wide open, the Referee gives a failing grade, even if the AI guessed "sarcasm" correctly.
- If the AI says, "This is sarcasm because the person is rolling their eyes," and the video does show eye rolls, the Referee gives a gold star.
This teaches the AI: "Don't just get the answer right; prove it with real evidence."
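The Referee's scoring rule from the two cases above can be sketched like this. The cue names and reward values are assumptions for illustration, not the paper's exact reward formulation:

```python
# Illustrative Referee scoring rule: grade BOTH the answer and the evidence.
# Cue names and the numeric rewards are made up for this sketch.

def referee_score(predicted_label, true_label, claimed_evidence, observed_cues):
    """Score a response on correctness AND evidence grounding.

    claimed_evidence: cues the model says it saw, e.g. {"eye_roll"}.
    observed_cues:    cues actually present in the clip.
    """
    hallucinated = claimed_evidence - observed_cues  # cues the model invented

    if hallucinated:
        return 0.0  # failing grade, even if the label happened to be right
    if predicted_label == true_label:
        return 1.0  # gold star: right answer, backed by real evidence
    return 0.2      # honest evidence, wrong conclusion: partial credit
```

So a model that guesses "sarcasm" correctly but cites a smile that isn't in the video still gets zero, which is exactly the behavior the Referee is there to punish.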
3. The "Game of Reinforcement" (GRPO)
Finally, the Student plays a game to get better.
- For each clip, the AI generates 8 different attempts at the answer (a "group").
- The Referee scores each attempt based on two things:
- Did you get the right answer?
- Did you make up any fake evidence (hallucinations)?
- The AI learns to keep the strategies that get high scores and throw away the ones that lie.
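The scoring step of that game can be sketched as a group-relative advantage computation, which is the core idea of GRPO: each attempt is rewarded for how much better it scored than its 7 siblings. The reward values below are invented for illustration:

```python
# Sketch of GRPO-style scoring: 8 attempts at the same clip, each graded
# by the Referee, then compared against the group average. The scores
# below are made up (1.0 = right + honest, 0.0 = hallucinated evidence).

def group_advantages(scores):
    """How much better (or worse) each attempt is than its siblings."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    std = var ** 0.5 or 1.0  # avoid dividing by zero when all scores tie
    return [(s - mean) / std for s in scores]

scores = [1.0, 0.0, 1.0, 0.2, 0.0, 1.0, 0.0, 0.2]
advantages = group_advantages(scores)
# Attempts above the group mean get a positive advantage (reinforced);
# attempts that lied or guessed wrong get a negative one (suppressed).
```

Because the advantage is relative to the group, the model doesn't need a perfect absolute reward scale; it just needs the honest attempts to consistently outscore the lying ones.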
The Results: From "Guessing" to "Detecting"
Before this training, the AI was like a nervous student guessing on a test.
- Zero-Shot (No training): 59% accuracy. (Basically guessing).
- Standard Training: 68% accuracy. (Better, but still makes up evidence).
- SarcasmMiner: 70%+ accuracy.
But the real win isn't just the score; it's the trust.
- Old AI: "This is sarcasm! (Even though I made up a fake smile to prove it)."
- SarcasmMiner: "This is sarcasm! (Because I saw the eye roll and heard the flat tone, and I didn't lie about it)."
The Takeaway
SarcasmMiner is like teaching an AI to stop "faking it till they make it." It forces the AI to look at the whole picture (words, voice, and face) and demands that it tell the truth about what it sees before it makes a judgment. It turns a "guessing machine" into a "reasoning detective."