Imagine you are the moderator of a massive, chaotic global town square where 170 million people speak Urdu. This town square is the internet: full of great conversations, but also full of people shouting insults, spreading hate, and trying to start fights.
For a long time, the security guards (the AI systems) trying to keep this place safe had a big problem. They could look at a whole speech and say, "This whole speech is bad, delete it!" But they couldn't tell you exactly which words were the problem. It's like a guard saying, "This entire book is dangerous," and throwing the whole library away, even though only one paragraph was actually mean.
This paper introduces two new tools to fix that: URTOX and MUTEX. Think of them as a new, high-tech flashlight and a pair of smart glasses for the security guards.
1. The Problem: The "Blurry Lens"
Urdu is a beautiful but tricky language. It's like a rich tapestry with many layers:
- Code-Switching: People often mix Urdu and English in the same sentence (like saying "Tum stupid ho," where the English insult sits inside an Urdu sentence meaning "You are stupid").
- Scripts: People write it in the traditional flowing script (Nastaliq) or type it out using English letters (Roman Urdu, like "tu bewakoof hai").
- Morphology: Words change shape a lot depending on how they are used.
Old systems were like a blurry camera. They could see a "toxic blob" but couldn't pinpoint the specific words causing the trouble. This made it hard to be fair. If you delete a whole comment because of one bad word, you might silence a harmless joke or a legitimate complaint.
2. The Solution: URTOX (The Map)
First, the researchers needed a map. They created URTOX.
- What is it? It's a giant, hand-drawn map of 14,342 real-life examples of toxic and non-toxic Urdu text.
- How was it made? Humans (not robots) read every single sentence and put a "sticky note" on the exact words that were toxic. They used a system called BIO tagging (Begin, Inside, Outside).
- Analogy: Imagine a sentence is a train. The "B" (Begin) note goes on the first toxic car, the "I" (Inside) notes go on the rest of the toxic cars, and "O" (Outside) notes go on the safe cars.
- Why it matters: Before this, no one had a map this detailed for Urdu. It's the "training manual" for the new AI.
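The train analogy maps directly onto BIO tags. Here is a minimal sketch of what one annotated sentence might look like; the sentence and the exact label names (`B-TOX`, `I-TOX`) are invented for illustration, not copied from the URTOX dataset:

```python
# BIO tagging illustrated on an invented Roman Urdu sentence.
# "tum bure insaan ho" roughly means "you are a bad person";
# the toxic span is "bure insaan" ("bad person").
tokens = ["tum", "bure",  "insaan", "ho"]
tags   = ["O",   "B-TOX", "I-TOX",  "O"]  # label names are assumptions

# One tag per token: B = first toxic "train car", I = the rest, O = safe.
for token, tag in zip(tokens, tags):
    print(f"{token:8s} {tag}")
```

Every token gets exactly one sticky note, which is what lets a model later learn to pinpoint spans instead of judging whole comments.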
3. The Engine: MUTEX (The Smart Glasses)
Once they had the map, they built MUTEX, the first system that can actually see the toxic words.
- How it works: MUTEX is like a pair of smart glasses that reads the text and highlights the bad words in red, while leaving the good words alone.
- The Secret Sauce: It uses a powerful brain (a Transformer model called XLM-RoBERTa) that understands context, combined with a rule-checker (a CRF, or Conditional Random Field).
- Analogy: The Transformer is like a genius who understands the meaning of the sentence. The CRF is like a strict editor who says, "Wait, if you call this word 'bad,' the next word must also be 'bad' if they are part of the same insult." This prevents the AI from getting confused and labeling random words as toxic.
- The Result: It achieved a 60% success rate in finding the exact toxic words. While this isn't perfect (humans are still better), it is the first time anyone has done this for Urdu at this level of detail.
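The "strict editor" rule the CRF enforces can be shown as a validity check on tag sequences: an Inside tag only makes sense after a Begin or another Inside tag. This toy check is not the actual MUTEX code (a real CRF learns soft transition scores rather than hard rules), and the tag names are assumptions:

```python
def is_valid_bio(tags):
    """Reject sequences where an Inside tag appears without a
    preceding Begin/Inside tag -- the CRF's 'strict editor' rule."""
    prev = "O"
    for tag in tags:
        if tag == "I-TOX" and prev not in ("B-TOX", "I-TOX"):
            return False
        prev = tag
    return True

print(is_valid_bio(["O", "B-TOX", "I-TOX", "O"]))  # True
print(is_valid_bio(["O", "I-TOX", "O"]))           # False: I without B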
4. Why "Explainable" Matters
The coolest part of MUTEX is that it doesn't just say "Delete this." It explains why.
- The Flashlight: If the AI flags a comment, it can point to the specific words and say, "I flagged this because of the word 'stupid' and the phrase 'bad person'."
- Trust: This helps human moderators trust the AI. Instead of a black box making mysterious decisions, the AI shows its work, like a student showing their math homework.
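"Showing its work" amounts to turning the BIO tags back into human-readable highlighted phrases. A hypothetical sketch of that last step (the helper name and example sentence are made up):

```python
def extract_toxic_spans(tokens, tags):
    """Group runs of B-TOX/I-TOX tags back into readable phrases."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-TOX":
            if current:                 # close any span already open
                spans.append(" ".join(current))
            current = [token]           # start a new span
        elif tag == "I-TOX" and current:
            current.append(token)       # extend the open span
        else:
            if current:
                spans.append(" ".join(current))
                current = []
    if current:                         # span running to end of sentence
        spans.append(" ".join(current))
    return spans

tokens = ["you", "are", "a", "stupid", "and", "bad",   "person"]
tags   = ["O",   "O",   "O", "B-TOX",  "O",   "B-TOX", "I-TOX"]
print(extract_toxic_spans(tokens, tags))  # ['stupid', 'bad person']
```

The output is exactly the "flashlight" evidence a human moderator sees: the flagged phrases, nothing more.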
5. The Challenges They Faced
The researchers had to fight some tough battles:
- The "Roman" Problem: About 18% of people type Urdu using English letters. The AI had to learn that "badtameez" (Roman) and "بدتمیز" (Nastaliq) mean the same thing.
- The "Mix" Problem: When people switch between Urdu and English mid-sentence, it confuses older models. MUTEX learned to handle this "code-switching" much better.
- The "Sarcasm" Problem: Sometimes people say "Great job!" when they mean "You failed!" The AI still struggles a bit with this kind of hidden toxicity, but it's getting better.
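One simple way to picture the Roman-vs-Nastaliq bridge is normalizing both scripts into a shared form before tagging. The lookup table below is a toy illustration only; it is not the paper's method (a real system would rely on a full Roman-Urdu normalizer or shared subword embeddings rather than a hand-written dictionary):

```python
# Toy transliteration table (illustrative assumption, not from the paper).
ROMAN_TO_NASTALIQ = {
    "badtameez": "بدتمیز",  # "rude / ill-mannered"
    "bewakoof":  "بیوقوف",  # "stupid"
}

def normalize(token):
    """Map a known Roman-Urdu token to its Nastaliq form; pass
    everything else (English words, unknown spellings) through."""
    return ROMAN_TO_NASTALIQ.get(token.lower(), token)

print(normalize("Badtameez"))  # بدتمیز
print(normalize("hello"))      # hello (unchanged)
```

Once both scripts collapse to one representation, the model no longer has to learn every insult twice.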
The Big Picture
This paper is a huge leap forward. Before, we were trying to catch a needle in a haystack by throwing away the whole haystack. Now, with URTOX (the map) and MUTEX (the smart glasses), we can find the needle and leave the hay alone.
It proves that even for languages that don't have as much digital data as English (called "low-resource" languages), we can build smart, fair, and understandable tools to keep our online communities safe. It's a step toward a digital world where everyone, regardless of their language, can be heard without fear of abuse.