Not All Pretraining are Created Equal: Threshold Tuning and Class Weighting for Imbalanced Polarization Tasks in Low-Resource Settings

Abass Oguntade

Published 2026-03-26

Imagine social media as a giant, noisy town square where people are shouting their opinions. Sometimes, these shouts are just normal disagreements. But other times, the shouting turns into a toxic mob mentality where people hate "outsiders" and blindly support their own "tribe." This is polarization.

This paper is like a report card from a student (Abass) who tried to build a smart security guard (an AI) to patrol this town square. The guard's job is to spot the toxic shouting, figure out who they are attacking (politicians, a specific religion, a gender, etc.), and identify how they are attacking (using insults, dehumanizing words, or extreme language).

Here is the story of how the student built this guard, using simple analogies:

1. The Challenge: The "Rare Event" Problem

The biggest problem the student faced was that the town square is mostly calm. For every 100 posts, maybe only 30 are actually toxic. The rest are just normal talk.

  • The Analogy: Imagine trying to train a dog to find a specific rare flower in a massive field of grass. If you just show the dog the whole field, it will get lazy and just say "I see grass" every time because that's the most common thing.
  • The Fix: The student had to teach the AI to pay extra attention to the rare flowers (the toxic posts). They did this with class weighting: a penalty scheme that makes mistakes on the rare toxic posts cost more than mistakes on the common normal posts (see the sketch just after this list).
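Here is what that penalty could look like in practice. This is a minimal sketch in PyTorch, assuming a binary toxic/not-toxic setup and hypothetical label counts of roughly 30 toxic posts per 100; the paper's exact weighting scheme may differ.

```python
# A minimal class-weighting sketch, assuming PyTorch and hypothetical
# label counts; the paper's exact weighting scheme may differ.
import torch
import torch.nn as nn

# Roughly 30 toxic posts per 100: positives are the "rare flowers".
num_pos, num_neg = 300.0, 700.0
pos_weight = torch.tensor([num_neg / num_pos])  # ~2.33: toxic errors cost more

loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.tensor([[-0.2]])  # the model's raw score for one post
labels = torch.tensor([[1.0]])   # ground truth: this post is toxic
print(loss_fn(logits, labels))   # larger than the unweighted loss would be
```

With pos_weight above 1, every missed toxic post is penalized harder, nudging the model away from the lazy "I see grass" answer.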

2. The Tools: Choosing the Right "Brain"

The student had to choose which "brain" (AI model) to give the guard. They had two types of candidates:

  • The Local Experts: Models trained specifically on Swahili (the local language of the area).
  • The World Travelers: Models trained on many languages at once (English, Swahili, French, etc.), with broad coverage but no special focus on Swahili.

The Surprise: The student expected the "Local Experts" to win because they knew the local slang better. But the World Travelers (specifically one called mDeBERTa) actually won by a landslide!

  • The Analogy: It's like hiring a local guide who knows every shortcut but gets confused by the specific rules of this new game, versus hiring a world-class athlete who has played every sport in the world. The athlete adapted to the new game faster than the local guide. The paper found that having a "big brain" that understands many languages was more important than having a "small brain" that only knows one language perfectly (a code sketch of this setup follows below).
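For concreteness, here is how such a "world traveler" might be loaded for fine-tuning. This is a minimal sketch assuming the Hugging Face Transformers library and the microsoft/mdeberta-v3-base checkpoint; the number of labels is hypothetical and the paper's exact setup may differ.

```python
# A minimal sketch of loading a multilingual model for multi-label
# fine-tuning; the label count is an assumption, not the paper's
# confirmed configuration.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "microsoft/mdeberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name,
    num_labels=6,  # hypothetical: one output per toxicity type
    problem_type="multi_label_classification",  # sigmoid per label, not softmax
)

batch = tokenizer(["Mfano wa chapisho"], return_tensors="pt")  # "a sample post"
logits = model(**batch).logits  # one independent score per toxicity type
```

Setting the problem type to multi-label gives each toxicity type its own independent yes/no score, which is exactly what the threshold tuning in the next section operates on.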

3. The Secret Sauce: "Threshold Tuning"

This is the most important trick the student used.

  • The Analogy: Imagine the AI is a bouncer at a club. It has to decide: "Is this person toxic? Yes or No?"
    • Normally, the bouncer uses a fixed rule: "If I'm at least 50% sure you are toxic, you are out."
    • But because toxic posts are so rare, the bouncer was almost never 50% sure, so the bad guys kept walking right in.
    • The Fix: The student lowered the bouncer's bar. They said, "Hey, if you think someone is even 30% likely to be toxic, flag them." And they tuned this cutoff for every single type of toxicity separately (see the sketch after this list).
  • The Result: This simple tweak was like magic. It boosted the AI's performance by a huge margin (over 20 points!). It's like turning up the volume on a radio so you can finally hear the quiet music you were missing.
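The "adjust the bouncer's sensitivity" step can be written down directly. The sketch below assumes NumPy and scikit-learn, plus validation-set probabilities from the model; the search grid and function names are illustrative, not taken from the paper.

```python
# A minimal per-label threshold-tuning sketch; grid and names are
# illustrative assumptions, not the paper's exact procedure.
import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(probs, labels, grid=np.arange(0.05, 0.95, 0.05)):
    """probs, labels: arrays of shape (n_posts, n_toxicity_types)."""
    best = []
    for j in range(labels.shape[1]):  # one bouncer rule per toxicity type
        scores = [f1_score(labels[:, j], probs[:, j] >= t) for t in grid]
        best.append(grid[int(np.argmax(scores))])  # e.g. 0.30 instead of 0.50
    return np.array(best)

# At test time, flag each label whenever it clears its own tuned cutoff:
# predictions = test_probs >= tune_thresholds(val_probs, val_labels)
```

Tuning these cutoffs on a held-out validation set, rather than the test set, keeps the reported gain honest.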

4. The Results: How Good Was the Guard?

  • Binary Detection (Toxic vs. Not Toxic): The guard got very good at this, scoring around 80% accuracy. It could tell the difference between a heated political argument and actual hate speech.
  • The Hard Part (Multi-Label): When asked to identify exactly what kind of hate it was (e.g., "Is this racist? Is it sexist?"), the guard struggled a bit more. This is because some posts are tricky.
    • Example: A post saying "Those people don't understand our way of life" sounds polite but is actually a hidden insult (a "dog whistle"). The AI missed these sometimes.
    • Example: People mixing English and Swahili in one sentence confused the AI's tokenizer (the part that breaks words into pieces), making the post hard for the model to read (illustrated in the sketch below).
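You can see the tokenizer problem for yourself. The sketch below assumes the microsoft/mdeberta-v3-base tokenizer and an invented code-switched sentence; the exact subword splits depend on the vocabulary.

```python
# A minimal illustration of code-switched text fragmenting in a subword
# tokenizer; the sentence is invented and the splits vary by vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")

mixed = "Hawa watu are just pretending, wanajifanya tu"  # Swahili + English
print(tokenizer.tokenize(mixed))
# Rare code-switched words tend to shatter into many short pieces,
# diluting the signal the classifier has to work with.
```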

5. What Didn't Work?

The student tried to train the guard on both English and Swahili data at the same time, hoping it would learn faster.

  • The Analogy: It was like trying to teach a learner French and Japanese simultaneously by mixing the textbooks. The learner got confused, and performance actually got worse.
  • The Lesson: Sometimes it's better to train on one language at a time and build up the knowledge in stages, rather than mixing everything from the start (a rough sketch follows below).
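In code, the lesson is about the order of the data rather than the training loop itself. Below is a minimal PyTorch-style sketch contrasting the two curricula; the data loaders and the fine_tune helper are hypothetical stand-ins, not the paper's code.

```python
# A minimal sketch contrasting mixed vs. staged training; the loaders
# and helper are hypothetical stand-ins, not the paper's code.
def fine_tune(model, loader, optimizer, loss_fn):
    model.train()
    for inputs, labels in loader:  # one full pass over one dataset
        optimizer.zero_grad()
        loss = loss_fn(model(**inputs).logits, labels)
        loss.backward()
        optimizer.step()

# What hurt: shuffling English and Swahili together from the start.
# fine_tune(model, mixed_en_sw_loader, optimizer, loss_fn)

# The lesson: adapt in stages instead.
# fine_tune(model, english_loader, optimizer, loss_fn)  # stage 1
# fine_tune(model, swahili_loader, optimizer, loss_fn)  # stage 2 (target)
```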

Summary

This paper teaches us that when building AI to detect hate speech in low-resource languages (like Swahili):

  1. Don't just use local models: A general, multilingual model often works better than a specialized one.
  2. Adjust the sensitivity: You can't just use a standard "yes/no" rule. You have to fine-tune the "alarm threshold" for every specific type of hate speech.
  3. Beware of mixing languages too early: Training on multiple languages at once can sometimes confuse the AI.

The student's system is now a strong tool for spotting online toxicity, but it still needs to get better at understanding hidden insults and mixed languages.
