ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

The paper proposes ConLID, a supervised contrastive learning approach that learns domain-invariant representations to significantly improve language identification performance for low-resource languages on out-of-domain data while maintaining accuracy for high-resource languages.

Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut

Published Wed, 11 Ma

Imagine you are the head librarian of a massive, chaotic library that contains books in nearly every language on Earth. Your job is to sort these books onto the correct shelves. If you put a French novel on the Spanish shelf, the next person looking for Spanish books will get confused, and the whole system breaks down.

This is the job of Language Identification (LID). It's the digital "sorting hat" that decides what language a piece of text is written in.

For popular languages like English or Spanish, this is easy. We have millions of books (data) to learn from. But for low-resource languages (languages with little available digital text), it's like sorting a library where you have only a single book for a given language, and that book happens to be a religious text (like the Bible). If you learned a language only by reading the Bible, you might think every sentence in that language sounds like a prayer. When you then see a news article or a text message in that same language, you get confused, because it doesn't look like the Bible.

The Problem: The "Bible Bias"

The researchers found that current AI models are great at sorting common languages but terrible at these rare ones. Why? Because the data they are trained on is often biased.

  • The Analogy: Imagine trying to learn to recognize a "Dog" by only looking at pictures of Golden Retrievers. If you then see a Chihuahua or a Great Dane, you might not recognize it as a dog. Similarly, if an AI learns a language only from religious texts, it fails when it sees that language used in a weather report or a tweet.

The Solution: ConLID (The "Group Hug" Method)

The authors propose a new method called ConLID, which uses something called Supervised Contrastive Learning.

Here is how it works, using a simple analogy:

1. The Old Way (Cross-Entropy):
Imagine a teacher asking a student, "Is this a dog?" The student just memorizes the answer: "Yes." They don't really understand why it's a dog; they just know the label. This works fine if the dog looks exactly like the ones in the textbook, but fails if the dog looks different.

2. The New Way (Contrastive Learning):
Instead of just memorizing labels, the teacher organizes a giant game of "Find Your Tribe."

  • The Rule: All sentences in the same language must stand close together in a giant room (the "embedding space").
  • The Rule: Sentences in different languages must stand as far apart as possible.
  • The Twist: The teacher makes sure the "same language" group includes people wearing different clothes (different domains: news, Bible, chat logs). This forces the AI to learn the core essence of the language, ignoring whether it's talking about religion or politics.
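The "Find Your Tribe" game corresponds to a supervised contrastive loss: for each sentence embedding (anchor), maximize its similarity to other embeddings with the same language label and minimize similarity to everything else. Here is a minimal NumPy sketch of that objective (the function name and temperature value are illustrative, not the paper's exact implementation):

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Minimal supervised contrastive loss sketch.

    Same-language embeddings are pulled together (high similarity),
    different-language embeddings are pushed apart.
    """
    # Normalize so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature              # pairwise scaled similarities
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # never contrast with yourself
    # Log-softmax over all other samples for each anchor.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Positives: same label, excluding the anchor itself.
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # Average log-probability of positives per anchor, then negate.
    pos_log_prob = np.where(pos, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob / np.maximum(pos.sum(axis=1), 1)
    return per_anchor[pos.any(axis=1)].mean()
```

A well-clustered batch (same-language sentences near each other) yields a lower loss than a scrambled one, which is exactly the pressure that shapes the embedding space during training.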

The Secret Sauce: The "Memory Bank"

There's a catch. To play "Find Your Tribe" effectively, you need a huge crowd. But for rare languages, you might only have 10 people in the room. That's not enough to learn who belongs where.

The researchers solved this with a Memory Bank.

  • The Analogy: Imagine the teacher has a giant photo album of everyone who has ever walked through the door in the last hour. Even if the current group of students is small, the teacher can say, "Look at this person in the photo album; they speak the same language as you, so stand next to them!"
  • This allows the AI to learn from a much larger, more diverse group of examples than it physically has in its current training batch, making the "tribe" much stronger and more accurate.
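The photo album can be sketched as a fixed-size FIFO buffer that keeps embeddings (and their language labels) from recent batches, so each new batch is contrasted against a much larger pool. This is a generic memory-bank sketch under assumed names (`MemoryBank`, `contrast_set`), not the paper's exact code:

```python
import numpy as np
from collections import deque

class MemoryBank:
    """FIFO store of past (embedding, label) pairs for contrastive learning."""

    def __init__(self, capacity=4096):
        # deque with maxlen drops the oldest entries automatically.
        self.buffer = deque(maxlen=capacity)

    def enqueue(self, embeddings, labels):
        """Add the current batch's embeddings to the bank."""
        for z, y in zip(embeddings, labels):
            self.buffer.append((z, y))

    def contrast_set(self, batch_embeddings, batch_labels):
        """Return the batch augmented with everything stored in the bank."""
        if not self.buffer:
            return batch_embeddings, batch_labels
        mem_z = np.stack([z for z, _ in self.buffer])
        mem_y = np.array([y for _, y in self.buffer])
        return (np.concatenate([batch_embeddings, mem_z]),
                np.concatenate([batch_labels, mem_y]))
```

In use, each training step would enqueue the fresh embeddings and compute the contrastive loss over `contrast_set(...)`, so a rare language with only a handful of sentences in the current batch still finds positives from earlier batches.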

The "Hard" Training

They also added a "Hard Mode" to the training.

  • Soft Mode: "Don't stand next to anyone who speaks a different language." (Easy).
  • Hard Mode: "Don't stand next to anyone who speaks a different language but uses the same alphabet and talks about the same topic."
  • Why? If you have two languages that both use the Latin alphabet and both have religious texts, they look very similar. The AI needs to be forced to learn the subtle differences between them, not just the obvious ones.
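"Hard Mode" amounts to mining hard negatives: for each anchor sentence, find the most similar example that is a *different* language but shares the same script, since those are the confusable cases. A small illustrative sketch (the function name and the fallback value `-1` are my own choices, not from the paper):

```python
import numpy as np

def hardest_negatives(embeddings, labels, scripts):
    """For each anchor, pick the most similar different-language example
    that uses the same script (the hardest case to tell apart)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T  # cosine similarity between all pairs
    hard = []
    for i in range(len(labels)):
        # Candidates: different language, same writing system.
        cand = (labels != labels[i]) & (scripts == scripts[i])
        if cand.any():
            idx = np.where(cand)[0]
            hard.append(idx[np.argmax(sim[i, idx])])
        else:
            hard.append(-1)  # no same-script negative in this batch
    return np.array(hard)
```

Emphasizing these near-miss pairs in the loss forces the model to learn the subtle differences between, say, two Latin-script languages with mostly religious training text, instead of relying on the alphabet or the topic.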

The Results: Sorting the Library

When they tested this new system:

  1. Low-Resource Languages: The system got significantly better at identifying rare languages, especially when the text wasn't religious. It improved by about 3.2% (which is huge in AI terms).
  2. Generalization: It didn't just memorize the training data; it learned to recognize the language even when the topic changed (e.g., from a Bible verse to a news headline).
  3. Real World: They tested it on a massive dataset of internet text (FineWeb-2). Even though the AI sometimes disagreed with the old "best" system, the researchers believe the new system was actually more correct for those difficult, rare languages.

The Bottom Line

ConLID is like upgrading the librarian's sorting system. Instead of just memorizing a list of book titles, the new system learns to recognize the soul of a language, regardless of whether the text is a holy book, a news article, or a text message. By grouping similar languages together and pushing different ones apart, it creates a much more robust and fair system for identifying languages, especially the ones that have been left behind.