ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition

This paper introduces ANCHOLIK-NER, the first benchmark dataset for Named Entity Recognition in five Bangla regional dialects, and evaluates transformer-based models on it to establish a foundational step for developing dialect-aware NLP systems.

Bidyarthi Paul, Faika Fairuj Preotee, Shuvashis Sarker, Shamim Rahim Refat, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque, Shahriar Manzoor

Published 2026-02-27

Imagine you are trying to teach a robot to understand a story. If you tell the robot a story in "Standard English," it does a great job. But if you tell it the same story in a thick Scottish accent, a Southern US drawl, or a specific New York slang, the robot might get confused. It might think "y'all" is a name, or it might miss that "biscuit" refers to a person in a specific context.

This paper is about doing the exact same thing, but for the Bangla language in Bangladesh.

Here is the breakdown of the paper, explained simply with some analogies:

1. The Problem: The "One-Size-Fits-All" Robot

For a long time, computer scientists built "Named Entity Recognition" (NER) systems. Think of NER as a robot's ability to read a sentence and point its finger at important things, saying, "That's a Person," "That's a City," or "That's a Food."
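To make that concrete, here is a tiny, made-up example (in Python) of what an NER system's output looks like. The sentence, tokens, and entity types (Person, Location, Food) are purely illustrative, and the BIO tagging convention shown is a common one; the paper's exact label set may differ.

```python
# Illustrative only: a toy sentence and the kind of labels an NER system produces.
# The entity types (PER, LOC, FOOD) and the BIO scheme are common conventions;
# the exact label set used in ANCHOLIK-NER may differ.

sentence = ["Rahim", "went", "to", "Dhaka", "to", "buy", "a", "biscuit"]

# One label per token: "B-" marks the beginning of an entity,
# "I-" a continuation, and "O" means "not an entity".
labels   = ["B-PER", "O", "O", "B-LOC", "O", "O", "O", "B-FOOD"]

for token, label in zip(sentence, labels):
    print(f"{token:10} -> {label}")
```

The robot's entire job is to attach one of these labels to every word it reads.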

  • The Status Quo: Scientists built excellent robots for "Standard Bangla" (the formal language used in schools and news).
  • The Glitch: Bangladesh is full of amazing regional dialects (like Chittagong, Sylhet, Barishal, etc.). These dialects are like different flavors of ice cream; they taste like Bangla, but the ingredients (words, grammar, pronunciation) are totally different.
  • The Issue: The existing robots were trained only on "Standard Bangla" (Vanilla Ice Cream). When they tried to read "Chittagong Bangla" (Spicy Mango Ice Cream), they got lost. They couldn't tell if a word was a city or just random noise.

2. The Solution: Building a "Dialect Dictionary" (ANCHOLIK-NER)

The authors decided to stop trying to force the robot to understand everything and instead build a better training manual. They created a new dataset called ANCHOLIK-NER.

  • What is it? It's a massive library of 17,405 sentences.
  • The Twist: These sentences aren't just in one language. They are in five different regional dialects: Sylhet, Chittagong, Barishal, Noakhali, and Mymensingh.
  • The Magic: They didn't just copy-paste. They took a sentence in Standard Bangla and had native speakers translate it into the local dialect, ensuring that if the original sentence mentioned "Dhaka," the dialect version also mentioned the local word for Dhaka. They made sure the "Person" and "Location" tags stayed aligned across every version, like a perfectly parallel translation (see the sketch at the end of this section).

Analogy: Imagine you have a map of a city. The old map only shows the main highways. The authors drew a new map that includes all the tiny alleyways, local shortcuts, and neighborhood signs in five different districts. Now, the robot has a GPS that actually works in the neighborhoods, not just the highways.
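To picture what "aligned" means in practice, here is a minimal sketch of how one parallel record could be stored. The field names and the <...> placeholder tokens are assumptions made for illustration, not the actual schema or contents of ANCHOLIK-NER; the only point is that every dialect version carries the same entity tags as the Standard Bangla original.

```python
# A minimal sketch of how one "aligned" record across dialects could be stored.
# Field names and the <...> placeholder tokens are illustrative assumptions,
# not the actual schema or contents of ANCHOLIK-NER.

record = {
    "standard": {
        "tokens":   ["Rahim", "Dhaka", "<standard verb>"],
        "ner_tags": ["B-PER", "B-LOC", "O"],
    },
    "sylhet": {
        "tokens":   ["Rahim", "Dhaka", "<Sylheti verb>"],       # surface words change...
        "ner_tags": ["B-PER", "B-LOC", "O"],                     # ...but the tags stay aligned
    },
    "chittagong": {
        "tokens":   ["Rahim", "Dhaka", "<Chittagonian verb>"],
        "ner_tags": ["B-PER", "B-LOC", "O"],
    },
}

# The property the authors enforce: every dialect version keeps the same
# entity tags as the Standard Bangla original.
for dialect, version in record.items():
    assert version["ner_tags"] == record["standard"]["ner_tags"]
    print(dialect, list(zip(version["tokens"], version["ner_tags"])))
```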

3. The Process: Cleaning and Labeling

Building this library wasn't easy. It was like organizing a giant, messy attic.

  • The Mess: The raw data had typos, mixed languages, and weird punctuation.
  • The Cleanup: They used computer scripts (like a digital vacuum cleaner) to remove the trash and separate words properly.
  • The Human Touch: They hired 10 native speakers (the "experts") to read every single sentence and tag the important words.
    • Example: In Standard Bangla, "Dhaka" is a location. In Sylheti dialect, the word might sound different, but it's still a location. The humans made sure the robot learned this.
  • The Quality Check: They had two people label the same sentence to make sure they agreed; disagreements were reviewed and fixed. This kept the "training manual" consistent and reliable (a rough sketch of this cleanup-and-agreement step follows below).
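Here is a minimal sketch of what the cleanup and the quality check could look like in Python. The cleaning rules and the agreement metric (Cohen's kappa via scikit-learn) are assumptions for illustration; the authors' own scripts and review process may differ.

```python
import re
from sklearn.metrics import cohen_kappa_score

# Minimal sketch of the two steps above. The cleaning rules and the agreement
# metric (Cohen's kappa) are assumptions for illustration; the authors' own
# scripts and review process may differ.

def clean(text: str) -> list[str]:
    """Drop stray punctuation, collapse whitespace, and split into tokens."""
    text = re.sub(r"[^\w\s]", " ", text)   # \w also keeps Bangla letters
    return text.split()

print(clean("Rahim,,  Dhaka   jabe!!"))    # -> ['Rahim', 'Dhaka', 'jabe']

# Quality check: two annotators tag the same tokens; measure how often they agree.
annotator_a = ["B-PER", "B-LOC", "O", "O",     "B-LOC"]
annotator_b = ["B-PER", "B-LOC", "O", "B-PER", "B-LOC"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))
```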

4. The Test: Who is the Best Robot?

Once they built the library, they tested three different "Robot Brains" (AI models) to see which could learn from this new data best (a sketch of how such a test is run appears after the results):

  1. Bangla BERT: A robot trained specifically on Bangla.
  2. Bangla BERT Base: A slightly lighter version of the above.
  3. BERT Multilingual: A robot trained on many languages (like a polyglot).

The Results:

  • The Winner: The Multilingual Robot (BERT Base Multilingual Cased) turned out to be the smartest overall. It got the highest score (about 82.6%) in the Mymensingh dialect. It was like a traveler who had visited many countries and could adapt quickly to local customs.
  • The Runner Up: The Bangla BERT robot was very strong in Barishal and Mymensingh.
  • The Struggle: The Chittagong dialect was the hardest for all robots. It's like a very thick, fast-paced accent that even the smartest robots found hard to decode. They made more mistakes there, confusing some words.
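For a concrete sense of how such a test is run, here is a minimal sketch using the Hugging Face transformers library. The base checkpoint "bert-base-multilingual-cased" is a real public model, but the label set below is an illustrative assumption, and in the paper the models are fine-tuned on ANCHOLIK-NER before being scored; this sketch only shows the tagging mechanics.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Sketch of the tagging mechanics only. "bert-base-multilingual-cased" is the
# public base checkpoint; the label set below is an illustrative assumption,
# and the paper's models are fine-tuned on ANCHOLIK-NER before being scored.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(LABELS))

sentence = "রহিম ঢাকা যাবে"   # illustrative Standard Bangla: "Rahim will go to Dhaka"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, num_subword_tokens, num_labels)

predicted = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, idx in zip(tokens, predicted):
    print(f"{token:15} -> {LABELS[idx]}")        # essentially random until fine-tuned
```

With an untrained classification head the predicted labels are noise; fine-tuning on the dataset's training sentences and scoring on held-out ones is what produces numbers like the 82.6% mentioned above.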

5. Why Does This Matter?

You might ask, "Why do we care about dialects?"

  • Inclusivity: Right now, if you use a Bangla app in Chittagong or Sylhet, it might not understand you. This research helps build apps that understand everyone, not just people who speak the "textbook" version of the language.
  • Real World: People speak in dialects on social media, in local news, and in hospitals. If a doctor's AI assistant doesn't understand the local dialect, it could miss important details about a patient's location or symptoms.

The Bottom Line

The authors didn't just build a better robot; they built a bridge. They created the first-ever "dictionary" that teaches computers how to understand the rich, diverse, and colorful dialects of Bangladesh.

Future Plans:
The authors admit the job isn't done. The "Chittagong" dialect still confuses the robots a bit. In the future, they want to:

  1. Add more dialects (like Khulna or Rajshahi).
  2. Teach the robots better techniques for handling the hardest dialects.
  3. Make sure no one is left out of the digital world because of how they speak.

In short: They took a language that was being ignored in the AI world, gave it a spotlight, and taught the machines to listen to the real voices of the people.
