ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition

This paper introduces ANCHOLIK-NER, the first benchmark dataset for Named Entity Recognition in five Bangla regional dialects, and evaluates transformer-based models on it to establish a foundational step for developing dialect-aware NLP systems.

Bidyarthi Paul, Faika Fairuj Preotee, Shuvashis Sarker, Shamim Rahim Refat, Shifat Islam, Tashreef Muhammad, Mohammad Ashraful Hoque, Shahriar Manzoor

Published 2026-02-27

Imagine you are trying to teach a robot to understand a story. If you tell the robot a story in "Standard English," it does a great job. But if you tell it the same story in a thick Scottish accent, a Southern US drawl, or a specific New York slang, the robot might get confused. It might think "y'all" is a name, or it might miss that "biscuit" refers to a person in a specific context.

This paper is about doing the exact same thing, but for the Bangla language in Bangladesh.

Here is the breakdown of the paper, explained simply with some analogies:

1. The Problem: The "One-Size-Fits-All" Robot

For a long time, computer scientists built "Named Entity Recognition" (NER) systems. Think of NER as a robot's ability to read a sentence and point its finger at important things, saying, "That's a Person," "That's a City," or "That's a Food."
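To make that concrete, here is a tiny, made-up example (in Python) of what an NER system's output looks like. The sentence, tokens, and entity types (Person, Location, Food) are purely illustrative, and the BIO tagging convention shown is a common one; the paper's exact label set may differ.

```python
# Illustrative only: a toy sentence and the kind of labels an NER system produces.
# The entity types (PER, LOC, FOOD) and the BIO scheme are common conventions;
# the exact label set used in ANCHOLIK-NER may differ.

sentence = ["Rahim", "went", "to", "Dhaka", "to", "buy", "a", "biscuit"]

# One label per token: "B-" marks the beginning of an entity,
# "I-" a continuation, and "O" means "not an entity".
labels   = ["B-PER", "O", "O", "B-LOC", "O", "O", "O", "B-FOOD"]

for token, label in zip(sentence, labels):
    print(f"{token:10} -> {label}")
```

The robot's entire job is to attach one of these labels to every word it reads.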

  • The Status Quo: Scientists built excellent robots for "Standard Bangla" (the formal language used in schools and news).
  • The Glitch: Bangladesh is full of amazing regional dialects (like Chittagong, Sylhet, Barishal, etc.). These dialects are like different flavors of ice cream; they taste like Bangla, but the ingredients (words, grammar, pronunciation) are totally different.
  • The Issue: The existing robots were trained only on "Standard Bangla" (Vanilla Ice Cream). When they tried to read "Chittagong Bangla" (Spicy Mango Ice Cream), they got lost. They couldn't tell if a word was a city or just random noise.

2. The Solution: Building a "Dialect Dictionary" (ANCHOLIK-NER)

The authors decided to stop trying to force the robot to understand everything and instead build a better training manual. They created a new dataset called ANCHOLIK-NER.

  • What is it? It's a massive library of 17,405 sentences.
  • The Twist: These sentences aren't just in one language. They are in five different regional dialects: Sylhet, Chittagong, Barishal, Noakhali, and Mymensingh.
  • The Magic: They didn't just copy-paste. They took a sentence in Standard Bangla and had native speakers translate it into the local dialect, ensuring that if the original sentence mentioned "Dhaka," the dialect version also mentioned the local word for Dhaka. They made sure the "Person" and "Location" tags stayed aligned across every version, like a perfectly parallel translation (see the sketch at the end of this section).

Analogy: Imagine you have a map of a city. The old map only shows the main highways. The authors drew a new map that includes all the tiny alleyways, local shortcuts, and neighborhood signs in five different districts. Now, the robot has a GPS that actually works in the neighborhoods, not just the highways.
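To picture what "aligned" means in practice, here is a minimal sketch of how one parallel record could be stored. The field names and the <...> placeholder tokens are assumptions made for illustration, not the actual schema or contents of ANCHOLIK-NER; the only point is that every dialect version carries the same entity tags as the Standard Bangla original.

```python
# A minimal sketch of how one "aligned" record across dialects could be stored.
# Field names and the <...> placeholder tokens are illustrative assumptions,
# not the actual schema or contents of ANCHOLIK-NER.

record = {
    "standard": {
        "tokens":   ["Rahim", "Dhaka", "<standard verb>"],
        "ner_tags": ["B-PER", "B-LOC", "O"],
    },
    "sylhet": {
        "tokens":   ["Rahim", "Dhaka", "<Sylheti verb>"],       # surface words change...
        "ner_tags": ["B-PER", "B-LOC", "O"],                     # ...but the tags stay aligned
    },
    "chittagong": {
        "tokens":   ["Rahim", "Dhaka", "<Chittagonian verb>"],
        "ner_tags": ["B-PER", "B-LOC", "O"],
    },
}

# The property the authors enforce: every dialect version keeps the same
# entity tags as the Standard Bangla original.
for dialect, version in record.items():
    assert version["ner_tags"] == record["standard"]["ner_tags"]
    print(dialect, list(zip(version["tokens"], version["ner_tags"])))
```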

3. The Process: Cleaning and Labeling

Building this library wasn't easy. It was like organizing a giant, messy attic.

  • The Mess: The raw data had typos, mixed languages, and weird punctuation.
  • The Cleanup: They used computer scripts (like a digital vacuum cleaner) to remove the trash and separate words properly.
  • The Human Touch: They hired 10 native speakers (the "experts") to read every single sentence and tag the important words.
    • Example: In Standard Bangla, "Dhaka" is a location. In Sylheti dialect, the word might sound different, but it's still a location. The humans made sure the robot learned this.
  • The Quality Check: They had two people label the same sentence to make sure they agreed; disagreements were reviewed and fixed. This kept the "training manual" consistent and reliable (a rough sketch of this cleanup-and-agreement step follows below).
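Here is a minimal sketch of what the cleanup and the quality check could look like in Python. The cleaning rules and the agreement metric (Cohen's kappa via scikit-learn) are assumptions for illustration; the authors' own scripts and review process may differ.

```python
import re
from sklearn.metrics import cohen_kappa_score

# Minimal sketch of the two steps above. The cleaning rules and the agreement
# metric (Cohen's kappa) are assumptions for illustration; the authors' own
# scripts and review process may differ.

def clean(text: str) -> list[str]:
    """Drop stray punctuation, collapse whitespace, and split into tokens."""
    text = re.sub(r"[^\w\s]", " ", text)   # \w also keeps Bangla letters
    return text.split()

print(clean("Rahim,,  Dhaka   jabe!!"))    # -> ['Rahim', 'Dhaka', 'jabe']

# Quality check: two annotators tag the same tokens; measure how often they agree.
annotator_a = ["B-PER", "B-LOC", "O", "O",     "B-LOC"]
annotator_b = ["B-PER", "B-LOC", "O", "B-PER", "B-LOC"]
print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))
```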

4. The Test: Who is the Best Robot?

Once they built the library, they tested three different "Robot Brains" (AI models) to see which could learn from this new data best (a sketch of how such a test is run appears after the results):

  1. Bangla BERT: A robot trained specifically on Bangla.
  2. Bangla BERT Base: A slightly lighter version of the above.
  3. BERT Multilingual: A robot trained on many languages (like a polyglot).

The Results:

  • The Winner: The Multilingual Robot (BERT Base Multilingual Cased) turned out to be the smartest overall. It got the highest score (about 82.6%) in the Mymensingh dialect. It was like a traveler who had visited many countries and could adapt quickly to local customs.
  • The Runner Up: The Bangla BERT robot was very strong in Barishal and Mymensingh.
  • The Struggle: The Chittagong dialect was the hardest for all robots. It's like a very thick, fast-paced accent that even the smartest robots found hard to decode. They made more mistakes there, confusing some words.
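For a concrete sense of how such a test is run, here is a minimal sketch using the Hugging Face transformers library. The base checkpoint "bert-base-multilingual-cased" is a real public model, but the label set below is an illustrative assumption, and in the paper the models are fine-tuned on ANCHOLIK-NER before being scored; this sketch only shows the tagging mechanics.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Sketch of the tagging mechanics only. "bert-base-multilingual-cased" is the
# public base checkpoint; the label set below is an illustrative assumption,
# and the paper's models are fine-tuned on ANCHOLIK-NER before being scored.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(LABELS))

sentence = "রহিম ঢাকা যাবে"   # illustrative Standard Bangla: "Rahim will go to Dhaka"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, num_subword_tokens, num_labels)

predicted = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, idx in zip(tokens, predicted):
    print(f"{token:15} -> {LABELS[idx]}")        # essentially random until fine-tuned
```

With an untrained classification head the predicted labels are noise; fine-tuning on the dataset's training sentences and scoring on held-out ones is what produces numbers like the 82.6% mentioned above.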

5. Why Does This Matter?

You might ask, "Why do we care about dialects?"

  • Inclusivity: Right now, if you use a Bangla app in Chittagong or Sylhet, it might not understand you. This research helps build apps that understand everyone, not just people who speak the "textbook" version of the language.
  • Real World: People speak in dialects on social media, in local news, and in hospitals. If a doctor's AI assistant doesn't understand the local dialect, it could miss important details about a patient's location or symptoms.

The Bottom Line

The authors didn't just build a better robot; they built a bridge. They created the first-ever "dictionary" that teaches computers how to understand the rich, diverse, and colorful dialects of Bangladesh.

Future Plans:
The authors admit the job isn't done. The "Chittagong" dialect still confuses the robots a bit. In the future, they want to:

  1. Add more dialects (like Khulna or Rajshahi).
  2. Teach the robots better techniques for handling the hardest dialects.
  3. Make sure no one is left out of the digital world because of how they speak.

In short: They took a language that was being ignored in the AI world, gave it a spotlight, and taught the machines to listen to the real voices of the people.
