Imagine four government-run banks in Bangladesh (Sonali, Agrani, Janata, and Rupali) that have built mobile apps to help people manage their money. These apps are like digital branches, but instead of walking in, you tap a screen.
This paper is essentially a massive "report card" for these four apps, written by analyzing thousands of user reviews left on the Google Play Store. The researchers wanted to know: Are people happy? What are they complaining about? And can computers understand their complaints in both English and Bangla?
Here is the story of their findings, broken down into simple concepts:
1. The Detective Work: Cleaning the Messy Data
The researchers started with over 11,000 reviews. It was a messy pile of data—some were duplicates, some were in languages other than English or Bangla, and some were just gibberish.
- The Analogy: Imagine trying to sort a giant bag of mixed-up marbles. They had to throw out the broken ones and the wrong colors until they were left with a clean bag of 5,652 reviews (mostly English, some Bangla).
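To make the sorting concrete, here is a minimal sketch of that cleaning pipeline. The paper's actual code is not published in this summary, so everything below is illustrative: the script-based language check (counting characters in the Unicode Bengali block, U+0980 to U+09FF) is a crude heuristic, and the length cutoff for "gibberish" is invented.

```python
def is_bangla(text):
    """Heuristic: call a review Bangla if most of its letters fall in
    the Unicode Bengali block (U+0980 to U+09FF)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    bengali = sum(1 for c in letters if "\u0980" <= c <= "\u09ff")
    return bengali / len(letters) > 0.5

def is_english(text):
    """Heuristic: call a review English if most of its letters are ASCII."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    return sum(1 for c in letters if c.isascii()) / len(letters) > 0.5

def clean_reviews(reviews):
    """Drop exact duplicates (after normalizing whitespace and case),
    very short gibberish, and reviews in neither English nor Bangla."""
    seen, kept = set(), []
    for text in reviews:
        norm = " ".join(text.split()).lower()
        if norm in seen or len(norm) < 3:
            continue
        if not (is_english(text) or is_bangla(text)):
            continue
        seen.add(norm)
        kept.append(text)
    return kept
```

Running `clean_reviews(["Good app", "good  app", "!!"])` keeps only the first review: the second is a duplicate once normalized, and the third has no letters at all.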
2. The Labeling Problem: Stars vs. Words
Usually, when you rate an app, 1 or 2 stars means "I hate it," and 4 or 5 means "I love it." But sometimes, people write a nice review but give 1 star because of a bug, or vice versa.
- The Solution: The researchers used a "hybrid" approach. The star rating served as the first guess, and an AI sentiment model then read the actual words. If the stars and the words disagreed, they threw that review out to avoid training on confusing labels.
- The Result: They ended up with a smaller, very reliable set of reviews where the stars and the words agreed.
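The agreement filter described above can be sketched in a few lines. Note that `model_label` here is a toy keyword matcher standing in for the paper's actual text model, and its word lists are invented for illustration; only the overall shape (star label vs. text label, keep on agreement) reflects the described method.

```python
def star_label(stars):
    """First guess from the rating: 1-2 stars negative, 4-5 positive.
    A 3-star rating is ambiguous, so it gets no label."""
    if stars <= 2:
        return "negative"
    if stars >= 4:
        return "positive"
    return None

def model_label(text):
    """Toy stand-in for the AI text model: a real system would use a
    trained sentiment classifier here, not keyword counting."""
    negative_words = {"slow", "crash", "bug", "worst"}
    positive_words = {"good", "great", "love", "excellent"}
    words = set(text.lower().split())
    pos = len(words & positive_words)
    neg = len(words & negative_words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None

def filter_agreed(reviews):
    """Keep only reviews where the stars and the words agree."""
    kept = []
    for text, stars in reviews:
        s, m = star_label(stars), model_label(text)
        if s is not None and s == m:
            kept.append((text, s))
    return kept
```

A review like `("Great features but", 1)` gets dropped: the star says negative, the text reads positive, so the label is unreliable.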
3. The Race: Old School vs. New School AI
The researchers pitted two types of computer models against each other to see which one could best guess if a review was positive or negative:
- The "Old School" Team: These are traditional, simpler math models (like Random Forest and SVM). Think of them as experienced, no-nonsense accountants who have seen it all.
- The "New School" Team: These are massive, complex AI models (like XLM-RoBERTa). Think of them as genius PhD students who have read the entire internet but might be overthinking things.
- The Surprise: The Old School accountants won. They were slightly more accurate and faster than the genius PhD students. The researchers realized that for this specific job (banking reviews), the simpler tools were actually better because the "genius" models needed more data to learn the specific slang and banking terms.
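At its core, a bake-off like this just evaluates both models on the same held-out reviews and compares accuracy. The harness below shows that shape; the two one-line "models" are placeholders for the study's real pipelines (e.g. TF-IDF with Random Forest, and fine-tuned XLM-RoBERTa), and the tiny test set is invented.

```python
def accuracy(model, test_set):
    """Fraction of held-out (text, gold_label) pairs the model gets right."""
    correct = sum(1 for text, gold in test_set if model(text) == gold)
    return correct / len(test_set)

# Placeholder "models" -- in the study these would be a trained
# classical pipeline and a fine-tuned transformer, respectively.
def classic_model(text):
    return "negative" if "slow" in text.lower() else "positive"

def transformer_model(text):
    return "positive"

test_set = [
    ("Very slow app", "negative"),
    ("Nice and easy to use", "positive"),
    ("Too slow to log in", "negative"),
]

print(accuracy(classic_model, test_set))      # the classic model catches the "slow" complaints
print(accuracy(transformer_model, test_set))  # the always-positive model misses both negatives
```

The important part is that both models see exactly the same test set, so the accuracy numbers are directly comparable.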
4. The "Aspect" Detective: What Exactly Are People Mad About?
Using a different, highly specialized AI (DeBERTa), they didn't just ask "Is this happy or sad?" They asked, "What specifically is making them sad?"
They looked at six categories: Speed, Security, Design, Customer Service, Features, and Transactions.
- The Verdict: The biggest complaints were about Speed (the app is slow) and Design (it's confusing to use).
- The Loser: One app, eJanata, was the clear underperformer. It had the worst ratings, the slowest speeds, and the most complaints about its design. It was like the student who failed every subject.
- The Winner: Rupali e-Bank was the most liked, though none of them were perfect.
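The idea of aspect detection can be sketched with a keyword lexicon, though this is a deliberately crude stand-in for the paper's DeBERTa-based classifier: the aspect names match the six categories above, but the keyword lists are invented for illustration.

```python
# Keyword lexicon per aspect -- a crude stand-in for a trained
# aspect classifier; these word lists are illustrative only.
ASPECT_KEYWORDS = {
    "speed": {"slow", "lag", "loading", "fast"},
    "security": {"otp", "password", "secure", "hacked"},
    "design": {"ui", "interface", "confusing", "layout"},
    "customer_service": {"support", "helpline", "response"},
    "features": {"feature", "statement", "transfer"},
    "transactions": {"transaction", "payment", "failed"},
}

def detect_aspects(text):
    """Return every aspect whose keywords appear in the review."""
    words = set(text.lower().split())
    return sorted(a for a, kw in ASPECT_KEYWORDS.items() if words & kw)
```

For a review like "The app is slow and the interface is confusing", this tagger flags both `speed` and `design`, which is exactly the kind of per-complaint breakdown the section describes.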
5. The Language Gap: The "English vs. Bangla" Inequality
This is the most critical finding. The researchers tested how well the AI understood reviews in English versus Bangla.
- The Result: The AI was 16% better at understanding English than Bangla.
- The Metaphor: Imagine a translator who is a native English speaker but only took a basic Bangla class. If you ask them to translate a complex legal document, they might get the English part right but miss the nuance in the Bangla part.
- Why it matters: Many rural users in Bangladesh speak only Bangla. If the bank uses this AI to automatically sort complaints, the Bangla speakers' complaints might get ignored or misunderstood because the computer doesn't "get" them as well as it gets English. This is a fairness issue.
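Measuring that gap is simple once each review is tagged with its language: split the test set by language and compute accuracy per group. The function below shows that calculation; the four-row example is invented, not the paper's data.

```python
def per_language_accuracy(predictions):
    """predictions: list of (language, gold_label, predicted_label).
    Returns accuracy per language, so the fairness gap is easy to read off."""
    totals, correct = {}, {}
    for lang, gold, pred in predictions:
        totals[lang] = totals.get(lang, 0) + 1
        if gold == pred:
            correct[lang] = correct.get(lang, 0) + 1
    return {lang: correct.get(lang, 0) / n for lang, n in totals.items()}
```

On a toy set where the model gets both English reviews right but only one of two Bangla reviews, this returns `{"en": 1.0, "bn": 0.5}`, making the disparity explicit rather than hidden inside one overall accuracy number.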
6. The Time Travel: How Sentiment Changed Over Time
Looking at reviews from 2021 to 2025, they noticed a pattern:
- The "Update" Curse: Every time the banks released a new version of the app, complaints would spike. It's like a restaurant changing its menu; people get confused and angry until they get used to it.
- The Trend: Over the years, the apps got slightly worse, with more negative reviews piling up, especially for the eJanata app.
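Spotting those post-update spikes amounts to bucketing reviews by month and tracking the share that are negative. Here is a minimal sketch of that computation, with invented sample months; a rising ratio right after a release date would show the "update curse" described above.

```python
from collections import defaultdict

def negative_ratio_by_month(reviews):
    """reviews: list of (month_str, label) pairs, e.g. ("2023-01", "negative").
    Returns the share of negative reviews per month, in month order."""
    counts = defaultdict(lambda: [0, 0])  # month -> [negatives, total]
    for month, label in reviews:
        counts[month][1] += 1
        if label == "negative":
            counts[month][0] += 1
    return {m: neg / total for m, (neg, total) in sorted(counts.items())}
```

A jump from 0.5 in one month to 1.0 in the next, aligned with an app update, is the signature of the complaint spikes the researchers observed.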
The Final Recommendations (The "To-Do List")
Based on this report, the authors suggest three things for the banks:
- Fix the Basics: Stop releasing apps that are slow or hard to use. Test them thoroughly before launching.
- Trust Management: When you update an app, do it slowly (like a "beta test" for a small group) so you don't anger everyone at once. Also, be transparent about security so people trust you.
- Respect the Language: The banks need to build better AI tools specifically for the Bangla language. If they want to serve their rural customers fairly, they can't rely on tools that only understand English well.
In a nutshell: The government banking apps are struggling with speed and design, one app is doing particularly poorly, and the technology used to listen to customers isn't fair to the Bangla-speaking majority yet. The banks need to listen to their users, fix the bugs, and build better tools to understand their local language.