Original authors: Rez Samantha Z. Floresca, Edric Castel C. Hao, Hannah Grachiella Buñales, Chelsea Dominique E. Temprosa, Georgianna Z. Reyes, Kervin Gabriel L. Chua

Published 2026-05-26✓ Author reviewed ⓘ

📖 5 min read🧠 Deep dive

CC BY 4.0

Original authors: Rez Samantha Z. Floresca, Edric Castel C. Hao, Hannah Grachiella Buñales, Chelsea Dominique E. Temprosa, Georgianna Z. Reyes, Kervin Gabriel L. Chua

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). ✨ This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper. Read full disclaimer

Imagine you are trying to teach a computer to spot the early signs of dementia just by listening to how people talk. The computer needs to recognize specific "tells" in speech, like repeating words, getting stuck, or using simpler sentences, which often happen when someone's memory is starting to fade.

The problem is that most of these "smart computers" (AI models) have only been trained on English. They are like expert detectives who have only ever solved crimes in London. If you suddenly show them a crime scene in Manila, where people speak a mix of Filipino and English (often called "Taglish"), the London detective gets confused and fails to solve the case.

This paper, titled "Forgotten Words," is a report card on how well these AI detectives perform when we switch the language from English to Filipino. Here is what the researchers found, broken down simply:

1. The "London Detective" vs. The "Manila Detective"

The researchers built a special test set. They took 2,000 real speech transcripts from English dementia patients and healthy people, and they manually translated them into Filipino. They didn't use a robot translator because robots tend to "clean up" messy speech, and the messiness (the pauses and repeats) is actually the clue they are looking for.

They then tested five different types of AI models:

The Old School: A simple math-based system (TF-IDF).
The Standard: The classic English-trained model (BERT).
The New Tech: A modernized English-only model (NeoBERT).
The Polyglot: A model trained on 100 languages (XLM-RoBERTa).
The Local Expert: A model trained specifically on Filipino text (RoBERTa-Tagalog).

2. The Big Surprise: "One Language, One Brain"

The most important finding is that knowing the disease in English doesn't help you know it in Filipino.

The Failure: When they trained the standard English model on English data and tested it on Filipino, its performance crashed. It went from being a 95% accurate detective in English to a 45% accurate one in Filipino. It was essentially guessing.
The Asymmetry: Interestingly, it was slightly easier for a model trained on Filipino to understand English than vice versa. This is likely because Filipino conversation naturally includes a lot of English words (code-switching), so the Filipino-trained model accidentally learned some English patterns. But a pure English model had no idea what to do with Filipino grammar.
The "New Tech" Trap: They tested NeoBERT, a fancy, modernized version of the English model. You might think, "Newer and faster means better, right?" Not here. NeoBERT was actually worse at switching languages. It became so specialized in English that it became rigid and couldn't adapt to Filipino at all. It's like a chef who is so perfect at making French cuisine that they can't even make a simple sandwich if you ask them to switch to Italian ingredients.

3. The Solution: The "Bilingual Classroom"

So, how do you fix a detective who only speaks one language? You don't buy a new detective; you teach the current one to speak both.

The researchers tried Bilingual Fine-Tuning. This is like putting the AI in a classroom where it has to learn from a mix of English and Filipino students at the same time.

The Result: This was a magic bullet. When the models were trained on both languages together, the performance gap disappeared. Whether the model was the "Old School" type, the "New Tech" NeoBERT, or the "Local Expert," they all suddenly became excellent detectives in both languages, scoring around 97% accuracy.
The Lesson: It didn't matter how fancy the model's architecture was. What mattered was what languages it was exposed to during its training. If the training data included both languages, the model learned to recognize the patterns of dementia regardless of the language. If it only saw one language, it got lost in the other.

4. Why This Matters (According to the Paper)

The paper concludes that for low-resource settings (places where there isn't a lot of data) and places where people mix languages (like the Philippines), you don't need a bigger or more complex AI model.

You just need to make sure the model learns from a mix of languages. The "secret sauce" isn't a better brain; it's a better vocabulary list that includes both English and Filipino.

Summary Analogy

Think of dementia detection like recognizing a specific song.

English-only models are like people who only know the song in English. If you play the song in Filipino, they don't recognize the melody.
NeoBERT is like a person who knows the English song perfectly and can sing it faster, but still doesn't recognize the Filipino version.
Bilingual Training is like teaching the person to listen to the song in both languages at the same time. Suddenly, they realize, "Oh, it's the same tune!" and they can recognize it no matter which language is sung.

The paper proves that to build a system that works for everyone, we must teach the AI to listen to everyone, not just the English speakers.

Technical Summary: Forgotten Words – Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech

Problem Statement

Dementia detection via spontaneous speech offers a scalable approach for cognitive screening, yet current Natural Language Processing (NLP) systems remain predominantly English-centric. This limitation is critical in the Philippines, where everyday speech frequently involves Filipino–English code-switching (Taglish), and no prior work has addressed NLP-based dementia detection in this context. Existing benchmarks for Filipino NLP focus on written text (e.g., news, social media) and fail to address naturalistic speech, clinical discourse, or cognitive diagnostic tasks. Furthermore, while transformer-based encoders dominate clinical NLP, their application to dementia detection has largely relied on architectural variants that differ only in pretraining data, leaving open the question of whether architectural modernization (e.g., NeoBERT) improves robustness in low-resource, cross-lingual clinical settings.

Methodology

Dataset Construction

To isolate language effects from domain effects, the authors constructed a parallel bilingual dataset of 4,000 conversational transcripts derived from DementiaBank.

Source: 2,000 English transcripts (1,000 dementia-positive, 1,000 healthy controls) drawn from multiple corpora and conversational/picture-description tasks within DementiaBank (rather than from the Cookie Theft task alone), giving a broader and more representative spread of speech samples for comparison against healthy controls.
Filipino Translation: The English set was manually translated into Filipino by human translators. Crucially, translators were instructed to preserve discourse-level markers of cognitive decline (repetitions, hesitations, false starts, syntactic degradation) rather than normalizing speech to fluency. Machine translation was avoided to prevent erasing diagnostic features.
Preprocessing: All transcripts underwent Unicode/whitespace normalization and lowercasing. Disfluencies were retained as they are established correlates of cognitive impairment. No stemming or lemmatization was applied to avoid degrading diagnostic signals. Sequences were truncated to 128 tokens.

Model Families and Baselines

Five model families were evaluated across three training regimes: English-only (EN), Filipino-only (TL), and Bilingual (EN+TL).

TF-IDF + Logistic Regression: A lexical baseline to assess surface-level token statistics.
BERT-base-uncased: Standard English-only pretraining.
NeoBERT: A modernized encoder architecture (using Rotary Positional Embeddings, Pre-LayerNorm, SwiGLU) pretrained exclusively on English (RefinedWeb).
XLM-RoBERTa: A 100-language multilingual model.
RoBERTa-Tagalog: A language-matched model pretrained on a large-scale Filipino corpus (TLUnified).

Experimental Protocol

Training: Models were fine-tuned using mean pooling over final hidden states (rather than [CLS] tokens) and AdamW optimization. Hyperparameters were selected via grid search to prevent loss divergence on small datasets.
Evaluation: Performance was measured using Macro-F1 and Accuracy via stratified 10-fold cross-validation.
Settings:
- In-domain: Train and test on the same language.
- Zero-shot Cross-lingual: Train on one language, test on the other.
- Bilingual: Train on the combined corpus, test on held-out mixed-language folds.
Metric: The Cross-lingual Generalization Gap ( $\Delta F1$ ) was defined as the absolute difference between in-domain and cross-lingual F1 scores.

Key Results

1. Cross-Lingual Failure in Monolingual Training

Strong in-domain performance did not transfer across languages.

English-trained BERT achieved an in-domain F1 of 0.952 on English but dropped to 0.455 on Filipino ( $\Delta = 0.497$ ).
Filipino-trained BERT achieved 0.981 on Filipino but dropped to 0.705 on English ( $\Delta = 0.276$ ).
The asymmetry suggests that English remains a stronger prior in representation space due to pretraining exposure, and fine-tuning on Filipino does not fully overwrite this geometry.

2. Architectural Modernization Does Not Ensure Robustness

NeoBERT, despite its architectural advances, did not improve cross-lingual robustness.

English-trained NeoBERT performed comparably to BERT in-domain (F1=0.952) but degraded significantly on Filipino (F1=0.617) with high variance ( $\sigma=0.109$ ).
This indicates that architectural modernization alone creates tighter monolingual decision boundaries that improve in-domain fidelity but reduce tolerance to linguistic variation.

3. The Role of Pretraining Coverage

XLM-RoBERTa (Multilingual) showed the smallest transfer gap from English to Filipino ( $\Delta=0.013$ ), suggesting a shared representation space. However, transfer from Filipino to English was weaker ( $\Delta=0.161$ ), likely due to English dominating its pretraining corpus.
RoBERTa-Tagalog (Language-matched) surprisingly achieved near-identical English-to-Filipino transfer ( $\Delta=0.017$ ) to XLM-RoBERTa. The authors attribute this to the extensive English lexical borrowing and code-switching inherent in conversational Filipino, allowing a Filipino-pretrained model to capture embedded English structures. However, it struggled more in the reverse direction ( $\Delta=0.218$ ).

4. Bilingual Fine-Tuning Eliminates Degradation

The most significant finding is that bilingual fine-tuning (training on both languages simultaneously) eliminated cross-lingual degradation across all transformer models.

All models converged to a Macro-F1 of 0.969–0.973 on the combined test set.
The cross-lingual gap shrank to 0.027–0.037 for all architectures, including NeoBERT.
This suggests that the primary bottleneck is not architectural capacity but representational alignment. Bilingual supervision forces the model to learn compatible regions in the embedding space for both languages.

5. Clinical Sensitivity

Under language shift, aggregate accuracy can obscure failure modes.

English-trained BERT maintained high dementia recall on Filipino (0.931) but collapsed on the Healthy class (F1=0.216), effectively predicting most Filipino samples as dementia-positive.
Bilingual training resolved these instabilities, with all transformer models achieving dementia recall >0.93 with low variance.

Significance and Claims

The paper claims to provide the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting.

The core conclusion is that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.

Architectural modernization (e.g., NeoBERT) alone does not yield consistent cross-lingual gains and may increase sensitivity to language shift.
Bilingual supervision is the most effective strategy for achieving stable, clinically consistent performance across languages, effectively eliminating the cross-lingual generalization gap.
The study highlights that for low-resource, code-switched settings like the Philippines, ensuring adequate linguistic coverage during task training is more critical than architectural modifications.

Limitations Acknowledged by the Authors

Data Source: The Filipino dataset was constructed via manual translation of English transcripts, not from organically collected local patient speech. While structural disfluencies were preserved, the semantic content reflects the original English source.
Modality: The study focuses exclusively on text, excluding acoustic features (pitch, pause duration) that are also diagnostic markers.
Interpretability: The mechanisms driving model decisions in multilingual contexts remain opaque, necessitating future work on interpretability for clinical trust.

Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech