Evaluating Large Language Models' Responses to Sexual and Reproductive Health Queries in Nepali

This paper introduces the LLM Evaluation Framework (LEAF) to assess Large Language Models' responses to 14,000 Nepali sexual and reproductive health queries. Only 35.1% of responses were "proper", revealing significant gaps in accuracy, cultural appropriateness, and safety and highlighting the urgent need for better evaluation criteria in low-resource, sensitive domains.

Medha Sharma, Supriya Khadka, Udit Chandra Aryal, Bishnu Hari Bhatta, Bijayan Bhattarai, Santosh Dahal, Kamal Gautam, Pushpa Joshi, Saugat Kafle, Shristi Khadka, Shushila Khadka, Binod Lamichhane, Shilpa Lamichhane, Anusha Parajuli, Sabina Pokharel, Suvekshya Sitaula, Neha Verma, Bishesh Khanal

Published 2026-03-25

Imagine you have a very smart, well-read robot friend who knows a lot about the world. You can ask it anything, and it answers instantly, without judging you. This is what Large Language Models (LLMs) like ChatGPT are.

Now, imagine you are in Nepal, and you have a very personal, sensitive question about your body, your health, or your family planning. Maybe you are shy to ask a doctor, or there isn't one nearby. You turn to this robot friend.

This paper is like a report card for that robot friend, but specifically for how well it handles these tricky, personal health questions in the Nepali language.

Here is the story of what the researchers did, explained simply:

1. The Problem: The "Perfect" Answer Isn't Enough

The researchers noticed that most people only check if a robot's answer is factually correct (like checking if a math answer is right). But for health questions, being "correct" isn't enough.

  • Analogy: Imagine you ask a tour guide, "Where is the nearest hospital?"
    • If they say, "It's 5 miles north," that is accurate.
    • But if they say it in a language you don't understand, or they give you a map to a hospital that closed 10 years ago, or they say, "Just ignore the pain," that is dangerous.
    • The robot needs to be accurate, but also safe, culturally polite, and easy to understand.

2. The Solution: The "LEAF" Framework

The team invented a new way to grade the robots, which they called LEAF (LLM Evaluation Framework). Think of LEAF as a four-legged stool: if one leg is broken, the stool falls over. The four legs (sketched in code after this list) are:

  1. Accuracy: Is the medical info right?
  2. Language: Did it answer in Nepali (not English)?
  3. Usability: Was the answer helpful, not too long, and culturally appropriate? (e.g., not suggesting a medicine you can't buy in Nepal).
  4. Safety: Did it avoid giving dangerous advice or saying something offensive?
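
To make the four-legs idea concrete, here is a minimal Python sketch of how a LEAF-style grade could be recorded. It is purely illustrative: the class and field names are our own, and in the paper the grading was done by human experts, not code. The key point it shows is that "Proper" requires passing all four checks at once.

```python
from dataclasses import dataclass

# Purely illustrative LEAF-style rubric; the names below are hypothetical,
# not the authors' implementation (grading in the paper was done by experts).
@dataclass
class LeafGrade:
    accurate: bool   # Is the medical information correct?
    in_nepali: bool  # Did it answer in the user's language?
    usable: bool     # Helpful, right length, culturally appropriate?
    safe: bool       # No dangerous or offensive content?

    def is_proper(self) -> bool:
        # All four legs of the stool must hold; failing any one fails the answer.
        return self.accurate and self.in_nepali and self.usable and self.safe

# Example: factually correct but answered in English -> not "Proper".
print(LeafGrade(accurate=True, in_nepali=False, usable=True, safe=True).is_proper())  # False
```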

3. The Experiment: A Massive Chat Session

The researchers built a chatbot and invited 9,000 real people from all over Nepal to talk to it for about 30 minutes.

  • Who asked? Regular people and Female Community Health Volunteers (local health workers).
  • What did they ask? Over 14,000 questions about periods, pregnancy, contraception, and more.
  • The Test: They asked the questions to two versions of the robot:
    • Robot A (ChatGPT-3.5): The standard, free version.
    • Robot B (ChatGPT-3.5 + RAG): The same model, but with a "textbook" attached (Retrieval-Augmented Generation), so it could look up facts from official Nepali health manuals before answering (see the sketch after this list).
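
For readers curious what "attaching a textbook" means in practice, here is a minimal, hypothetical sketch of the RAG pattern: fetch the most relevant passage from a local corpus, then paste it into the prompt ahead of the question. The tiny corpus, the word-overlap retrieval, and the prompt wording are all stand-ins; the paper's actual pipeline and the official Nepali manuals are not reproduced here.

```python
# Hypothetical stand-in corpus; the real system drew on official Nepali health manuals.
CORPUS = [
    "Iron-folic acid supplements are recommended throughout pregnancy.",
    "Oral contraceptive pills should be taken at the same time every day.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    # Toy retrieval: rank passages by word overlap with the question.
    # Real systems typically use embedding similarity instead.
    q_words = set(question.lower().split())
    return sorted(CORPUS, key=lambda p: -len(q_words & set(p.lower().split())))[:k]

def build_prompt(question: str) -> str:
    # The retrieved "textbook" passage is prepended so the model can draw on it.
    context = "\n".join(retrieve(question))
    return (
        "Answer using only the context below, in the user's language.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# The assembled prompt would then be sent to the chat model (e.g. GPT-3.5).
print(build_prompt("How should I take contraceptive pills?"))
```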

4. The Results: The Robot Is Smart, but Clumsy

After the chats, experts graded the answers using the LEAF framework. Here is what they found:

  • The "Perfect" Score was Rare: Only 35% of the answers were "Proper." This means they were accurate, safe, in the right language, and helpful.
  • Accuracy vs. Usefulness: The robot got the facts right 62% of the time. However, nearly half of those "correct" answers still had other problems.
    • Analogy: It's like a chef who cooks a delicious steak (accurate) but serves it on a plate that is too hot to touch (usability gap) or forgets to cut the meat (inadequate).
  • The Biggest Issue: The robot often gave inadequate answers. It didn't give enough detail or missed key parts of the question.
  • The Safety Issue: Surprisingly, the robot was mostly safe. Very few answers were offensive or dangerous (less than 1%). But the researchers warned that even one bad answer in health can be fatal.
  • Language Trouble: Sometimes the user asked in Nepali, and the robot replied in English. Other times, it mixed languages weirdly.
  • Robot B vs. Robot A: The robot with the "textbook" (RAG) was slightly better at facts, but still struggled with the nuances of the conversation.

5. The "GPT-4" Surprise

The researchers also tested a newer, smarter robot (GPT-4) on a small sample.

  • Good News: GPT-4 was much better. It gave "Proper" answers 59% of the time (compared to 35% for the older one).
  • Bad News: Even the super-smart GPT-4 sometimes got confused by "Romanized Nepali" (Nepali written with English letters, like "Kasto cha?"). It preferred the official script, Devanagari; the sketch below shows why the two look so different to a machine.
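
To see why Romanized Nepali is harder, here is a tiny, hypothetical check (not from the paper): Devanagari text is trivially recognizable from its Unicode range, while the same phrase in Latin letters is, character for character, indistinguishable from English.

```python
def contains_devanagari(text: str) -> bool:
    # The Devanagari script occupies the Unicode block U+0900-U+097F.
    return any("\u0900" <= ch <= "\u097f" for ch in text)

print(contains_devanagari("कस्तो छ?"))    # True  -> clearly Nepali
print(contains_devanagari("Kasto cha?"))  # False -> looks like English to a machine
```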

6. The Big Takeaway

The paper concludes that while AI is a promising tool for helping people in Nepal get health information anonymously, it is not ready to be a doctor yet.

  • The Metaphor: Think of the current AI as a student nurse who has read all the textbooks but has never actually held a patient's hand. They know the theory (accuracy) but often forget the bedside manner (usability and safety).
  • The Future: We need to train these robots better, especially on local culture and language, before we can fully trust them with sensitive health issues.

In short: The robot is smart, but it needs to learn how to be a good listener and a caring helper before it can replace a human health worker in Nepal.