Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you have four different "digital doctors" (AI chatbots) who are supposed to answer health questions. You ask them the same medical questions, but you ask them in six different languages: English, French, Russian, Arabic, Hebrew, and Thai.
This study is like a massive quality control test. The researchers didn't just ask the bots simple questions; they took real, messy health worries from online forums and asked the bots to respond to them. Then, they hired real doctors who speak those specific languages to grade the answers.
Here is what they found, explained simply:
1. The "Warm Hug" vs. The "Bad Map"
The most surprising discovery is a split between how the AI sounds and what the AI actually says.
- The Warm Hug (Empathy): The AI chatbots were great at sounding kind, caring, and warm, no matter what language you spoke. If you asked a question in Thai or Hebrew, the bot sounded just as sympathetic as it did in English. It was like a robot that learned to give a perfect, comforting hug in every language.
- The Bad Map (Clinical Substance): However, the actual medical advice was often a disaster in non-English languages. While the English answers were like a clear, accurate map to the hospital, the answers in Thai, Hebrew, and Arabic were often like maps with missing roads, wrong turns, or dead ends.
The Analogy: Imagine a tour guide who speaks perfect English and gives you a detailed, accurate map of the city. Now imagine that same guide trying to give you a map in a language they barely know. They might still smile warmly, hold your hand, and say, "Don't worry, I've got you!" (The Warm Hug), but the map they hand you might lead you into a river instead of the museum (The Bad Map).
2. The Language Matters More Than the Brand
You might think, "Well, maybe the 'Google' bot is better than the 'OpenAI' bot." The study found that it didn't matter which bot you used.
The biggest factor determining whether the advice was safe or dangerous was the language you spoke, not the company that made the bot.
- If you spoke English, the advice was generally safe and accurate.
- If you spoke Thai, Hebrew, or Arabic, the advice was significantly worse, regardless of whether you were talking to ChatGPT, Claude, Gemini, or DeepSeek.
It's like ordering a meal at a restaurant chain. Whether you go to the "Big Burger" or "Super Burger," if you order in a language the kitchen doesn't understand well, you might get a salad instead of a burger. The brand doesn't save you; the language you order in decides what lands on your plate.
3. The "Silent" Danger
The study found that the AI didn't usually make loud, obvious mistakes (like saying "Take this poison"). Instead, it made silent omissions.
- The Stroke Example: If a patient described symptoms of a stroke, the AI in English might say, "Go to the ER immediately; there is a 4.5-hour window for treatment." In other languages, the AI would say, "Go to the ER," but it would forget to mention the time limit. It didn't say the wrong thing; it just left out the most critical piece of information.
- The Carbon Monoxide Example: If a husband said his family felt sick and blamed "work stress," the AI in English might say, "Check for carbon monoxide; if everyone in the house is sick, it's not stress." In other languages, the AI would agree with the husband that it's just stress, missing the clue that could save lives.
The Analogy: It's like a doctor who tells you to take your medicine but forgets to tell you when to take it. The advice isn't "wrong" in a way you can easily argue with, but it's useless and dangerous because the most important part is missing.
4. The "Safe" Emergency Numbers
When people asked about emergencies in non-English languages, the bots often failed to give the correct local emergency number.
- In English, they knew to say "911" (in the US context) or the local number.
- In other languages, they often just said "Call emergency services" without giving a number, or gave a generic number that didn't work in that specific country. They were "safe" (they didn't give a wrong number like 911 to someone in Thailand), but they were unhelpful.
5. Why Does This Happen?
The researchers found that the problem gets worse the further a language is from English, both in how computers split its words into pieces (tokenization) and in how little data exists for that language online.
- Languages like Thai or Hebrew, which are structurally very different from English and have less digital data, suffered the most.
- The AI models seem to have been trained mostly on English data, so when they try to speak other languages, they are essentially "guessing" the medical facts while sounding very confident and kind.
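To get a feel for what "distance in tokenization" can mean, here is a minimal Python sketch. It is not the paper's measurement. It assumes a rough proxy: many tokenizers build their tokens up from UTF-8 bytes, and scripts such as Hebrew or Thai need more bytes per character than English does. The sample sentences are our own illustrations (each means roughly "I have a headache"), and the helper name `utf8_bytes_per_char` is hypothetical.

```python
# Illustrative sentences (our own, not from the study); each means roughly "I have a headache".
samples = {
    "English": "I have a headache",
    "French": "J'ai mal à la tête",
    "Russian": "У меня болит голова",
    "Arabic": "عندي صداع",
    "Hebrew": "יש לי כאב ראש",
    "Thai": "ฉันปวดหัว",
}


def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character (a crude proxy only)."""
    return len(text.encode("utf-8")) / len(text)


# Print one line per language so the scripts can be compared side by side.
for language, sentence in samples.items():
    print(f"{language:8s} {utf8_bytes_per_char(sentence):.2f} bytes/char")
```

With these samples, the English sentence comes out at exactly 1 byte per character and the Thai sentence at exactly 3, with the other languages in between. Real tokenizers merge bytes into larger pieces, so the true gap depends on each model's vocabulary, but the sketch shows why the same message can look much "longer" to a model in one script than in another.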
The Bottom Line
The paper concludes that current AI health tools are not ready for the whole world. They are excellent at sounding like a caring friend in any language, but they are often terrible at being a safe medical advisor in languages other than English.
The danger is that a patient might feel so comforted by the AI's warm tone that they trust the bad advice hidden inside it. The study warns that we cannot assume an AI is safe just because it speaks your language fluently; the "substance" of the answer often breaks down the moment you leave the English-speaking world.