Asymmetry between warmth and clinical substance in multilingual consumer health AI

This study finds that multilingual consumer health AI exhibits a critical asymmetry: clinical substance and safety vary significantly by language, often failing silently in non-English contexts, while the empathetic tone stays consistent across all languages.

Original authors: Ariel, D., Grumberg, L. R., Supakul, S., Wannasri, S., Mitchnik, I. Y., Lev, A., Ariyamethanon, W., Agbarieh, M., Miari, S., Laban, G., Hasid, B.

Published 2026-05-14

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you have four different "digital doctors" (AI chatbots) who are supposed to answer health questions. You ask them the same medical questions, but you ask them in six different languages: English, French, Russian, Arabic, Hebrew, and Thai.

This study is like a massive quality control test. The researchers didn't just ask the bots simple questions; they took messy, real-world health worries from online forums and asked the bots to respond. Then clinicians who speak each language graded the answers.

Here is what they found, explained simply:

1. The "Warm Hug" vs. The "Bad Map"

The most surprising discovery is a split between how the AI sounds and what the AI actually says.

The Warm Hug (Empathy): The AI chatbots were great at sounding kind, caring, and warm, no matter what language you spoke. If you asked a question in Thai or Hebrew, the bot sounded just as sympathetic as it did in English. It was like a robot that learned to give a perfect, comforting hug in every language.
The Bad Map (Clinical Substance): However, the actual medical advice was noticeably worse, and sometimes dangerous, in non-English languages. While the English answers were like a clear, accurate map to the hospital, the answers in Thai, Hebrew, and Arabic were often like maps with missing roads, wrong turns, or dead ends.

The Analogy: Imagine a tour guide who speaks perfect English and gives you a detailed, accurate map of the city. Now imagine that same guide trying to give you a map in a language they barely know. They might still smile warmly, hold your hand, and say, "Don't worry, I've got you!" (The Warm Hug), but the map they hand you might lead you into a river instead of the museum (The Bad Map).

2. The Language Matters More Than the Brand

You might think, "Well, maybe the 'Google' bot is better than the 'OpenAI' bot." The study found that which bot you used mattered far less than the language you asked in.

The biggest factor determining whether the advice was safe or dangerous was the language you spoke, not the company that made the bot.

If you spoke English, the advice was generally safe and accurate.
If you spoke Thai, Hebrew, or Arabic, the advice was significantly worse, regardless of whether you were talking to ChatGPT, Claude, Gemini, or DeepSeek.

It's like ordering a meal at a restaurant chain. Whether you go to the "Big Burger" or "Super Burger," if you order in a language the kitchen doesn't understand well, you might get a salad instead of a burger. The brand doesn't save you; the language barrier decides what ends up on your plate.

3. The "Silent" Danger

The study found that the AI didn't usually make loud, obvious mistakes (like saying "Take this poison"). Instead, it made silent omissions.

The Stroke Example: When a patient described symptoms of a stroke, the bots told them to go to the ER, but none of the 24 responses (four bots in six languages, English included) conveyed that treatment is time-critical, for example the 4.5-hour window for clot-dissolving therapy. The AI didn't say the wrong thing; it just left out the most critical piece of information.
The Carbon Monoxide Example: When a husband said his family felt sick and blamed "work stress," none of the 24 responses used the key clue, that several people in the same house were sick at once, to challenge the stress explanation and point toward carbon monoxide. The clue that could save lives went unused.

The Analogy: It's like a doctor who tells you to take your medicine but forgets to tell you when to take it. The advice isn't "wrong" in a way you can easily argue with, but it's useless and dangerous because the most important part is missing.

4. The Missing Emergency Numbers

When people asked about emergencies in non-English languages, the bots often failed to give the correct local emergency number.

In English, they knew to say "911" (in the US context) or the local number.
In other languages, they often just said "Call emergency services" without giving a number, or fell back on US-centric defaults such as 911. Only about a third (34.5%) of non-English emergency responses gave the correct local number.

5. Why Does This Happen?

The researchers found that the problem tends to get worse the further a language is from English, both in how computers break words into pieces (tokenization) and in how much data exists for that language online. These are associations, not proven causes.

Languages like Thai or Hebrew, which are structurally very different from English and have less digital data, suffered the most.
The AI models seem to have been trained mostly on English data, so when they try to speak other languages, they are essentially "guessing" the medical facts while sounding very confident and kind.
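
Tokenization fertility is usually described as the average number of tokens a model needs per word. The Python sketch below illustrates the idea under a loud assumption: it uses a crude byte-level tokenizer (one token per UTF-8 byte) rather than any chatbot's real tokenizer. The function name and sample phrases are illustrative and do not come from the paper.

```python
def byte_fertility(text: str) -> float:
    """Average number of UTF-8 bytes per whitespace-separated word.

    Assumption: a byte-level tokenizer with no learned merges, so each
    byte counts as one token. Real tokenizers merge frequent byte
    sequences, often favoring languages that dominate their training data.
    """
    words = text.split()
    tokens = sum(len(word.encode("utf-8")) for word in words)
    return tokens / len(words)


english = "chest pain"
hebrew = "כאב בחזה"  # "chest pain" in Hebrew

fertility_en = byte_fertility(english)
fertility_he = byte_fertility(hebrew)
print(fertility_en, fertility_he)
```

Under this simplification the Hebrew phrase costs more tokens per word than the English one, because each Hebrew letter takes two bytes in UTF-8. A production tokenizer would give different numbers.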

The Bottom Line

The paper concludes that current AI health tools are not ready for the whole world. They are excellent at sounding like a caring friend in any language, but they are often terrible at being a safe medical advisor in languages other than English.

The danger is that a patient might feel so comforted by the AI's warm tone that they trust the bad advice hidden inside it. The study warns that we cannot assume an AI is safe just because it speaks your language fluently; the "substance" of the answer often breaks down the moment you leave the English-speaking world.

Technical Summary: Asymmetry between warmth and clinical substance in multilingual consumer health AI

Problem Statement
While consumer Large Language Model (LLM) chatbots are increasingly used for health inquiries across diverse languages, their clinical performance has been evaluated almost exclusively on English-language tasks. Existing benchmarks (e.g., MedQA, MedMCQA) focus on accuracy and safety for English inputs, leaving a critical gap in understanding whether these models perform safely and effectively for patients querying in Hebrew, Arabic, Thai, Russian, or French. The authors posit that a "confidently wrong" AI statement is challengeable, but an omission—a failure to provide critical safety information—leaves no signal that something is missing. The study addresses whether clinical quality degrades across languages and whether this degradation is uniform or specific to certain dimensions of care (e.g., clinical substance vs. empathetic tone).

Methodology
The study employed a $4 \times 6 \times 21$ factorial design, crossing four widely deployed consumer LLM chatbots (ChatGPT, Claude, Gemini, DeepSeek) with six languages (English, Hebrew, French, Russian, Arabic, Thai) and 21 clinical scenarios.
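
A fully crossed design like this can be enumerated directly. The sketch below is illustrative only: the scenario IDs S01 to S21 are an assumption modeled on labels such as S08 and S16, and the variable names are hypothetical.

```python
from itertools import product

chatbots = ["ChatGPT", "Claude", "Gemini", "DeepSeek"]
languages = ["English", "Hebrew", "French", "Russian", "Arabic", "Thai"]
# Assumed ID scheme: S01..S21 (the summary only names a few IDs, e.g. S08, S16).
scenarios = [f"S{i:02d}" for i in range(1, 22)]

# One cell per (chatbot, language, scenario) combination.
design = list(product(chatbots, languages, scenarios))

total_cells = len(design)
# Responses available for a single scenario: every chatbot in every language.
per_scenario = sum(1 for _, _, scenario in design if scenario == "S16")

print(total_cells)   # 504
print(per_scenario)  # 24
```

The per-scenario count of 24 (4 chatbots × 6 languages) is why single-scenario findings are reported out of 24.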

Data Source: Scenarios were derived from real patient posts on language-matched health forums, adapted by clinicians to preserve clinical content and ambiguity while removing identifying information.
Response Generation: Each chatbot generated a response to every scenario in every language (504 total responses) using a zero-shot, single-turn, temperature-0.7 setting with no system prompt.
Evaluation: Two language-matched clinicians (with C1/C2 proficiency or native status) rated each response on five Likert dimensions (1–5):
1. Clinical Accuracy
2. Safety
3. Referral Appropriateness
4. Cultural and Local Appropriateness
5. Empathy
Analysis: The five dimensions were partitioned into a "clinical-substance" layer (accuracy, safety, referral, cultural) and an "affective-surface" layer (empathy). Variance decomposition was performed using Type II ANOVA and linear mixed-effects models to attribute variance to language, chatbot identity, and their interaction.
Supplementary Arms: The study included paired English controls (English prompts with local context), cross-lingual anchoring tests (family-minimization framing), and a remediation stress test.
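
For readers unfamiliar with partial $\eta^2$, the standard definition is $SS_{\text{effect}} / (SS_{\text{effect}} + SS_{\text{error}})$. The sketch below computes it by hand for a tiny balanced two-factor layout. It is not the authors' analysis: the scores are made up, the bot names are placeholders, and the mixed-effects modeling is not reproduced.

```python
from statistics import mean

# Made-up ratings for illustration only (not the study's data):
# scores[(language, chatbot)] -> replicate ratings on a 1-5 scale.
scores = {
    ("English", "BotA"): [5, 4, 5, 4],
    ("English", "BotB"): [4, 4, 4, 5],
    ("Thai", "BotA"): [3, 2, 3, 4],
    ("Thai", "BotB"): [3, 3, 2, 4],
}


def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Share of (effect + error) variance attributable to the effect."""
    return ss_effect / (ss_effect + ss_error)


def factor_ss(data: dict, position: int) -> float:
    """Sum of squares for one factor, valid for a balanced design."""
    grand = mean(x for cell in data.values() for x in cell)
    levels = sorted({key[position] for key in data})
    total = 0.0
    for level in levels:
        values = [x for key, cell in data.items()
                  if key[position] == level for x in cell]
        total += len(values) * (mean(values) - grand) ** 2
    return total


# Within-cell (residual) sum of squares.
ss_error = sum((x - mean(cell)) ** 2 for cell in scores.values() for x in cell)

eta_language = partial_eta_squared(factor_ss(scores, 0), ss_error)
eta_chatbot = partial_eta_squared(factor_ss(scores, 1), ss_error)
print(round(eta_language, 3), round(eta_chatbot, 3))
```

In a balanced layout like this one, the different ANOVA sum-of-squares types coincide, so the hand calculation matches what a Type II table would report for the main effects. With unbalanced data, a statistics package would be the safer route.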

Key Results

Language Outweighs Chatbot Identity: The patient's input language was the dominant source of variance in clinical-substance dimensions, far exceeding the variance attributable to the specific chatbot used.
- Clinical Substance: Language accounted for a partial $\eta^2$ of 0.275 in the clinical-substance composite, compared to 0.035 for chatbot identity.
- Empathy: In contrast, empathy showed minimal language effect ($\eta^2 = 0.029$), indicating that the "warmth" of the response was relatively preserved across languages even when clinical substance degraded.
Safety Disparities: Catastrophic safety ratings (safety $\le$ 2) ranged 4.3-fold by language, from 3.6% in English to 15.5% in Hebrew and Thai. Under descriptive standardization, 62% of catastrophic ratings represented an excess over the English baseline.
Systematic Omissions vs. Confident Errors: The study identified "shared blind spots" where failures were systematic omissions rather than confident factual contradictions.
- Stroke (S16): 0/24 responses conveyed time-criticality (e.g., the 4.5-hour thrombolysis window).
- Carbon Monoxide (S08): 0/24 responses used the multi-victim symptom pattern to refute a family member's "stress" hypothesis.
- Occupational Anaphylaxis (S11): 0/24 responses framed the exposure as an occupational health issue requiring investigation.
- Sentinel Facts: In a set of 120 fact-bearing responses, 0/120 contained confidently wrong statements, suggesting omission is the dominant failure mode.
Localization Gaps: Chatbots frequently defaulted to diaspora or US-centric medical structures (e.g., suggesting "Coumadin" instead of the Russian generic "Warfarin," or providing US 911 instead of local emergency numbers). Only 34.5% of non-English emergency responses provided the correct local emergency number.
Warmth-Clinical Substance Decoupling: Warmth did not discriminate clinical danger. The Area Under the Curve (AUC) for empathy predicting catastrophic safety was 0.49 (chance level). Catastrophic responses were rated as "warm" at rates indistinguishable from non-catastrophic ones (18.9% vs 19.1%).
Predictive Factors: Three language properties were associated with the safety gradient: URIEL typological distance from English (AUC 0.93), tokenization fertility (AUC 0.84), and Joshi resource tier (AUC 0.88).
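
An AUC near 0.5 means a score cannot tell the two groups apart. One common way to compute it is the pairwise (Mann-Whitney) form: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as half. The sketch below uses made-up empathy ratings and a hypothetical function name. It illustrates the statistic, not the authors' code or data.

```python
def pairwise_auc(scores_pos: list, scores_neg: list) -> float:
    """Probability that a positive case outscores a negative one (ties = 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up empathy ratings (1-5), for illustration only.
empathy_catastrophic = [4, 5, 3, 4]   # responses rated safety <= 2
empathy_other = [4, 5, 3, 4, 4]       # all other responses

auc_empathy = pairwise_auc(empathy_catastrophic, empathy_other)
print(auc_empathy)  # 0.5: empathy carries no information about danger here

# A score that separates the groups perfectly reaches 1.0.
auc_perfect = pairwise_auc([4, 5], [1, 2])
print(auc_perfect)  # 1.0
```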

Significance and Claims
The paper claims that the current deployment of consumer health AI exhibits a structural asymmetry: the affective surface (warmth/empathy) remains robust across languages, while the clinical substance (accuracy, safety, referral) degrades significantly in non-English, lower-resource languages.

Equity Implications: The findings parallel health-equity gradients in non-AI care, with one difference in mechanism: the gradient is mediated by training-data composition and localization coverage, which are within vendor control, rather than by distributed clinician behavior.
Evaluation Standards: The authors argue against treating English-only testing as evidence of multilingual clinical quality. They support language-matched evaluation in deployment languages, prioritizing high-volume and high-risk use cases.
Safety Detection: The preservation of warmth in catastrophic responses creates a patient safety-detection problem, as the affective signal patients use to calibrate trust does not track clinical danger.
Limitations: The authors note that the study is correlational and that the language effect cannot be fully separated from cross-language rater-severity calibration, though sensitivity analyses (excluding the PI, fluent-only restrictions) preserved the main effects. The findings are hypothesis-generating regarding the specific mechanisms (e.g., tokenization fertility) and require prospective validation in deployment-candidate languages outside the study sample.

The study concludes that the convergence of universal omissions and language-graded substance loss across four independently trained vendors suggests these are properties of consumer health AI as currently deployed, necessitating upstream interventions in training data and localization strategies.