Ambient AI Documentation in Mixed-Language Encounters: A Heuristic Evaluation of Spanish-English and Mandarin-English Conversations

This study evaluates an ambient AI documentation system's performance in mixed-language clinical encounters, finding that while overall transcription error rates are low and language switching is generally detected reliably, significant challenges remain with Mandarin-English code-switching, including high error outliers and frequent deletions at switch points.

Original authors: Hu, D., Flores, D., Flores, L., Chien, R., Lam, K., Chow, E., Guo, Y., Tam, S., Perret, D., Pandita, D., Zheng, K.

Published 2026-05-22

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine a new kind of "smart scribe" for doctors. This is an Ambient AI tool that listens to the conversation between a patient and a doctor, writes it down word-for-word, and then turns that conversation into a medical note. It's like having a super-fast, tireless secretary who never misses a beat.

This paper asks a simple but crucial question: What happens when the doctor and patient switch between two languages within the same conversation?

In the real world, many patients and doctors switch back and forth between languages (like English and Spanish, or English and Mandarin) to make sure they understand each other. This is called "code-switching." The researchers wanted to see if this AI scribe could handle that "linguistic dance" without tripping over its own feet.

The Experiment: A Rehearsed Play

Since it's hard to get permission to record real private doctor visits, the researchers created a "rehearsed play." They took 24 real-life medical scenarios and had actors (who were actually researchers and medical students) act them out.

12 plays were in Spanish and English.
12 plays were in Mandarin and English.

They fed these recordings into the AI tool (called Abridge) and then compared what the AI wrote down against the "perfect script" (the reference transcript) to see how many mistakes it made.
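To make the comparison against the "perfect script" concrete, here is a minimal sketch of a word error rate calculation, a common way to score a transcript against a reference. The summary does not say exactly how the authors normalized or tokenized the text, so the whitespace splitting and lowercasing below are assumptions, and the example sentences are invented for illustration. Mandarin is often scored per character rather than per word, which this sketch does not do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; real pipelines may
    # also strip punctuation or segment Chinese text differently.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented example: three reference words collapse into one wrong word.
reference = "me duele la cabeza and my blood pressure is high"
hypothesis = "me duele la cabeza and depression is high"
print(f"{word_error_rate(reference, hypothesis):.2f}")  # 0.30
```

Here one substitution plus two deletions over ten reference words gives 0.30, far above the averages reported in the scorecard below.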

The Scorecard: How Did the AI Do?

1. The Spanish-English Duo: The Smooth Dancers
When the actors switched between Spanish and English, the AI did a pretty good job.

The Error Rate: It made very few mistakes (about 4% on average).
The Vibe: It was consistent. Whether the conversation was short or long, the AI stayed on track.
The Catch: It occasionally got confused by phrases that sound alike (for example, writing "depression" when the patient was talking about their blood pressure, because the two Spanish phrases sound similar).

2. The Mandarin-English Duo: The Stumbling Blocks
When the actors switched between Mandarin and English, the AI struggled more.

The Error Rate: The mistakes were higher (about 9% on average), but the real problem was variability. Some conversations were fine, but others were a disaster, with error rates skyrocketing to 67%.
The Big Drop: The most common mistake wasn't swapping words; it was deleting them. Imagine the AI listening to a sentence and suddenly deciding, "I'm going to skip the next 50 words," leaving a huge gap in the medical note. This happened frequently when the speaker switched from English to Mandarin.
The Confusion: The AI sometimes got lost at the exact moment the language changed, dropping entire chunks of the conversation.
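The difference between swapping words and dropping them can be illustrated by splitting the edit distance into substitutions, deletions, and insertions. The sketch below is an assumption about how such a breakdown could be computed, not the authors' method; the function name `edit_counts` and the example sentence are hypothetical.

```python
def edit_counts(ref: list[str], hyp: list[str]) -> dict[str, int]:
    """Count substitutions, deletions, and insertions in a minimal alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right corner to label each edit.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]:
            i, j = i - 1, j - 1  # words match, no error
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            counts["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1
    return counts


# Invented example: the output stops at the switch from English to Mandarin.
ref_tokens = "your blood pressure is fine 你 最近 睡 得 好 吗".split()
hyp_tokens = "your blood pressure is fine".split()
print(edit_counts(ref_tokens, hyp_tokens))
# {'substitutions': 0, 'deletions': 6, 'insertions': 0}
```

A transcript whose errors are almost all deletions clustered after a language switch points to dropped speech rather than misheard words, which is the pattern described above.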

The "Glitch" Types: Where the AI Got Confused

The researchers found four main ways the AI messed up, which they explain with some fun analogies:

The "Sound-Alike" Trap (Phonetic Similarity):
The AI is like a person trying to guess a word based only on how it sounds, without looking at the context.
- Example: In Mandarin, a word for "liver" sounded so much like a word for "gallbladder" that the AI swapped them. In Spanish, "my pressure" sounded like "depression," so the AI wrote down a mental health issue instead of a blood pressure reading.
- Cross-Language Mix-up: The English word "bone" sounds very similar to a Mandarin word for "pump." The AI heard "bone" but wrote the Chinese character for "pump," creating a confusing medical note.
The "Over-zealous Translator" (Automatic Translation):
Sometimes, the AI didn't just write down what was said; it tried to translate it on the fly, even when it shouldn't have.
- Example: If a doctor said the English word "chemotherapy," the AI might write the Spanish word for it ("quimioterapia") because it thought the context demanded Spanish.
- The Pinyin Problem: Sometimes, instead of writing Chinese characters, the AI wrote the sounds out in Latin letters (Pinyin), or worse, "fake Pinyin" that didn't make sense. It's like trying to write a recipe in a language you only half-know.
The "Medical Jargon" Blind Spot:
The AI is great at everyday words but stumbles on complex medical terms, especially when they are spoken with an accent or mixed with another language.
- Example: A cholesterol-lowering medication called "Leqvio" was written as "Lekvia." A heart-monitor patch called "Zio" became "Xylem." It's like a translator who knows the word "apple" but has never heard of "avocado" and guesses "orange" instead.
The "Grammar Glitch" (Language-Specific Issues):
- Spanish: The AI sometimes changed the form of a verb (e.g., turning the conjugated "I smoke" into the infinitive "to smoke"), which changes the meaning of the patient's history.
- Mandarin: The AI sometimes mixed up "he," "she," and "it" because they all sound the same in Mandarin. It also randomly switched between Simplified and Traditional Chinese characters in the same sentence, like a writer who can't decide which script to use.

The Bottom Line

The paper concludes that while this AI scribe is impressive, it isn't ready for the full "multilingual dance" just yet.

It works well for Spanish-English conversations, with only minor hiccups.
It struggles with Mandarin-English conversations, often dropping large pieces of the conversation or getting confused at the moment the language switches.

Why does this matter?
If the AI deletes a chunk of the conversation or swaps a medical term, the doctor has to spend extra time reading the note, finding the missing pieces, and fixing the errors. This defeats the purpose of the tool, which is supposed to save doctors time and reduce burnout.

The study suggests that for these tools to be truly helpful for everyone, they need to get better at handling the "messy middle" where two languages collide, ensuring that no patient's story gets lost in translation.
