Vision-Language System using Open-Source LLMs for Gestures in Medical Interpreter Robots

This paper presents a privacy-preserving vision-language framework for medical interpreter robots that leverages locally deployed open-source LLMs and a novel annotated dataset to accurately detect clinical speech acts and generate human-like, appropriate robotic gestures.

Thanh-Tung Ngo, Emma Murphy, Robert J. Ross

Published Mon, 09 Ma

Imagine a robot doctor's assistant standing in a busy hospital room. Its job is to help a doctor talk to a patient who speaks a different language. Usually, these robots just translate words, which can feel a bit robotic and cold. But what if the robot could also nod, point, or hold up a hand at the exact right moment, just like a human would? That's the goal of this paper.

The researchers built a "smart body" for a medical robot that understands not just what is being said, but how it should be said physically. Here is how they did it, explained simply:

1. The Problem: The "Silent" Translator

In a hospital, words aren't enough. If a doctor says, "I need your permission to proceed," they might also hold out their hand or lean forward to show they are asking for consent. If they say, "Take a deep breath," they might demonstrate the motion.
Current translation tools are like a radio: they only broadcast the voice. They miss the body language. The researchers wanted a robot that could "speak" with its hands and arms, too, to make the patient feel safer and understood.

2. The Solution: A "Privacy-First" Brain

To make the robot smart enough to know when to gesture, they needed a brain (an AI) that could listen to the conversation and decide: "Is the doctor asking for permission? Is the doctor giving an instruction? Or is this just small talk?"

  • The Local Brain: Most AI models live in the "cloud" (huge servers far away). But in a hospital, patient privacy is everything. You don't want sensitive medical data flying over the internet. So, the team built a system that runs entirely on the robot's own computer. It's like having a personal librarian in the room who never leaves the building and never tells anyone what you read.
  • The "Few-Shot" Trick: They taught this local AI using a clever method called "few-shot prompting." Imagine you are teaching a child to recognize a cat. Instead of showing them a million pictures, you show them three examples: "This is a cat. This is a cat. This is a cat. Now, is this a cat?" The AI learned to spot "Consent" and "Instruction" sentences just by seeing a handful of examples.
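The few-shot idea can be sketched in code. This is a minimal, illustrative example of assembling a few-shot classification prompt for a locally hosted LLM; the label set ("Consent", "Instruction", "Other") matches the categories described above, but the example sentences and prompt wording are assumptions, not the paper's actual prompt.

```python
# Hedged sketch: few-shot prompt for clinical speech-act classification.
# The examples and phrasing are illustrative, not the paper's prompt.

FEW_SHOT_EXAMPLES = [
    ("I need your permission to proceed with the injection.", "Consent"),
    ("Please raise your left arm for me.", "Instruction"),
    ("The weather has been lovely this week.", "Other"),
]

def build_prompt(utterance: str) -> str:
    """Assemble a few-shot classification prompt to send to a local LLM."""
    lines = ["Classify each sentence as Consent, Instruction, or Other.", ""]
    for text, label in FEW_SHOT_EXAMPLES:
        lines.append(f"Sentence: {text}")
        lines.append(f"Label: {label}")
        lines.append("")  # blank line between examples
    lines.append(f"Sentence: {utterance}")
    lines.append("Label:")  # the model completes this line with its answer
    return "\n".join(lines)

prompt = build_prompt("Take a deep breath and hold it.")
print(prompt)
```

The prompt string would then be sent to the on-device model, whose one-word completion ("Consent", "Instruction", or "Other") drives the gesture system. No data ever leaves the robot.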

3. The Dataset: The "Gestural Dictionary"

To train the robot, the team needed a dictionary of medical conversations paired with gestures.

  • They took 58 real videos of doctors talking to patients (from a public YouTube channel).
  • They broke these videos into thousands of tiny clips.
  • They labeled each clip: "This sentence is a request for consent," or "This is an instruction to move an arm."
  • They even recorded the doctor's hand movements in these clips so the robot could learn to copy them.
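One annotated clip from such a dataset might look like the record below. This is a sketch of a plausible schema, not the paper's actual annotation format; every field name here is an assumption.

```python
from dataclasses import dataclass

# Illustrative record for one annotated clip.
# Field names are assumptions, not the paper's actual schema.
@dataclass
class AnnotatedClip:
    video_id: str         # source video identifier
    start_s: float        # clip start time, in seconds
    end_s: float          # clip end time, in seconds
    transcript: str       # what the doctor says in the clip
    speech_act: str       # "Consent", "Instruction", or "Other"
    hand_keypoints: list  # per-frame hand landmark coordinates

clip = AnnotatedClip(
    video_id="example_video",
    start_s=12.4,
    end_s=15.0,
    transcript="I need your permission to proceed.",
    speech_act="Consent",
    hand_keypoints=[[(0.41, 0.62), (0.44, 0.60)]],
)
print(clip.speech_act)
```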

4. The Two-Mode System

The robot has two ways of moving, depending on what the AI hears:

  • Mode A: The Mirror (Human-Mimic)
    If the AI hears a Consent or Instruction sentence, the robot switches to "Mirror Mode." It looks at the video of the human speaker, grabs their hand movements, and copies them exactly.

    • Analogy: It's like a dance partner who perfectly mirrors your steps. If the doctor raises a hand to say "Stop," the robot raises its hand too. This feels very natural and human.
  • Mode B: The Improviser (Speech-Gesture Generation)
    If the AI hears normal conversation (not a specific instruction or consent), the robot uses a different AI to invent a gesture that fits the mood.

    • Analogy: This is like a jazz musician. If the conversation is casual, the robot adds a little nod or a wave to keep the rhythm going, even if no one told it exactly what to do.
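The routing between the two modes can be sketched as a simple dispatch on the detected speech act: Consent and Instruction trigger Mirror Mode, everything else falls through to the generative mode. The function names below are illustrative placeholders, not the paper's API.

```python
# Hedged sketch of the two-mode dispatch. The real system replays
# recorded human motion (Mode A) or runs a speech-gesture generation
# model (Mode B); these stubs stand in for those components.

MIMIC_ACTS = {"Consent", "Instruction"}

def mimic_gesture(utterance: str) -> str:
    """Mode A: replay the human speaker's recorded hand movements."""
    return f"replaying recorded human gesture for: {utterance!r}"

def generate_gesture(utterance: str) -> str:
    """Mode B: synthesize a co-speech gesture that fits the utterance."""
    return f"synthesizing co-speech gesture for: {utterance!r}"

def select_gesture(speech_act: str, utterance: str) -> str:
    """Route to the Mirror (Mode A) or the Improviser (Mode B)."""
    if speech_act in MIMIC_ACTS:
        return mimic_gesture(utterance)
    return generate_gesture(utterance)

print(select_gesture("Consent", "May I examine your arm?"))
print(select_gesture("Other", "How was your weekend?"))
```

The design choice here is that the safety-critical speech acts (consent requests, physical instructions) get faithful human motion, while low-stakes small talk is allowed looser, generated movement.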

5. The Results: Does It Work?

They tested this on a Pepper robot (a friendly, humanoid robot often used in research) with 26 human volunteers.

  • The "Human" Test: People watched videos of the robot moving. When the robot was copying real human gestures (Mode A), people rated it as more human-like than when it was generating gestures on its own. It felt less like a machine and more like a person.
  • The "Appropriateness" Test: People also checked if the gestures matched the words. The robot did just as well as the best existing systems at making sure the gestures made sense.
  • The Privacy Win: Because everything runs locally on the robot, it uses very little memory and keeps all patient data safe inside the machine.

The Big Picture

Think of this system as giving the robot a soulful body. Before, medical robots were like telephones—good for hearing, bad for feeling. This new system allows the robot to use its "body language" to build trust, reduce anxiety, and make sure patients truly understand their care, all while keeping their private medical secrets safe in a locked box on the robot itself.

It's a step toward robots that don't just translate words, but translate human connection.