COACH meets QUORUM: A Framework and Pipeline for Aligning User, Expert and Developer Perspectives in LLM-generated Health Counselling

This paper introduces QUORUM, a unified evaluation framework, and COACH, an LLM-driven pipeline, for generating and assessing personalized health counselling for cancer patients. The authors show that while stakeholders converge on the system's relevance and quality, they diverge on nuances such as tone and error sensitivity, highlighting the need for multi-perspective evaluation in trustworthy patient-centered NLP systems.

Yee Man Ng, Bram van Dijk, Pieter Beynen, Otto Boekesteijn, Joris Jansen, Gerard van Oortmerssen, Max van Duijn, Marco Spruit

Published Tue, 10 Ma

Imagine you have a smart health diary app that tracks your sleep, mood, and energy levels. You ask it, "Why am I so tired?" or "How can I sleep better?" and it gives you advice.

This paper is about building a system to make sure that advice is helpful, accurate, and safe, especially for people dealing with serious health issues like cancer. The authors created two main things to solve this: a new way to test the system (called QUORUM) and the system itself (called COACH).

Here is the breakdown in simple terms:

1. The Problem: The "Three-Legged Stool"

When building a health AI, you usually have three different groups of people who care about the result, but they all look at it through different lenses:

  • The User (You): "Does this advice feel like it's for me? Is it nice to read? Will I actually do it?"
  • The Expert (The Doctor): "Is the medical information 100% correct? Is the tone appropriate for a sick person?"
  • The Developer (The Builder): "Did the computer make up facts? Did it read the data correctly? Is the code working?"

The Analogy: Imagine building a house.

  • The User is the family living there: "Is the kitchen big enough? Is the paint color cheerful?"
  • The Expert is the building inspector: "Is the foundation solid? Are the wires up to code?"
  • The Developer is the architect: "Did we use the right blueprints? Did the walls go up straight?"

Usually, these groups work separately. This paper says, "No, we need to bring all three into the room at the same time to see if the house is actually livable."

2. The Solution: QUORUM (The "All-Seeing Eye")

The authors created a framework called QUORUM (Quality, Outcome, Reliability, and User-relevance). Think of QUORUM as a specialized scorecard that asks all three groups to grade the AI's advice at the same time.

Instead of just asking the developer if the code works, QUORUM asks:

  • User: "Did this make you feel understood?"
  • Expert: "Is the medical advice safe?"
  • Developer: "Did the AI hallucinate (make things up)?"

This ensures the final product is not just technically sound but also genuinely helpful to real people.
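The idea of grading one response from all three perspectives at once can be sketched as a small data structure. This is a hypothetical illustration, not the paper's actual rubric: the class name, dimension names, and the completeness check are all assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical QUORUM-style scorecard: one record per AI response, with
# ratings collected from each stakeholder perspective. Dimension names
# ("felt_understood", "medically_safe", ...) are illustrative only.
@dataclass
class QuorumScorecard:
    response_id: str
    user_ratings: dict = field(default_factory=dict)      # e.g. {"felt_understood": 4}
    expert_ratings: dict = field(default_factory=dict)    # e.g. {"medically_safe": 5}
    developer_checks: dict = field(default_factory=dict)  # e.g. {"grounded": True}

    def perspectives_complete(self) -> bool:
        """A response counts as evaluated only once all three groups weigh in."""
        return all([self.user_ratings, self.expert_ratings, self.developer_checks])

card = QuorumScorecard("resp-001")
card.user_ratings["felt_understood"] = 4
card.expert_ratings["medically_safe"] = 5
card.developer_checks["grounded"] = True
print(card.perspectives_complete())  # → True
```

The point of the structure is simply that no single field can stand in for the others: a response with perfect developer checks but no user ratings is still incomplete.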

3. The Tool: COACH (The "Smart Butler")

They built a specific AI system called COACH for cancer patients using a "Healthy Chronos" app.

  • How it works: When a user asks a question, COACH acts like a super-organized librarian.
    1. It looks at the user's diary (e.g., "I slept 4 hours last night").
    2. It goes to a trusted medical library (a database of cancer info) to find facts about sleep and cancer.
    3. It mixes the user's personal story with the medical facts to write a custom letter of advice.
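The three steps above amount to a retrieve-then-generate loop. Here is a minimal sketch of that flow; the function names, the dictionary-based "medical library", and the string templating are all assumptions standing in for the real retriever and LLM prompt.

```python
# Illustrative retrieve-then-generate flow, not the paper's actual code.

def retrieve_facts(knowledge_base: dict, topic: str) -> list:
    """Look up vetted medical facts on a topic (stands in for a real retriever)."""
    return knowledge_base.get(topic, [])

def compose_advice(diary_entry: str, facts: list) -> str:
    """Blend the user's diary context with retrieved facts into one answer.
    A real system would pass both into an LLM prompt; here we just template."""
    grounding = "; ".join(facts) if facts else "no matching guidance found"
    return f"Given that you reported '{diary_entry}': {grounding}."

knowledge_base = {
    "sleep": ["keep a regular bedtime", "limit caffeine after noon"],
}
diary_entry = "I slept 4 hours last night"
print(compose_advice(diary_entry, retrieve_facts(knowledge_base, "sleep")))
```

The design choice worth noticing is that the advice is constrained to facts pulled from the trusted library, rather than letting the model answer from memory alone.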

4. What They Found (The Results)

They tested COACH with real users, medical experts, and developers. Here is what happened:

  • The Good News (Convergence): Everyone agreed the advice was relevant, high-quality, and reliable.

    • Users felt the advice matched their lives.
    • Experts said the medical facts were correct.
    • Developers confirmed the AI rarely made up data.
    • Analogy: It's like a chef, a food critic, and a customer all agreeing the soup tastes great.
  • The Bad News (Divergence): They disagreed on tone and sensitivity.

    • The Tone Clash: Experts thought the AI was sometimes too blunt or "condescending" (talking down to the patient). The users, however, thought the tone was fine!
    • The "Hallucination" Blindspot: The developers found that 22% of the time, the AI added small details that weren't strictly in the medical database (like suggesting "eat nuts" when the database only said "eat protein"). The developers called this a "hallucination" (a mistake). But the experts and users didn't mind; they thought it was just helpful common sense.
    • Analogy: The developer is like a strict grammar teacher who marks down a sentence for a tiny error. The user is just happy the sentence makes sense and helps them.
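The developer's strict view of "hallucination" can be made concrete with a toy grounding check: flag any advice item that no source fact supports. The substring matching below is a deliberate oversimplification of how such a check might work; nothing here is from the paper's implementation.

```python
# Toy version of a strict developer-side grounding check: an advice item
# with no supporting source fact gets flagged, even if it is sensible.
# Substring matching is an assumed, oversimplified notion of "support".

def flag_ungrounded(advice_items: list, source_facts: list) -> list:
    """Return advice items not backed by any source fact."""
    return [item for item in advice_items
            if not any(item in fact or fact in item for fact in source_facts)]

source_facts = ["eat protein", "sleep 7-8 hours"]
advice_items = ["eat protein", "eat nuts"]
print(flag_ungrounded(advice_items, source_facts))  # → ['eat nuts']
```

By this yardstick "eat nuts" is a hallucination even though it is a reasonable way to eat protein, which is exactly the sensitivity gap the paper observed between developers and the other two groups.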

5. The Big Takeaway

The paper proves that you cannot build a trustworthy health AI by only listening to the engineers.

If you only listen to the developers, you might get a technically perfect system that feels cold and robotic to patients. If you only listen to users, you might get a friendly system that gives dangerous medical advice.

The Conclusion: To build AI that truly helps people, you need a "committee" approach. You need the User (for empathy), the Expert (for safety), and the Developer (for accuracy) to all have a seat at the table. When they all agree, you know you have a system that is ready for the real world.