COACH meets QUORUM: A Framework and Pipeline for Aligning User, Expert and Developer Perspectives in LLM-generated Health Counselling

This paper introduces QUORUM, a unified evaluation framework, and COACH, an LLM-driven pipeline, for generating and assessing personalized health counselling for cancer patients. The authors show that while stakeholders converge on the system's relevance and quality, they diverge on nuances such as tone and error sensitivity, highlighting the need for multi-perspective evaluation in trustworthy patient-centered NLP systems.

Yee Man Ng, Bram van Dijk, Pieter Beynen, Otto Boekesteijn, Joris Jansen, Gerard van Oortmerssen, Max van Duijn, Marco Spruit

Published Tue, 10 Ma

Imagine you have a smart health diary app that tracks your sleep, mood, and energy levels. You ask it, "Why am I so tired?" or "How can I sleep better?" and it gives you advice.

This paper is about building a system to make sure that advice is helpful, accurate, and safe, especially for people dealing with serious health issues like cancer. The authors created two main things to solve this: a new way to test the system (called QUORUM) and the system itself (called COACH).

Here is the breakdown in simple terms:

1. The Problem: The "Three-Legged Stool"

When building a health AI, you usually have three different groups of people who care about the result, but they all look at it through different lenses:

  • The User (You): "Does this advice feel like it's for me? Is it nice to read? Will I actually do it?"
  • The Expert (The Doctor): "Is the medical information 100% correct? Is the tone appropriate for a sick person?"
  • The Developer (The Builder): "Did the computer make up facts? Did it read the data correctly? Is the code working?"

The Analogy: Imagine building a house.

  • The User is the family living there: "Is the kitchen big enough? Is the paint color cheerful?"
  • The Expert is the building inspector: "Is the foundation solid? Are the wires up to code?"
  • The Developer is the architect: "Did we use the right blueprints? Did the walls go up straight?"

Usually, these groups work separately. This paper says, "No, we need to bring all three into the room at the same time to see if the house is actually livable."

2. The Solution: QUORUM (The "All-Seeing Eye")

The authors created a framework called QUORUM (Quality, Outcome, Reliability, and User-relevance). Think of QUORUM as a specialized scorecard that asks all three groups to grade the AI's advice at the same time.

Instead of just asking the developer if the code works, QUORUM asks:

  • User: "Did this make you feel understood?"
  • Expert: "Is the medical advice safe?"
  • Developer: "Did the AI hallucinate (make things up)?"

This ensures the final product is not just technically sound but also genuinely helpful to real people.
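The idea of grading one response from all three perspectives at once can be sketched as a small data structure. This is a hypothetical illustration, not the paper's actual rubric: the class name, dimension names, and the completeness check are all assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical QUORUM-style scorecard: one record per AI response, with
# ratings collected from each stakeholder perspective. Dimension names
# ("felt_understood", "medically_safe", ...) are illustrative only.
@dataclass
class QuorumScorecard:
    response_id: str
    user_ratings: dict = field(default_factory=dict)      # e.g. {"felt_understood": 4}
    expert_ratings: dict = field(default_factory=dict)    # e.g. {"medically_safe": 5}
    developer_checks: dict = field(default_factory=dict)  # e.g. {"grounded": True}

    def perspectives_complete(self) -> bool:
        """A response counts as evaluated only once all three groups weigh in."""
        return all([self.user_ratings, self.expert_ratings, self.developer_checks])

card = QuorumScorecard("resp-001")
card.user_ratings["felt_understood"] = 4
card.expert_ratings["medically_safe"] = 5
card.developer_checks["grounded"] = True
print(card.perspectives_complete())  # → True
```

The point of the structure is simply that no single field can stand in for the others: a response with perfect developer checks but no user ratings is still incomplete.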

3. The Tool: COACH (The "Smart Butler")

They built a specific AI system called COACH for cancer patients using a "Healthy Chronos" app.

  • How it works: When a user asks a question, COACH acts like a super-organized librarian.
    1. It looks at the user's diary (e.g., "I slept 4 hours last night").
    2. It goes to a trusted medical library (a database of cancer info) to find facts about sleep and cancer.
    3. It mixes the user's personal story with the medical facts to write a custom letter of advice.
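The three steps above amount to a retrieve-then-generate loop. Here is a minimal sketch of that flow; the function names, the dictionary-based "medical library", and the string templating are all assumptions standing in for the real retriever and LLM prompt.

```python
# Illustrative retrieve-then-generate flow, not the paper's actual code.

def retrieve_facts(knowledge_base: dict, topic: str) -> list:
    """Look up vetted medical facts on a topic (stands in for a real retriever)."""
    return knowledge_base.get(topic, [])

def compose_advice(diary_entry: str, facts: list) -> str:
    """Blend the user's diary context with retrieved facts into one answer.
    A real system would pass both into an LLM prompt; here we just template."""
    grounding = "; ".join(facts) if facts else "no matching guidance found"
    return f"Given that you reported '{diary_entry}': {grounding}."

knowledge_base = {
    "sleep": ["keep a regular bedtime", "limit caffeine after noon"],
}
diary_entry = "I slept 4 hours last night"
print(compose_advice(diary_entry, retrieve_facts(knowledge_base, "sleep")))
```

The design choice worth noticing is that the advice is constrained to facts pulled from the trusted library, rather than letting the model answer from memory alone.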

4. What They Found (The Results)

They tested COACH with real users, medical experts, and developers. Here is what happened:

  • The Good News (Convergence): Everyone agreed the advice was relevant, high-quality, and reliable.

    • Users felt the advice matched their lives.
    • Experts said the medical facts were correct.
    • Developers confirmed the AI rarely made up data.
    • Analogy: It's like a chef, a food critic, and a customer all agreeing the soup tastes great.
  • The Bad News (Divergence): They disagreed on tone and sensitivity.

    • The Tone Clash: Experts thought the AI was sometimes too blunt or "condescending" (talking down to the patient). The users, however, thought the tone was fine!
    • The "Hallucination" Blindspot: The developers found that 22% of the time, the AI added small details that weren't strictly in the medical database (like suggesting "eat nuts" when the database only said "eat protein"). The developers called this a "hallucination" (a mistake). But the experts and users didn't mind; they thought it was just helpful common sense.
    • Analogy: The developer is like a strict grammar teacher who marks down a sentence for a tiny error. The user is just happy the sentence makes sense and helps them.
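The developer's strict view of "hallucination" can be made concrete with a toy grounding check: flag any advice item that no source fact supports. The substring matching below is a deliberate oversimplification of how such a check might work; nothing here is from the paper's implementation.

```python
# Toy version of a strict developer-side grounding check: an advice item
# with no supporting source fact gets flagged, even if it is sensible.
# Substring matching is an assumed, oversimplified notion of "support".

def flag_ungrounded(advice_items: list, source_facts: list) -> list:
    """Return advice items not backed by any source fact."""
    return [item for item in advice_items
            if not any(item in fact or fact in item for fact in source_facts)]

source_facts = ["eat protein", "sleep 7-8 hours"]
advice_items = ["eat protein", "eat nuts"]
print(flag_ungrounded(advice_items, source_facts))  # → ['eat nuts']
```

By this yardstick "eat nuts" is a hallucination even though it is a reasonable way to eat protein, which is exactly the sensitivity gap the paper observed between developers and the other two groups.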

5. The Big Takeaway

The paper proves that you cannot build a trustworthy health AI by only listening to the engineers.

If you only listen to the developers, you might get a technically perfect system that feels cold and robotic to patients. If you only listen to users, you might get a friendly system that gives dangerous medical advice.

The Conclusion: To build AI that truly helps people, you need a "committee" approach. You need the User (for empathy), the Expert (for safety), and the Developer (for accuracy) to all have a seat at the table. When they all agree, you know you have a system that is ready for the real world.