Imagine you are a primary care doctor. You have 15 minutes to see a patient. The patient is talking about their headache, their sleep, and their stress. At the same time, you are trying to remember a specific medical guideline about sleep apnea, check the patient's electronic health record, and figure out if they need a new medication. Your brain is juggling a dozen balls at once. This is the reality of modern medicine: too much information, too little time.
This paper proposes a solution: an "Ambient AI Co-Pilot" that sits quietly in the room, listens to the conversation, and whispers helpful questions to the doctor.
Here is a breakdown of the researchers' work, explained with simple analogies:
1. The Problem: The "Library in a Hurricane"
Doctors are trained to use Evidence-Based Medicine (EBM). Think of this as a massive, perfect library of medical rules and guidelines. In an ideal world, a doctor would open the library, find the exact page for a patient's specific problem, and follow the instructions.
But in reality, the doctor is in a hurricane. They can't stop the conversation to read a 50-page document. They often guess or rely on memory, which can lead to missed diagnoses or inconsistent care.
2. The Solution: The "Smart Sous-Chef"
The researchers built an AI system that acts like a smart sous-chef in a busy kitchen.
- The Chef (Doctor): Focuses on cooking the meal (talking to the patient and making the diagnosis).
- The Sous-Chef (AI): Listens to the conversation. If the Chef is making a soup and mentions "it tastes a bit salty," the Sous-Chef doesn't take over the stove. Instead, they quietly slide a sticky note onto the counter that says: "Hey, did you check the guidelines for sodium limits in patients with high blood pressure?"
The AI doesn't answer the question for the doctor; it asks the right question to help the doctor remember what they need to look up.
3. How the AI Works: The "Three-Step Dance"
The researchers tested two ways to make this AI work. They found that a "smart" approach was much better than a "dumb" one.
- The "Dumb" Way (Zero-Shot): You just ask the AI, "Listen to this chat and give me three questions." It's like asking a random person to read a complex legal document and summarize it instantly. It might get the gist, but it often misses the nuance or hallucinates (makes things up).
- The "Smart" Way (Multi-Stage Reasoning): This is the method the paper champions. It's a three-step dance:
- The Scribe (Summarizer): First, the AI listens to the messy, chatty conversation and writes a clean, structured medical note (like a "SOAP" note: Subjective, Objective, Assessment, Plan). It filters out the "How's the weather?" small talk and keeps the medical facts.
- The Detective (Generator): Next, a second AI looks at that clean note and asks, "What are the tricky parts here? What guidelines might apply?" It generates 10 potential questions.
- The Editor (Evaluator): Finally, a third AI acts like a strict editor. It reviews the 10 questions, picks the top 3, and throws away the bad ones. It ensures the questions are safe, relevant, and actually useful.
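To make this three-step dance concrete, here is a minimal Python sketch of how such a pipeline could be wired together. The `call_llm` helper, the prompt wording, and the function names are assumptions for illustration; they are not the paper's actual implementation.

```python
# Hypothetical sketch of the Scribe -> Detective -> Editor pipeline.
# `call_llm` is a placeholder for whatever chat-completion API the real system uses.

def call_llm(prompt: str) -> str:
    """Placeholder: send a prompt to a large language model and return its reply."""
    raise NotImplementedError("Wire this to your LLM provider of choice.")

def summarize(transcript: str) -> str:
    """Stage 1, the Scribe: turn a messy conversation into a structured SOAP-style note."""
    return call_llm(
        "Summarize this doctor-patient conversation as a SOAP note. "
        "Ignore small talk; keep only clinically relevant facts.\n\n" + transcript
    )

def generate_candidates(note: str, n: int = 10) -> list[str]:
    """Stage 2, the Detective: propose candidate guideline-related questions."""
    reply = call_llm(
        f"Given this clinical note, list {n} questions the clinician may want to check "
        "against evidence-based guidelines, one per line.\n\n" + note
    )
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()]

def select_best(note: str, candidates: list[str], k: int = 3) -> list[str]:
    """Stage 3, the Editor: keep only the safest, most relevant questions."""
    reply = call_llm(
        f"From the candidate questions below, pick the {k} that are safest and most useful "
        "for this note, one per line.\n\n"
        f"NOTE:\n{note}\n\nCANDIDATES:\n" + "\n".join(candidates)
    )
    return [line.strip("- ").strip() for line in reply.splitlines() if line.strip()][:k]

def copilot_questions(transcript: str) -> list[str]:
    """Full pipeline: raw transcript in, three suggested questions out."""
    note = summarize(transcript)
    candidates = generate_candidates(note)
    return select_best(note, candidates)
```

For comparison, the "dumb" zero-shot baseline would be a single `call_llm(...)` over the raw transcript, with no intermediate note and no editor pass.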
4. The Experiment: The "Taste Test"
The researchers didn't just guess if this worked. They got six experienced doctors to play a game.
- They gave the doctors 80 real (but anonymized) patient recordings.
- They showed the doctors the recordings at different stages of completion: 30%, 70%, and 100% (a rough sketch of this setup follows the list).
- They asked the doctors to rate the AI's questions on a scale of 1 to 7.
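For a rough idea of how those partial transcripts could be prepared, here is a small Python sketch. The 30/70/100% stages and the 1-to-7 scale come from the study; cutting by word count and the function names are assumptions for illustration.

```python
# Illustrative sketch of the study setup: each recording is cut at 30%, 70%,
# and 100% of its length, and each suggested question is rated on a 1-7 scale.
# Cutting by word count is an assumption; the study's exact method may differ.

STAGES = (0.3, 0.7, 1.0)  # fractions of the visit the AI has "heard" so far

def truncate(transcript: str, fraction: float) -> str:
    """Return roughly the first `fraction` of the conversation, by word count."""
    words = transcript.split()
    return " ".join(words[: max(1, int(len(words) * fraction))])

def rate_question(question: str, reviewer: str) -> int:
    """Placeholder: a physician rates one suggested question from 1 (poor) to 7 (excellent)."""
    raise NotImplementedError("In the study, six physicians supplied these ratings.")

# Example: the three partial views of a single (made-up) visit.
visit = "Doctor: What brings you in today? Patient: I've been getting headaches..."
partial_transcripts = {f"{int(f * 100)}%": truncate(visit, f) for f in STAGES}
```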
The Results:
- The AI is helpful: The doctors agreed that the AI's questions were generally useful and relevant. It felt like having a knowledgeable colleague in the room.
- Timing matters: The AI worked well even when it only heard 30% of the conversation. This is huge because it means the AI can give a hint early in the visit, not just at the end.
- The "Smart" Way wins: The multi-step "Scribe-Detective-Editor" method produced much safer and more accurate questions than the "Dumb" direct approach.
- The "Robot Judge" isn't perfect: The researchers tried using another AI to grade the questions (an "AI Judge"). While the AI Judge agreed with the humans on which method was better, it was too optimistic. It gave high scores to things the human doctors thought were risky. Human experts are still the gold standard for safety.
5. The Catch: It's Not Ready for Prime Time Yet
The paper is honest about the limitations:
- Cost: Having real doctors review the AI for 90 hours cost over $10,000. We can't do that for every patient visit.
- Speed: The "Smart" method takes about 60 seconds to generate questions. In a real clinic, you need answers in seconds, not minutes.
- Privacy: Listening to patient conversations requires strict privacy rules.
The Bottom Line
This paper proves that AI can be a great "question asker" for doctors. It can help reduce the mental load of remembering thousands of medical rules.
Think of it like a GPS for medical guidelines. You don't want the GPS to drive the car for you (that's dangerous), but you definitely want it to say, "Hey, there's a speed limit change coming up in 500 feet," so you don't get a ticket. This system is learning to be that GPS, helping doctors stay on the right path without slowing down the journey.