Dutch Metaphor Extraction from Cancer Patients' Interviews and Forum Data using LLMs and Human in the Loop

Imagine you are trying to understand a patient's experience with cancer. Sometimes, the medical terms are too cold or technical. Instead, patients often speak in pictures. They might say their body is a "battlefield," their treatment is a "long journey," or their tumor is a "weed" that won't die. These pictures are called metaphors.

This paper is about a team of researchers who wanted to collect these pictures from Dutch cancer patients to help doctors and families understand them better. But there was a problem: the interviews ran to thousands of words each, there was a pile of forum posts on top of that, and combing through all of it by hand would take forever.

So, the researchers decided to hire a digital detective (a type of AI called a large language model, or LLM) to do the heavy lifting. Here is how they did it, explained simply:

1. The Mission: Finding Hidden Gems

The researchers had two big piles of "gold dust" (data):

Pile A: Transcripts of real-life interviews where patients talked about their feelings.
Pile B: 100 blog posts and comments from kanker.nl, a Dutch cancer website (like a digital support group).

Their goal was to find every metaphor hidden inside these texts. But this was like looking for needles in a haystack, and at first the AI was a bit of a daydreamer.

2. The Problem: The AI's "Imagination"

When the researchers first asked the AI, "Find the metaphors," the AI got a bit too creative. It made three main mistakes:

The Hallucinator: Sometimes, the AI invented metaphors that weren't actually there. It was like a student making up a quote for a history essay because they didn't want to admit they forgot the facts.
The Idiom Confuser: The AI couldn't tell the difference between a real metaphor and a common phrase. If a patient used a stock saying along the lines of "It's raining cats and dogs," the AI might think that's a deep metaphor about animals, when it's just a standard saying.
The Summarizer: Instead of pulling out the exact words the patient used, the AI would rewrite the story. It was like asking someone to copy a painting, but they just wrote a description of the painting instead.

3. The Solution: The "Human-in-the-Loop" Team

To fix the AI's daydreaming, the researchers built a training camp for the AI. They didn't just let the AI guess; they gave it a strict rulebook and a human supervisor.

The Rulebook (Prompting): They taught the AI how to think step-by-step (like a detective solving a crime). They said, "Don't guess. Look at the exact sentence. Is it a real picture? Is it a common phrase? Prove it."
The Supervisor (Human Experts): After the AI found a list of metaphors, three human experts (linguists who speak Dutch) acted as the "quality control inspectors." They checked every single one. If the AI was wrong, they crossed it out.

4. The Result: The "HealthQuote.NL" Treasure Chest

After all this training and checking, the team created a special collection called HealthQuote.NL.

They found 130 validated metaphors.
These weren't just random words; they were organized like a library. Some were about Journeys (life as a road), some about Battles (fighting the disease), and some about Nature (tumor as a weed).
They even found some very creative ones, like a patient comparing their body to a damaged car that needs new bodywork, or cancer as an uninvited party that won't leave.

5. Why Does This Matter?

Think of this collection as a dictionary of feelings.

For Doctors: If a doctor knows that a patient sees their treatment as a "journey," they can use that language to explain the next step, making the patient feel understood rather than confused.
For Researchers: It helps us understand how people in the Netherlands process the trauma of cancer, which might be different from people in the UK or the US.
For the Future: It proves that we can use AI to listen to patients, as long as we have humans to double-check the work.

The Bottom Line

The researchers built a bridge between human emotion and machine speed. They taught an AI to listen to the poetry of cancer patients, fixed its mistakes with human eyes, and created a tool that helps healthcare become more personal, empathetic, and clear. It's not just about data; it's about helping people feel heard.

Here is a detailed technical summary of the paper "Dutch Metaphor Extraction from Cancer Patients' Interviews and Forum Data using LLMs and Human in the Loop."

1. Problem Statement

Metaphors are crucial for communication between cancer patients and clinicians, often helping patients conceptualize their illness (e.g., as a "journey" or "battle"). However, there is a significant lack of resources for Dutch-language metaphors in the healthcare domain. Existing metaphor identification procedures (like MIP) are symbolic and word-by-word, making them difficult to scale. Furthermore, while Large Language Models (LLMs) show promise in Natural Language Processing (NLP), their application to extracting metaphors from Dutch patient narratives (both spoken interviews and written forums) remains unexplored.

The core challenges identified include:

Hallucination: LLMs generating metaphors not present in the text.
Confusion with Idioms: Mistaking conventional idioms or figurative language for genuine cross-domain conceptual metaphors.
Abstraction vs. Extraction: Models paraphrasing or summarizing metaphors rather than extracting the exact original text.
Privacy: The need to process sensitive patient data without exposing original text, requiring local processing or strict anonymization.

2. Methodology

The authors developed a Human-in-the-Loop (HITL) extraction framework combining structured prompting, automatic verification, and expert linguistic validation.
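The automatic-verification stage of such a framework can be pictured as a filter that sits between the model and the human reviewers. The following is a minimal sketch, not the authors' implementation: the names `Candidate`, `is_verbatim`, and `route` are hypothetical, the whitespace and case normalization is an assumption, and the Dutch sentence in the usage note is invented for illustration rather than taken from patient data.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    quote: str      # text the model claims to have extracted
    document: str   # source transcript or forum post it came from


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_verbatim(candidate: Candidate) -> bool:
    """Automatic check: the quote must occur word for word in the source."""
    quote = normalize(candidate.quote)
    return bool(quote) and quote in normalize(candidate.document)


def route(candidates: list[Candidate]) -> tuple[list[Candidate], list[Candidate]]:
    """Split candidates into (sent to human review, rejected automatically)."""
    keep = [c for c in candidates if is_verbatim(c)]
    drop = [c for c in candidates if not is_verbatim(c)]
    return keep, drop
```

For a source such as "Mijn lichaam voelt als een slagveld sinds de chemo." ("My body has felt like a battlefield since the chemo"), the quote "als een slagveld" would pass to the reviewers, while a paraphrase such as "kanker is een oorlog" would be rejected as not present in the text. A check like this only addresses hallucination and paraphrasing; judging whether a verbatim quote is a genuine metaphor is still left to the experts.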

Data Sources

Interview Data: Transcripts of 13 oncology interviews involving patients, significant others, and researchers (one document per interview, roughly 5,000 to 13,000 words each).
Forum Data: 100 blog posts and comments from kanker.nl (a Dutch cancer support website) covering breast, prostate, and melanoma cancers.

Model Configuration

LLMs: A suite of open-source, local models (via Ollama) to ensure data privacy, including qwen3:8b, gemma3 (12b/27b), llama3.1:8b, mistral:7b, deepseek-r1:8b, and domain-specific models (meditron, medllama2).
Prompting Strategies:
1. Instruction Prompt (I.inP): Basic role-playing without Chain-of-Thought (CoT).
2. Refined Prompt v1 (RP-v1): Added CoT and few-shot examples (3 simple metaphors).
3. Refined Prompt v2 (RP-v2): Added CoT and the full English Metaphor Menu (17 categories) as inserted knowledge to guide categorization.
4. Automatic Verification (Auto-Verify): An external checklist requiring the LLM to locate the exact source text, identify the speaker, and distinguish metaphors from literal terms.
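To make the RP-v1 idea concrete, here is a hedged sketch of how a CoT prompt with three few-shot examples might be assembled and wrapped for a local Ollama model. The prompt wording, the three Dutch example pairs, and the helper names `build_prompt` and `ollama_request` are all assumptions made for illustration; the paper's actual prompts are the ones released by the authors. The request body uses the `model`, `prompt`, and `stream` fields of Ollama's generate endpoint.

```python
import json

# Invented illustrative pairs (sentence, expected extraction); not the paper's examples.
FEW_SHOT = [
    ("De behandeling was een lange reis.", "een lange reis"),          # a long journey
    ("Mijn lichaam is een slagveld.", "een slagveld"),                 # a battlefield
    ("De tumor is onkruid dat blijft terugkomen.", "onkruid"),         # a weed
]


def build_prompt(text: str) -> str:
    """Assemble a step-by-step instruction, the few-shot examples, and the input."""
    examples = "\n".join(
        f'Text: "{sentence}"\nMetaphor: "{quote}"' for sentence, quote in FEW_SHOT
    )
    return (
        "You are a linguist annotating Dutch cancer narratives.\n"
        "Think step by step:\n"
        "1. Locate the exact sentence.\n"
        "2. Decide whether it maps one domain onto another, "
        "or is only an idiom or a literal term.\n"
        "3. Copy the wording exactly; do not paraphrase.\n\n"
        f"Examples:\n{examples}\n\n"
        f'Text: "{text}"\nMetaphor:'
    )


def ollama_request(model: str, prompt: str) -> dict:
    """JSON body for a non-streaming call to a locally running Ollama server."""
    return {"model": model, "prompt": prompt, "stream": False}


body = ollama_request("llama3.1:8b", build_prompt("Het voelt als een storm."))
print(json.dumps(body, ensure_ascii=False)[:80])
```

Because the model runs locally, a body like this would be posted to the Ollama server on the same machine (by default `http://localhost:11434/api/generate`), so the patient text never leaves it. That is the privacy property the local-model setup is meant to provide.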

Evaluation Process

Human Validation: Three native Dutch speakers with PhDs in computational linguistics or health communication independently reviewed candidate metaphors.
Criteria:
- Faithfulness: Is the metaphor explicitly in the text?
- Metaphoricity: Is it a genuine cross-domain mapping (vs. idiom/literal)?
- Contextual Appropriateness: Does it fit the original context?
Metrics: Precision was calculated as $\frac{\text{Validated Metaphors}}{\text{Total Generated Candidates}}$.
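As a quick arithmetic check, the precision definition can be applied to the candidate counts reported in the Results section. Note the assumption: the summary gives only the candidate totals and the percentages, so the validated counts below (41, 24, 24) are back-computed as the only whole numbers that reproduce the reported figures. They are not stated in the source.

```python
def precision(validated: int, candidates: int) -> float:
    """Share of generated candidates that the human experts validated."""
    if candidates <= 0:
        raise ValueError("candidates must be positive")
    if not 0 <= validated <= candidates:
        raise ValueError("validated must lie between 0 and candidates")
    return validated / candidates


# (validated, candidates); validated counts are inferred, see the note above.
runs = {"I.inP": (41, 72), "RP-v1": (24, 38), "RP-v2": (24, 174)}

for name, (validated, candidates) in runs.items():
    print(f"{name}: {precision(validated, candidates):.1%}")
# I.inP: 56.9%
# RP-v1: 63.2%
# RP-v2: 13.8%
```

Precision here says nothing about metaphors the model missed; measuring that would need a fully annotated reference set, which this definition does not require.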

3. Key Contributions

First Study on Dutch Metaphor Extraction: The first investigation into using LLMs for metaphor identification specifically in Dutch cancer narratives.
HealthQuote.NL Dataset: A curated dataset of 130 validated metaphors (65 from interviews, 65 from forums), categorized by type (word, phrase, sentence), source domain (violence, journey, nature, etc.), and function (coping, explanation, emotion).
Extraction Framework: A novel HITL pipeline integrating structured prompting, CoT, and automatic verification to mitigate hallucination.
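Empirical Analysis of Prompting: A detailed comparison of prompting strategies, identifying failure modes (e.g., over-interpretation, idiom confusion) and showing that prompt refinement cuts both ways: CoT with a few simple examples (RP-v1) raised precision over the basic instruction prompt (63.2% vs. 56.9%), whereas inserting the full Metaphor Menu (RP-v2) lowered it to 13.8%.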
Resource Sharing: Release of prompts, code, and synthetic/paraphrased examples on GitHub to support reproducibility while protecting patient privacy.
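The categorization described for HealthQuote.NL suggests a simple record structure per metaphor. The sketch below is one possible shape under stated assumptions: the class and field names are hypothetical, the real dataset's schema may differ, and the sample entry is invented rather than drawn from the dataset.

```python
from dataclasses import asdict, dataclass
from enum import Enum


class Unit(Enum):
    WORD = "word"
    PHRASE = "phrase"
    SENTENCE = "sentence"


class Origin(Enum):
    INTERVIEW = "interview"
    FORUM = "forum"


@dataclass(frozen=True)
class MetaphorEntry:
    text_nl: str        # Dutch wording (paraphrased or synthetic if privacy requires)
    text_en: str        # English rendering
    unit: Unit          # word, phrase, or sentence
    source_domain: str  # e.g. "journey", "violence", "nature"
    function: str       # e.g. "coping", "explanation", "emotion"
    origin: Origin      # interview transcript or forum post


# Invented sample entry for illustration only.
entry = MetaphorEntry(
    text_nl="een lange reis",
    text_en="a long journey",
    unit=Unit.PHRASE,
    source_domain="journey",
    function="coping",
    origin=Origin.FORUM,
)
print(asdict(entry)["source_domain"])  # journey
```

Keeping type and origin as enumerations, while leaving source domain and function as free strings, reflects that the summary lists the former exhaustively and the latter only by example.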

4. Results

Performance of Prompting Strategies:
- Instruction Prompt (I.inP): Generated 72 candidates; 56.9% precision. High rate of hallucinations and abstractive summaries.
- Refined Prompt v1 (RP-v1): Generated 38 candidates; 63.2% precision. The highest precision of the three prompts, from the smallest candidate set. CoT and strict constraints reduced hallucinations.
- Refined Prompt v2 (RP-v2): Generated 174 candidates; 13.8% precision. Although this prompt produced far more candidates, the inclusion of the full English Metaphor Menu introduced significant noise and over-interpretation, leading to many false positives.
Model Comparison: Different models extracted different subsets of metaphors, suggesting that an ensemble of LLMs is beneficial.
Dataset Composition: The final dataset covers diverse source domains (e.g., "The Party," "The Car," "The Lighthouse," "The Storm") and functions (coping, empowerment, prognosis).
Blog Data Pilot: Applying the method to 100 blog posts yielded 65 distinct metaphor instances, with ~10 identified as highly vivid and suitable for a therapeutic "metaphor menu."
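The model-comparison finding, that different models surface different subsets of metaphors, points toward pooling their outputs before human review. A minimal sketch of such pooling follows; the function name `merge_candidates` and the normalization rule are assumptions, the model names are taken from the list of models used, and the quotes are invented examples.

```python
from collections import defaultdict


def merge_candidates(per_model: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each distinct quote to the set of models that proposed it."""
    found_by: dict[str, set[str]] = defaultdict(set)
    for model, quotes in per_model.items():
        for quote in quotes:
            key = " ".join(quote.lower().split())  # ignore case and spacing
            found_by[key].add(model)
    return dict(found_by)


# Invented outputs for illustration only.
outputs = {
    "qwen3:8b": ["een lange reis", "Een  slagveld"],
    "mistral:7b": ["een slagveld"],
}
merged = merge_candidates(outputs)
for quote, models in sorted(merged.items()):
    print(quote, "<-", sorted(models))
# een lange reis <- ['qwen3:8b']
# een slagveld <- ['mistral:7b', 'qwen3:8b']
```

Taking the union keeps every model's unique finds, and the per-quote model set gives reviewers a rough agreement signal. Whether agreement between models actually predicts validity is not something the summary reports.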

5. Significance and Future Work

Clinical Impact: The extracted metaphors provide insights into how Dutch patients conceptualize illness, enabling clinicians to tailor communication strategies and improve patient-centered care.
Linguistic Contribution: The study bridges a gap in Dutch healthcare linguistics, offering a bilingual (Dutch-English) resource that can be used for translation and cross-cultural studies.
Methodological Insight: The paper demonstrates that while LLMs are powerful, they require rigorous human-in-the-loop validation and specific prompting techniques (like CoT) to handle the nuance of metaphor vs. idiom.
Future Directions:
- Expanding the dataset with more blog posts.
- Investigating Multi-Word Expressions (MWEs) where idioms and metaphors overlap.
- Enhancing model interpretability and explainability.
- Formalizing annotation protocols for broader adoption.

In conclusion, this work establishes a robust pipeline for extracting high-quality metaphor data from sensitive healthcare texts, providing a foundational resource for improving communication in oncology care within the Dutch-speaking context.