Imagine you have a very smart, well-read robot friend who is great at chatting and giving advice. You decide to ask this robot for help with your mental health, like asking, "I feel really down, what should I do?"
This paper is like a safety inspection report for that robot. The researchers wanted to see: When does this robot start making things up, and when does it forget to give you the most important safety advice?
Here is the breakdown of their study using simple analogies:
1. The Problem: The "Scripted" vs. The "Real"
Most tests for AI are like driving tests on a closed track. They ask the AI simple, clear questions like, "What are the symptoms of depression?"
- The Reality: Real life is more like driving in a heavy rainstorm with a broken windshield. People in distress don't ask perfect questions. They ramble, they cry, they mix up their words, and they describe their feelings in messy, emotional stories.
- The Gap: The researchers realized that if we only test the AI on the "closed track," we don't know if it will crash when a real person in a crisis asks for help.
2. The Tool: The "UTCO" Recipe
To test the AI properly, the researchers built a special recipe called UTCO. Think of it like a Lego set where they can snap together four different blocks to build a unique question every time:
- U (User): Who is asking? (e.g., a tired mom, a lonely teenager, a worried dad).
- T (Topic): What is the problem? (e.g., anxiety, suicide, relationship stress).
- C (Context): The story behind it. (e.g., "I haven't slept in three days," or "My boss yelled at me").
- O (Tone): The emotion. (e.g., angry, hopeless, confused, or urgent).
They built 2,075 different "stories" using these blocks and asked the AI (Llama 3.3) to answer them.
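For the programmers in the room, here is a minimal sketch of what such a "Lego block" prompt builder might look like in Python. All the block values below are made-up placeholders, not the paper's actual personas or templates, and the real UTCO pipeline is certainly more elaborate than this:

```python
import itertools
import random

# Hypothetical example values for each UTCO block; the paper's actual
# personas, topics, contexts, and tones are not reproduced here.
users = ["a sleep-deprived new mother", "a lonely teenager", "a worried father"]
topics = ["anxiety", "suicidal thoughts", "relationship stress"]
contexts = [
    "I haven't slept in three days.",
    "My boss yelled at me in front of everyone.",
]
tones = ["hopeless", "angry", "confused", "urgent"]

def build_prompt(user: str, topic: str, context: str, tone: str) -> str:
    """Snap the four UTCO blocks together into one test prompt."""
    return (
        f"You are chatting with {user} who is struggling with {topic}. "
        f"They write, in a {tone} tone: \"{context} I feel really down, "
        f"what should I do?\""
    )

# Enumerate every combination of blocks, then sample a test set from it.
all_prompts = [
    build_prompt(u, t, c, o)
    for u, t, c, o in itertools.product(users, topics, contexts, tones)
]
test_set = random.sample(all_prompts, k=min(10, len(all_prompts)))
```

The point of enumerating every combination is that it later lets you ask which block (U, T, C, or O) actually drives the failures, which is exactly the comparison the researchers make below.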
3. The Two Big Mistakes
The researchers looked for two specific ways the AI failed:
- Hallucinations (The "Fake News" Problem): The AI made up facts. It might invent a fake medicine or a fake therapy that doesn't exist. It's like a tour guide pointing at a building and saying, "That's the White House," when it's actually a bakery.
- Omissions (The "Silent Failure" Problem): This was the bigger surprise. The AI gave a nice, empathetic answer but forgot the most important safety rule. For example, if someone says, "I want to hurt myself," the AI might say, "That sounds really hard, have you tried breathing exercises?" but forget to say, "Please call 911 or go to the ER." It's like a lifeguard seeing someone drowning and saying, "You look tired," but forgetting to throw the life preserver.
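To make the two failure categories concrete, here is a toy checker in Python. Real evaluation needs far more than keyword matching, and every name below (the crisis resource list, the treatment whitelist) is a hypothetical stand-in, not the paper's actual scoring method:

```python
# Toy illustration of the two failure categories. This only makes the
# definitions concrete; it is not how the paper actually judged answers.

CRISIS_RESOURCES = ["911", "988", "emergency", "crisis line", "go to the er"]
KNOWN_TREATMENTS = {"cbt", "breathing exercises", "therapy", "counseling"}

def omits_safety_advice(response: str, prompt_is_crisis: bool) -> bool:
    """Omission: a crisis prompt answered with no pointer to urgent help."""
    if not prompt_is_crisis:
        return False
    text = response.lower()
    return not any(resource in text for resource in CRISIS_RESOURCES)

def recommends_unknown_treatment(claimed_treatments: list[str]) -> bool:
    """Hallucination proxy: the response recommends a "treatment" that is
    not on our whitelist of real ones (e.g., an invented "Calm-Down soup")."""
    return any(t.lower() not in KNOWN_TREATMENTS for t in claimed_treatments)
```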
4. The Findings: What Actually Triggers the Mistakes?
The researchers expected that who was asking (the User) would matter most. They predicted that certain kinds of people would be the ones to trip the AI up.
- The Surprise: It didn't matter much who asked.
- The Real Culprit: It mattered how they asked.
- The "Messy Story" Effect: When the prompt was long, emotional, and sounded like a real human rambling (high "Context" and "Tone"), the AI was much more likely to fail.
- The "Crisis" Effect: When the tone was desperate or hopeless, the AI was most likely to omit safety advice. It got so caught up in being "nice" and "empathetic" that it forgot to be "safe."
5. The Analogy of the "Over-Empathetic Waiter"
Imagine a waiter who is trained to be incredibly polite and comforting.
- The Scenario: A customer is crying at the table and says, "I'm so upset I could scream, I don't know what to do!"
- The Hallucination: The waiter might confidently say, "Have you tried the new 'Calm-Down' soup? It's our secret recipe!" (Even though the soup doesn't exist).
- The Omission: The waiter might say, "Oh, that sounds terrible! I'm so sorry you're feeling this way. Here is a napkin." But they forget to call the manager or emergency services, even though the customer may be in real danger. They were so focused on being a "good listener" that they missed the emergency.
6. The Conclusion: What Should We Do?
The paper argues that we need to stop testing AI with short, perfect questions.
- Stress Test the AI: We need to throw messy, emotional, long-winded stories at the AI to see if it breaks.
- Prioritize Safety over "Nice": In mental health, an AI that gives a perfect, empathetic answer but forgets the safety warning is dangerous. The researchers say we should treat "forgetting the safety warning" (Omission) as a bigger failure than "making things up" (Hallucination).
- The Fix: AI systems need a "safety brake." If the AI detects a long, emotional, confused story, it should be programmed to pause and ask, "I hear you are in crisis. Before we talk about feelings, do you need to call a doctor or emergency services?"
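Here is what such a "safety brake" could look like as a thin wrapper around the model call. The crisis detector below is a deliberately crude keyword heuristic for illustration only; a production system would use a trained classifier, and `llm_reply` is just a placeholder for the actual model call (e.g., to Llama 3.3), not a real API:

```python
# Minimal sketch of the "safety brake" idea: screen the incoming message
# and surface safety guidance *before* the model gives an empathetic reply.

CRISIS_SIGNALS = ["hurt myself", "end it all", "can't go on", "kill myself"]

SAFETY_FIRST_REPLY = (
    "I hear that you are in crisis. Before we talk more, do you need to "
    "contact a doctor or emergency services right now?"
)

def looks_like_crisis(message: str) -> bool:
    """Crude heuristic: does the message contain crisis language?"""
    text = message.lower()
    return any(signal in text for signal in CRISIS_SIGNALS)

def answer_with_safety_brake(message: str, llm_reply) -> str:
    """Pause on likely crises instead of passing them straight to the model.

    `llm_reply` is a placeholder callable standing in for whatever
    chat model the system actually uses.
    """
    if looks_like_crisis(message):
        return SAFETY_FIRST_REPLY
    return llm_reply(message)
```

The design choice here mirrors the paper's priority ordering: when in doubt, an interruption that asks about safety first is a cheaper mistake than a warm reply that omits the life preserver.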
In short: The AI isn't failing because of who is talking to it; it's failing because it gets overwhelmed by how people talk when they are in pain. To make these tools safe, we need to teach them to handle the messy, emotional reality of human distress, not just the clean, textbook questions.