Representation Before Retrieval: Structured Patient Artifacts Reduce Hallucination in Clinical AI Systems

Contrary to the prevailing assumption that retrieval-augmented generation (RAG) mitigates hallucinations, this study finds that RAG significantly increases unsupported claims in clinical AI. Converting heterogeneous patient data into structured, provenance-tracked artifacts proves more effective at ensuring factual accuracy and safety.

Scanlin, J., Cuesta, A., Varsavsky, M.

Published 2026-02-16

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a doctor trying to diagnose a patient, but instead of looking at a neat, organized medical chart, you are handed a massive, chaotic pile of papers. This pile includes handwritten notes from 10 years ago, blurry photos of X-rays, data from a smartwatch, and genetic test results, all jumbled together with no clear order.

This is the problem this paper tackles regarding Artificial Intelligence (AI) in healthcare.

The Problem: The "Over-Confident" AI

Currently, we hope AI can act like a super-smart medical assistant. However, these AI models have a nasty habit called "hallucination." This is when the AI makes up facts that sound very convincing but are completely false.

The common belief was that if we gave the AI a "search engine" (called RAG) to look up the patient's real records before answering, it would stop making things up. It's like telling a student, "Don't guess; go look in the textbook first."

The paper's shocking discovery: In the messy world of real medical data, just giving the AI the "textbook" (raw search results) actually made it hallucinate much more often. It's as if the student were handed a library full of books but no table of contents, so they started guessing wildly to connect the dots, creating more nonsense than if they had just relied on their own training.
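To make the "search engine" idea concrete, here is a minimal sketch of the RAG setup described above. It uses naive keyword overlap instead of a real embedding index, and the sample chunks, function names, and query are all illustrative; the paper's actual retrieval method is not specified here.

```python
# Hypothetical raw record fragments, mimicking the "chaotic pile of papers".
raw_chunks = [
    "2016 handwritten note: pt reports intermittent chest pain",
    "smartwatch export: avg HR 71 bpm over 30 days",
    "genetic panel: no pathogenic variants detected",
]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank chunks by how many query words they share."""
    terms = set(query.lower().replace("?", "").split())
    ranked = sorted(
        chunks,
        key=lambda c: len(terms & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]

# The retrieved raw text is pasted into the prompt with no structure,
# which is exactly the failure mode the paper describes.
context = retrieve("does the patient have chest pain?", raw_chunks)
prompt = "Answer using only this context:\n" + "\n".join(context)
print(context[0])
```

Note that the model still receives unlabeled free text; nothing tells it which fragments are current, trustworthy, or related, which is the gap the structured approach below addresses.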

The Solution: Organizing the Chaos

The researchers tried a different approach. Instead of dumping a pile of raw text on the AI, they first acted like a super-organized librarian. They took all the messy data (notes, images, genetics) and turned it into structured, machine-readable "artifacts."

Think of it this way:

  • Raw Text (The Old Way): A messy kitchen counter covered in flour, eggs, and broken shells. You ask the AI to make a cake, and it tries to bake the shells.
  • Structured Artifacts (The New Way): The same ingredients, but they have been pre-measured, cracked, and placed in labeled bowls. The AI just has to mix them.
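The "labeled bowls" idea can be sketched in code. The schema below is purely illustrative (the paper's actual artifact format is not given here): each extracted fact carries its value plus provenance, i.e. which source document it came from and when it was recorded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClinicalFact:
    """Hypothetical provenance-tracked artifact; field names are illustrative."""
    field: str     # what the fact describes, e.g. "hba1c"
    value: str     # the extracted value
    source: str    # which document the fact came from
    recorded: str  # when it was recorded (ISO date)

def to_artifacts(raw_note: str, source_id: str, date: str) -> list[ClinicalFact]:
    """Toy extractor: pull 'key: value' lines out of a free-text note."""
    facts = []
    for line in raw_note.splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            facts.append(ClinicalFact(
                field=key.strip().lower().replace(" ", "_"),
                value=val.strip(),
                source=source_id,
                recorded=date,
            ))
    return facts

note = "HbA1c: 7.2%\nBlood pressure: 128/82"
artifacts = to_artifacts(note, source_id="note_2026_01.txt", date="2026-01-10")
print(artifacts[0].field, artifacts[0].value, artifacts[0].source)
```

The point of the structure is that every fact the AI later cites can be traced back to a `source`, which is what the paper means by a "clear trail of where every fact came from."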

What They Found

The team tested four different ways of asking the AI for help:

  1. The Baseline: The AI guesses on its own. (Result: It made up facts about 5% of the time).
  2. The "Search Engine" (RAG): The AI searches the raw, messy notes. (Result: Disaster! It started making up facts 43% of the time. Giving it more unorganized information confused it).
  3. The "Structured" Approach: The AI uses the pre-organized, labeled data bowls. (Result: Much better! It only made up facts 8% of the time).
  4. The "Agent Workflow": The AI uses the organized data and has a "second pair of eyes" (a verification step) to double-check its work before speaking. (Result: The Winner. This was the safest and most useful method).
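The fourth condition's "second pair of eyes" can be sketched as a grounding check: before the AI speaks, each draft claim is verified against the artifact store, and anything unsupported is flagged instead of emitted. This is a minimal stand-in under assumed data; the paper's actual agent workflow is not detailed here.

```python
# Hypothetical artifact store: field -> (value, source document).
artifact_store = {
    "hba1c": ("7.2%", "lab_report_2026_01.pdf"),
    "allergy": ("penicillin", "intake_note_2024.txt"),
}

# Draft claims the model wants to state, as (field, value) pairs.
draft_claims = [
    ("hba1c", "7.2%"),       # supported by an artifact
    ("allergy", "aspirin"),  # unsupported: the store says penicillin
]

verified, flagged = [], []
for field, value in draft_claims:
    stored = artifact_store.get(field)
    if stored and stored[0] == value:
        verified.append((field, value, stored[1]))  # keep the provenance
    else:
        flagged.append((field, value))              # block before output

print("verified:", verified)
print("flagged:", flagged)
```

Only `verified` claims reach the user, each still attached to its source, which is why this condition was both the safest and the most useful in the study.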

The Big Lesson

The paper concludes that how you present information matters more than just having more information.

If you hand a doctor (or an AI) a chaotic pile of papers, they will get overwhelmed and make mistakes. But if you organize that data into a clear, structured format with a clear trail of where every fact came from, the AI becomes much safer and more reliable.

In short: Don't just give the AI a library; give it a well-organized index card system. The quality of the representation determines how smart the AI can actually be.
