Large-Language Models for data extraction from written kidney biopsy reports

⚕️

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content. Read full disclaimer

Imagine you have a massive library of medical stories written by kidney specialists. These stories (kidney biopsy reports) are incredibly detailed and full of life-saving information, but they are written in a messy, free-flowing narrative style—like handwritten letters from the 19th century. While a human doctor can read them and understand the plot, a computer cannot. It's like trying to feed a handwritten novel into a spreadsheet; the computer just sees a jumble of words, not the data it needs to find patterns, cure diseases, or build better treatments.

This paper is about teaching Artificial Intelligence (AI) to read these messy stories and turn them into neat, organized spreadsheets automatically.

Here is the breakdown of what the researchers did, using some simple analogies:

1. The Problem: The "Unreadable Library"

Kidney biopsies are the gold standard for diagnosing kidney diseases. However, pathologists write their findings in long, complex paragraphs.

The Analogy: Imagine trying to find every mention of "blue cars" in a library of 10,000 novels. You could read every book, but it would take you a lifetime. If you want to study blue cars, you need a way to instantly pull that data out. Currently, humans have to do this manually, which is slow, expensive, and doesn't scale up.

2. The Solution: The "Super-Reader" (LLMs)

The researchers tested three different AI models (called Large Language Models or LLMs) to see if they could act as super-fast librarians. They fed these AI models the messy kidney reports and asked them to extract specific facts (like "How many kidney filters are there?" or "What is the diagnosis?") and put them into a structured format (like a JSON file, which is just a fancy way of saying a digital data box).

They tested three "students":

Llama3 70B: The PhD student with a massive brain.
MedGemma: A specialized medical student.
Llama3 8B: A smart but smaller student.

3. The Results: Who Got the Best Grades?

The researchers compared the AI's answers against a "Gold Standard" created by two human doctors who double-checked the work.

The Big Brain Wins: The Llama3 70B model was the star of the show. It got 93.3% of the facts exactly right and 97.1% right if you allowed for tiny wording differences. It was almost as good as the human experts.
The Specialist: MedGemma also did a great job, coming in a close second.
The Small Student: The Llama3 8B model was okay, but it made more mistakes (around 80% accuracy). It was like a smart intern who sometimes missed the subtle details.

4. The Catch: The "Context Trap"

The AI was amazing at finding facts that were clearly stated.

Example: If the report said "There are 15 glomeruli," the AI instantly wrote down "15."
The Struggle: The AI stumbled when the report required interpretation.
- The Analogy: Imagine a report says, "The inflammation is bad, but only in the scarred parts of the kidney." A human doctor knows exactly what that means. The AI sometimes got confused about where the inflammation was or whether a specific pattern meant "Disease A" or just "a symptom of Disease B."
- When the AI had to make a judgment call rather than just copy a number, it made more errors.

5. The Speed Factor

The most exciting part? Speed.
The AI did the work of a human data collector 12 to 17 times faster.

The Analogy: If a human takes 1 hour to organize one file, the AI can organize 12 to 17 files in that same hour. This means researchers can suddenly analyze thousands of past cases instead of just a few dozen.

6. The Conclusion: The "Human-in-the-Loop"

The paper concludes that we don't need to replace human doctors with AI. Instead, we should use AI as a super-efficient assistant.

The Workflow: Let the AI do the heavy lifting of reading the report and filling in the easy boxes (numbers, clear diagnoses). Then, a human doctor just needs to double-check the tricky parts where the AI might be confused.
The Future: This could lead to massive, searchable databases of kidney diseases. Instead of having to hunt for rare diseases in dusty files, doctors could instantly find every case of a specific rare kidney condition to study it and find better cures.

In a nutshell: This paper proves that AI can read messy medical handwriting and turn it into clean data almost perfectly. It's not perfect yet (it needs a human to check the tricky parts), but it's fast enough to unlock a treasure trove of medical knowledge that was previously stuck in unreadable stories.

1. Problem Statement

Kidney biopsy reports are a cornerstone of nephrology, containing critical histopathological data essential for disease classification, prognosis, and treatment planning. However, these reports are predominantly generated in narrative free-text format. This unstructured nature creates significant barriers to:

Scalable data reuse: Manual extraction of data for research cohorts is time-consuming and inefficient.
Standardization: Unlike cancer pathology, which has adopted structured synoptic reporting, nephropathology lacks standardized coding, hindering the creation of large, interoperable datasets required for computational research and AI development.
Retrospective analysis: Building informative cohorts from historical free-text reports is labor-intensive, limiting the scope of data-driven nephrology research.

2. Methodology

The study investigated the capability of open-source Large Language Models (LLMs) to automate the extraction of structured data from German free-text native kidney biopsy reports.

Data Source: Native kidney biopsy pathology reports from the Institute of Pathology at RWTH Aachen University Clinic.
Models Evaluated: Three open-source LLMs were tested:
1. Llama3 70B (Large parameter count)
2. Llama3 8B (Smaller parameter count)
3. MedGemma (Medical-tuned model)
Task: The models were prompted to parse unstructured text and output a structured JSON format containing specific elements: primary diagnosis, glomerular counts, globally sclerotic glomeruli, histopathological patterns, scores, and immunohistochemical markers.
Ground Truth Creation:
- Two independent human observers manually curated the same report elements.
- Disagreements were resolved by an experienced nephropathologist to establish a final consensus ground truth.
Evaluation Metrics:
- Accuracy: Measured via Strict Matching (exact agreement) and Soft Matching (allowing for minor rephrasing or incomplete expressions).
- Inter-rater Agreement: Quantified using Cohen's Kappa (for two raters) and Light's Kappa (for three raters, including the LLM) with 95% confidence intervals calculated via 1,000-times bootstrapping.
- Efficiency: Comparison of extraction time between LLMs and human raters.

3. Key Contributions

First Comprehensive Nephropathology Extraction: This is one of the first studies to apply LLMs to the complex, nuanced field of nephropathology, which involves rare diseases, manifold histological changes, and specialized stains (e.g., electron microscopy).
Benchmarking Open-Source Models: The study provides a comparative performance analysis of different model sizes (70B vs. 8B) and specialized medical models (MedGemma) in a clinical pathology context.
Quantification of Human vs. AI Variability: The authors uniquely assessed how adding an LLM as a "third rater" affects inter-rater agreement compared to human-to-human agreement, highlighting where AI aligns with or diverges from human consensus.
Prompt Engineering Insights: The study demonstrates how specialized prompts can mitigate specific errors, particularly in distinguishing between descriptive patterns and diagnostic conclusions.

4. Results

Overall Performance:
- Llama3 70B achieved the highest accuracy: 93.3% (strict match) and 97.1% (soft match).
- MedGemma followed closely with 90.5% (strict) and 95.9% (soft).
- Llama3 8B showed significantly lower performance: 79.3% (strict) and 84.2% (soft).
Task-Specific Performance:
- High Accuracy (>95%): Models excelled at extracting explicit, discrete variables such as glomerular counts, globally sclerotic glomeruli, and positivity of immunohistochemistry markers.
- Lower Accuracy: Performance dropped for elements requiring contextual interpretation, such as the primary diagnosis and distinguishing interstitial inflammation within fibrotic vs. non-fibrotic cortex.
Inter-rater Agreement:
- Human-to-human agreement for primary diagnosis was strong ( $\kappa = 0.74$ ).
- Adding Llama3 70B or MedGemma as a third rater increased overall agreement ( $\kappa = 0.82$ and $0.78$, respectively).
- Adding Llama3 8B decreased agreement ( $\kappa = 0.71$ ), indicating the smaller model introduced more noise than value.
Efficiency: Using Llama3 70B for data extraction was 12.5 to 17.86 times faster than manual human data collection.
Error Analysis: Common errors included misinterpreting descriptive patterns (e.g., "focal segmental glomerulosclerosis" as a pattern vs. a disease) and failing to synthesize complex findings (e.g., electron microscopy results) into a final diagnosis.

5. Significance and Conclusion

Scalable Research Infrastructure: The study demonstrates that open-source LLMs can effectively transform narrative nephropathology reports into machine-readable structured data, enabling the rapid creation of large-scale retrospective cohorts essential for AI-driven research.
Hybrid Workflow Recommendation: The authors propose a human-in-the-loop approach:
- Low-risk, explicit variables (counts, markers) can be auto-populated with high confidence.
- High-risk, interpretive variables (diagnoses, complex patterns) should be pre-filled by the LLM but require expert human supervision for verification.
Future Directions: The authors suggest future work should focus on multicenter validation, multilingual support, and mapping extracted fields to controlled vocabularies (e.g., Kidney Biopsy Codes) to ensure interoperability in clinical registries.
Clinical Integration: Such systems have the potential to be integrated into routine diagnostic workflows, directly converting narrative reports into structured data at the point of care, thereby improving data quality for both clinical management and research.

Large-Language Models for data extraction from written kidney biopsy reports

1. The Problem: The "Unreadable Library"

2. The Solution: The "Super-Reader" (LLMs)

3. The Results: Who Got the Best Grades?

4. The Catch: The "Context Trap"

5. The Speed Factor

6. The Conclusion: The "Human-in-the-Loop"

1. Problem Statement

2. Methodology

3. Key Contributions

4. Results

5. Significance and Conclusion

More like this

Fragile polyQ assemblies cause Golgipathy in Huntington's disease

3-Minute Hematoxylin and Oil Red O (H-ORO) Staining Protocol for Frozen Sections of Zebrafish

Cassava witches' broom disease in French Guiana: a threat to cacao cultivation and its biodiversity?

Autopsy-based longitudinal multi-organ high-dimensional profiling reveals lineage plasticity in TRK-inhibitor-resistant secretory breast carcinoma

The K18-hACE2 mouse model of SARS-CoV-2 infection to illustrate the role and response of the vasculature in neurotropic viral infection