Imagine you are a detective trying to solve a mystery that unfolds over several years. The clues aren't hidden in a safe or a diary; they are scattered across hundreds of messy, handwritten police reports. Each report describes a suspect (a tumor) in a different way: sometimes the suspect is big, sometimes small, sometimes they've moved, and sometimes they've disappeared.
The problem? These reports are written in "human language," full of paragraphs, weird abbreviations, and inconsistent formatting. A regular computer program is like a robot that can only read a spreadsheet; if the data isn't in a perfect grid, the robot gets confused and crashes.
This paper is about building a super-smart, privacy-focused detective that can read these messy reports, understand the story, and create a perfect timeline of the suspect's movements.
Here is the breakdown of their solution:
1. The Problem: The "Messy Notebook"
Doctors write radiology reports to track cancer. They follow a standard rulebook called RECIST (Response Evaluation Criteria in Solid Tumors), like a set of strict rules for measuring a suspect's height. But doctors record those measurements in long, flowing sentences.
- The Issue: To study cancer trends, researchers need to turn these messy sentences into a clean spreadsheet. Doing this manually is like trying to copy a library of books by hand—it takes forever and is prone to human error.
- The Privacy Wall: Most powerful AI tools (Large Language Models) are like "Black Boxes" owned by big tech companies. You have to send your patient data to their servers to get an answer. In healthcare, this is a no-go zone because patient data must stay private and never leave the hospital.
2. The Solution: The "Local Librarian"
The authors built a system that acts like a local librarian who lives inside the hospital's own building.
- Open Source: Instead of renting a Black Box, they built their own tool using free, open-source software.
- Locally Deployable: This means the AI runs on the hospital's own computers. The patient data never leaves the building. It's like hiring a private investigator who works out of your office rather than sending your files to a stranger's office.
- The Brain: They used a specific AI model called Qwen2.5-72b. Think of this as a very well-read detective who has studied millions of medical texts but is smart enough to follow the specific rules of the RECIST game.
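In practice, "locally deployable" usually means serving the open-weights model behind an API on the hospital's own hardware, so reports never cross the network boundary. Here is a minimal sketch of what a request to such a setup might look like, assuming a hypothetical OpenAI-compatible local endpoint (the kind tools like vLLM expose); the URL, prompt wording, and sample report are illustrative, not taken from the paper.

```python
import json

# Everything below is illustrative: the endpoint URL, prompt wording,
# and report text are assumptions, NOT details from the paper.
LOCAL_ENDPOINT = "http://localhost:8000/v1/chat/completions"  # stays inside the hospital network

def build_extraction_request(report_text: str) -> dict:
    """Package a radiology report into a chat request asking the model
    to return RECIST lesion data as structured JSON."""
    return {
        "model": "Qwen2.5-72b",
        "messages": [
            {"role": "system",
             "content": "Extract RECIST target, non-target, and new lesions "
                        "from the report. Reply with JSON only."},
            {"role": "user", "content": report_text},
        ],
        "temperature": 0,  # deterministic output for reproducible extraction
    }

request = build_extraction_request("CT chest: 14 mm nodule, left upper lobe.")
print(json.dumps(request, indent=2))
```

The key point is architectural, not the exact fields: the request goes to `localhost`, so the patient data never leaves the building.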
3. The Mission: Tracking the "Three Types of Suspects"
The system was trained to look for three specific things in the reports and link them across time:
- Target Lesions (TLs): The main suspects the doctors are watching closely.
- Non-Target Lesions (NTLs): The "hangers-on" or smaller suspects that are still there but not the main focus.
- New Lesions (NLs): Brand new suspects that have appeared since the last report.
The tricky part? The AI had to realize that "The lump in the left lung mentioned in January" is the same lump mentioned as "The mass in the left lung" in March. It had to connect the dots across time.
4. The Test: The "Double-Check"
To see if their detective was any good, they gave it 50 pairs of reports (a "before" and "after" for 50 patients) and asked it to create a timeline.
- The Judges: Two human experts (senior detectives) also looked at the same reports and created their own timelines.
- The Score: They compared the AI's timeline to the humans' timelines.
The Results were impressive:
- The AI got the size of the tumors right 93.7% of the time.
- It correctly identified new tumors 94% of the time.
- It correctly linked the same tumor across different reports 95% of the time.
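Agreement scores like those above boil down to a field-by-field comparison between the model's timeline and the expert reference. A generic sketch (the paper's exact scoring protocol may differ):

```python
def agreement(model_values: list, expert_values: list) -> float:
    """Fraction of fields where the model matches the expert reference."""
    assert len(model_values) == len(expert_values)
    matches = sum(m == e for m, e in zip(model_values, expert_values))
    return matches / len(model_values)

# Toy example: 8 extracted tumor sizes, one disagreeing with the expert.
model  = [14, 22, 9, 31, 12, 7, 18, 25]
expert = [14, 22, 9, 31, 12, 7, 18, 24]
print(f"{agreement(model, expert):.1%}")  # → 87.5%
```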
5. The Hiccups: When the "Paper" Gets Crumpled
Even a super-smart detective makes mistakes. The paper notes a few funny scenarios where the AI got confused:
- The "Wrapped" Table: Sometimes doctors write a table that spills over onto the next line. The AI sometimes grabbed the wrong number because it lost track of which column it was in.
- The "Not Measurable" Confusion: If a doctor wrote "too small to measure" or used a dash (–), the AI sometimes got confused about whether to write "0" or "unknown."
- The "Group" vs. "Individual" Problem: Sometimes a doctor says "a bunch of lymph nodes" in one report, and then lists them one by one in the next. The AI struggled to realize these were the same group of suspects.
The Big Takeaway
This paper proves that you don't need a billion-dollar, secret AI from a tech giant to analyze medical data. You can build a privacy-safe, open-source AI that runs on your own computers, understands the messy story of a patient's cancer journey, and turns it into clean, usable data.
It's like giving every hospital a personal, super-intelligent assistant that never forgets a detail, never leaks a secret, and can do in seconds what would take a human team weeks to do. This opens the door for massive, high-quality cancer research that was previously impossible due to privacy fears and manual labor.