Automated Extraction of Cancer Registry Data from Pathology Reports: Comparing LLM-Based and Ontology-Driven NLP Platforms

This study demonstrates that an LLM-based platform (Brim Analytics) achieves high accuracy and efficient processing for extracting cancer registry data from pathology reports, outperforming an ontology-driven system (DeepPhe) particularly in T stage classification across pancreatic and breast cancer cases.

McPhaul, T., Kreimeyer, K., Baris, A., Botsis, T.

Published 2026-03-23

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are running a massive library of medical stories. Every time a patient has surgery, a doctor writes a detailed report about what they found inside the body. These reports are a gold mine for cancer researchers, but they are written in messy, unstructured language: every doctor has their own style, vocabulary, and level of detail.

To use this information for research or to track cancer trends, someone has to read every single report and turn the messy notes into neat, organized data boxes (like "Tumor Size," "Lymph Node Status," etc.). Right now, this is done by human experts, which is slow, expensive, and tiring.

This paper is a race between two different "robot librarians" to see which one can do this sorting job faster and more accurately.

The Two Contenders

1. The "Smart Intern" (Brim Analytics)

  • How it works: This system uses a Large Language Model (LLM), which is like a super-smart AI that has read millions of medical books. Instead of just looking for keywords, it "reads" the report like a human would, understanding context and nuance.
  • The Strategy: The researchers gave this AI a very specific set of instructions (a rulebook) on exactly what to look for. Think of it like a highly trained intern who knows exactly how to fill out a form based on a detailed checklist. A rough sketch of this pattern appears after this list.
  • The Result: It was incredibly accurate. It got about 97% of the answers right for pancreatic cancer and 94% for breast cancer. Even better, it didn't get confused when the report style changed from a messy paragraph to a structured checklist. It was like a translator who speaks both "Doctor-Speak" and "Data-Speak" fluently.
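To make this concrete, here is a minimal Python sketch of the "rulebook plus LLM" pattern described above. Brim Analytics' actual prompts and interfaces are not public, so the field names, the instructions, and the `call_llm` stub are illustrative assumptions, not the platform's real implementation.

```python
# A minimal sketch of checklist-driven LLM extraction (assumptions, not
# Brim Analytics' actual code). The "rulebook" becomes explicit instructions.
import json

CHECKLIST = {
    "t_stage": "AJCC pathologic T stage, e.g. 'pT2'; null if not stated",
    "n_stage": "AJCC pathologic N stage, e.g. 'pN1'; null if not stated",
    "tumor_size_mm": "greatest tumor dimension in millimeters; null if not stated",
}

def build_prompt(report_text: str) -> str:
    """Turn the registrar's checklist into instructions the model must follow."""
    fields = "\n".join(f"- {name}: {rule}" for name, rule in CHECKLIST.items())
    return (
        "Extract cancer registry fields from the pathology report below.\n"
        "Return ONLY a JSON object with these fields. If a value is not\n"
        "explicitly stated, use null rather than guessing.\n"
        f"{fields}\n\nReport:\n{report_text}"
    )

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; canned output for demo only.
    return '{"t_stage": "pT2", "n_stage": "pN1", "tumor_size_mm": 32}'

def extract_fields(report_text: str) -> dict:
    return json.loads(call_llm(build_prompt(report_text)))

print(extract_fields("Invasive ductal carcinoma, 3.2 cm, 2 of 14 nodes positive."))
```

The key design choice is the "use null rather than guessing" instruction: it pushes the model toward the conservative, miss-rather-than-invent behavior the study reports.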

2. The "Keyword Detective" (DeepPhe)

  • How it works: This is an older, "ontology-driven" system. It doesn't really "understand" sentences; instead, it acts like a keyword detective. It has a giant dictionary of medical terms and looks for specific matches (e.g., "if I see the word 'tumor' near 'size', I will grab that number").
  • The Strategy: It relies on rigid rules and pre-defined lists. It's like a robot that only knows how to find specific words in a text.
  • The Result: It did okay on some things (like lymph nodes), but it struggled badly with others. On tumor stage (T stage, which captures a tumor's size and extent), it was wrong nearly 30% of the time on breast cancer reports. It tended to guess "yes" too often, asserting a stage where none was clearly stated. It was like a detective who only looks for the word "gun" and assumes a crime happened even if the word "gun" was just mentioned in a story about a toy. A toy version of this failure mode appears after this list.
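Here is that failure mode in toy Python form. This is not DeepPhe's actual rule set (DeepPhe is built on a full ontology and NLP pipeline, not a single regex); it just shows how a rigid pattern can fire on text where no current finding is actually asserted.

```python
# Toy pattern matcher: grabs any number followed by "cm" that appears
# near the word "tumor". Not DeepPhe's real rules; illustration only.
import re

SIZE_PATTERN = re.compile(r"tumor[^.]*?(\d+(?:\.\d+)?)\s*cm", re.IGNORECASE)

def find_tumor_size(report_text: str):
    match = SIZE_PATTERN.search(report_text)
    return float(match.group(1)) if match else None

# Fires correctly here:
print(find_tumor_size("Invasive carcinoma. Tumor measures 2.4 cm."))  # 2.4
# ...but also fires here, where no current tumor size is stated:
print(find_tumor_size("No residual tumor identified; prior lesion was 2.4 cm."))  # 2.4 (false positive)
```

A system that matches surface patterns without modeling negation or context ("no residual tumor") will keep producing exactly this kind of false positive.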

The Big Test: The "Real World" Challenge

The researchers didn't just test these robots on perfect, clean data. They threw them into the messy reality of a real hospital:

  • The Time Travel Test: They used reports from 2006 all the way to 2025. Medical writing styles change over time (like how we stopped writing letters and started texting).
  • The Style Test: Some reports were long, rambling paragraphs (narrative), while others were neat, fill-in-the-blank forms (synoptic).
  • The Switch Test: They trained the "Smart Intern" on pancreatic cancer, then asked it to do breast cancer without any retraining.

What They Found (The Takeaway)

The "Smart Intern" (Brim) won the race.

  • Adaptability: It handled the messy, old reports and the new reports equally well.
  • Generalization: Even though it was only taught about pancreatic cancer, it figured out breast cancer almost as well. It understood that "tumor" means the same thing, even if the doctor wrote it differently.
  • Safety: When it made a mistake, it was usually "conservative" (it missed a detail rather than inventing one). In medicine, it is often safer to miss a detail and have a human check it than to invent a fake detail and cause panic. The small worked example after this list shows why.
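The trade-off in that last bullet is the classic precision-versus-recall distinction. A small worked example, with made-up numbers rather than the paper's results:

```python
# Two extractors, each making 10 errors against 100 true values.
# Numbers are illustrative, not taken from the paper.
def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)  # of the values asserted, how many were right
    recall = tp / (tp + fn)     # of the true values, how many were found
    return precision, recall

# Conservative extractor: misses 10 values, invents none.
print(precision_recall(tp=90, fp=0, fn=10))   # (1.0, 0.9)
# Over-eager extractor: finds every value but invents 10 extra.
print(precision_recall(tp=100, fp=10, fn=0))  # (0.909..., 1.0)
```

Both make ten mistakes, but the conservative one leaves gaps a human can fill, while the over-eager one writes wrong values into the registry.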

The "Keyword Detective" (DeepPhe) struggled.

  • It fell apart when the reports weren't perfectly structured.
  • It made up data (false positives) more often, which is dangerous in a medical setting.
  • It couldn't adapt well when switching from one type of cancer to another.

The Speed

Both robots were fast. They could process a report in less than 5 seconds. The "Smart Intern" was slightly slower on complex reports, but both were thousands of times faster than a human.

The Bottom Line

This study suggests that the future of cancer data isn't just about having more computers; it's about having smarter computers that can read and understand context, not just search for keywords.

The Analogy for the Future:
Imagine a hospital where the "Smart Intern" reads every pathology report the moment it's written. It fills out 95% of the data forms automatically. Then, a human expert (the tumor registrar) just has to review the 5% of cases where the AI was unsure or the report was weird. This turns a job that takes hours into a job that takes minutes, allowing humans to focus on the complex cases rather than data entry.
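A minimal sketch of that triage workflow, assuming the extractor can attach a confidence score to each field (the 0.9 threshold and the field names are assumptions, not from the paper):

```python
# Route each extracted field: auto-accept confident values, queue the
# rest for a human registrar. Threshold and fields are illustrative.
def triage(extraction: dict, confidences: dict, threshold: float = 0.9):
    auto_filled, needs_review = {}, {}
    for field, value in extraction.items():
        if value is not None and confidences.get(field, 0.0) >= threshold:
            auto_filled[field] = value
        else:
            needs_review[field] = value  # a registrar checks these by hand
    return auto_filled, needs_review

auto, review = triage(
    {"t_stage": "pT2", "n_stage": "pN1", "tumor_size_mm": None},
    {"t_stage": 0.98, "n_stage": 0.95, "tumor_size_mm": 0.40},
)
print("auto:", auto)      # confident fields go straight to the registry
print("review:", review)  # uncertain or missing fields go to a human
```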

In short: AI is ready to help, but it needs to be the kind of AI that understands language, not just the kind that searches for words.
