Automated Extraction of Cancer Registry Data from Pathology Reports: Comparing LLM-Based and Ontology-Driven NLP Platforms

This study demonstrates that an LLM-based platform (Brim Analytics) achieves high accuracy and efficient processing for extracting cancer registry data from pathology reports, outperforming an ontology-driven system (DeepPhe) particularly in T stage classification across pancreatic and breast cancer cases.

McPhaul, T., Kreimeyer, K., Baris, A., Botsis, T.

Published 2026-03-23

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are running a massive library of medical stories. Every time a patient has surgery, a doctor writes a detailed report about what they found inside the body. These reports are a gold mine for cancer researchers, but they are written in messy, unstructured language: every doctor has their own style, vocabulary, and level of detail.

To use this information for research or to track cancer trends, someone has to read every single report and turn the messy notes into neat, organized data boxes (like "Tumor Size," "Lymph Node Status," etc.). Right now, this is done by human experts, which is slow, expensive, and tiring.

This paper is a race between two different "robot librarians" to see which one can do this sorting job faster and more accurately.

The Two Contenders

1. The "Smart Intern" (Brim Analytics)

  • How it works: This system uses a Large Language Model (LLM), which is like a super-smart AI that has read millions of medical books. Instead of just looking for keywords, it "reads" the report like a human would, understanding context and nuance.
  • The Strategy: The researchers gave this AI a very specific set of instructions (a rulebook) on exactly what to look for. Think of it like a highly trained intern who knows exactly how to fill out a form based on a detailed checklist. A rough sketch of this pattern appears after this list.
  • The Result: It was incredibly accurate. It got about 97% of the answers right for pancreatic cancer and 94% for breast cancer. Even better, it didn't get confused when the report style changed from a messy paragraph to a structured checklist. It was like a translator who speaks both "Doctor-Speak" and "Data-Speak" fluently.
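To make this concrete, here is a minimal Python sketch of the "rulebook plus LLM" pattern described above. Brim Analytics' actual prompts and interfaces are not public, so the field names, the instructions, and the `call_llm` stub are illustrative assumptions, not the platform's real implementation.

```python
# A minimal sketch of checklist-driven LLM extraction (assumptions, not
# Brim Analytics' actual code). The "rulebook" becomes explicit instructions.
import json

CHECKLIST = {
    "t_stage": "AJCC pathologic T stage, e.g. 'pT2'; null if not stated",
    "n_stage": "AJCC pathologic N stage, e.g. 'pN1'; null if not stated",
    "tumor_size_mm": "greatest tumor dimension in millimeters; null if not stated",
}

def build_prompt(report_text: str) -> str:
    """Turn the registrar's checklist into instructions the model must follow."""
    fields = "\n".join(f"- {name}: {rule}" for name, rule in CHECKLIST.items())
    return (
        "Extract cancer registry fields from the pathology report below.\n"
        "Return ONLY a JSON object with these fields. If a value is not\n"
        "explicitly stated, use null rather than guessing.\n"
        f"{fields}\n\nReport:\n{report_text}"
    )

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call; canned output for demo only.
    return '{"t_stage": "pT2", "n_stage": "pN1", "tumor_size_mm": 32}'

def extract_fields(report_text: str) -> dict:
    return json.loads(call_llm(build_prompt(report_text)))

print(extract_fields("Invasive ductal carcinoma, 3.2 cm, 2 of 14 nodes positive."))
```

The key design choice is the "use null rather than guessing" instruction: it pushes the model toward the conservative, miss-rather-than-invent behavior the study reports.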

2. The "Keyword Detective" (DeepPhe)

  • How it works: This is an older, "ontology-driven" system. It doesn't really "understand" sentences; instead, it acts like a keyword detective. It has a giant dictionary of medical terms and looks for specific matches (e.g., "if I see the word 'tumor' near 'size', I will grab that number").
  • The Strategy: It relies on rigid rules and pre-defined lists. It's like a robot that only knows how to find specific words in a text.
  • The Result: It did okay on some things (like lymph nodes), but it struggled badly with others. On tumor stage (T stage, which captures a tumor's size and extent), it was wrong nearly 30% of the time on breast cancer reports. It tended to guess "yes" too often, asserting a stage where none was clearly stated. It was like a detective who only looks for the word "gun" and assumes a crime happened even if the word "gun" was just mentioned in a story about a toy. A toy version of this failure mode appears after this list.
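Here is that failure mode in toy Python form. This is not DeepPhe's actual rule set (DeepPhe is built on a full ontology and NLP pipeline, not a single regex); it just shows how a rigid pattern can fire on text where no current finding is actually asserted.

```python
# Toy pattern matcher: grabs any number followed by "cm" that appears
# near the word "tumor". Not DeepPhe's real rules; illustration only.
import re

SIZE_PATTERN = re.compile(r"tumor[^.]*?(\d+(?:\.\d+)?)\s*cm", re.IGNORECASE)

def find_tumor_size(report_text: str):
    match = SIZE_PATTERN.search(report_text)
    return float(match.group(1)) if match else None

# Fires correctly here:
print(find_tumor_size("Invasive carcinoma. Tumor measures 2.4 cm."))  # 2.4
# ...but also fires here, where no current tumor size is stated:
print(find_tumor_size("No residual tumor identified; prior lesion was 2.4 cm."))  # 2.4 (false positive)
```

A system that matches surface patterns without modeling negation or context ("no residual tumor") will keep producing exactly this kind of false positive.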

The Big Test: The "Real World" Challenge

The researchers didn't just test these robots on perfect, clean data. They threw them into the messy reality of a real hospital:

  • The Time Travel Test: They used reports from 2006 all the way to 2025. Medical writing styles change over time (like how we stopped writing letters and started texting).
  • The Style Test: Some reports were long, rambling paragraphs (narrative), while others were neat, fill-in-the-blank forms (synoptic).
  • The Switch Test: They trained the "Smart Intern" on pancreatic cancer, then asked it to do breast cancer without any retraining.

What They Found (The Takeaway)

The "Smart Intern" (Brim) won the race.

  • Adaptability: It handled the messy, old reports and the new reports equally well.
  • Generalization: Even though it was only taught about pancreatic cancer, it figured out breast cancer almost as well. It understood that "tumor" means the same thing, even if the doctor wrote it differently.
  • Safety: When it made a mistake, it was usually "conservative" (it missed a detail rather than inventing one). In medicine, it is often safer to miss a detail and have a human check it than to invent a fake detail and cause panic. The small worked example after this list shows why.
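The trade-off in that last bullet is the classic precision-versus-recall distinction. A small worked example, with made-up numbers rather than the paper's results:

```python
# Two extractors, each making 10 errors against 100 true values.
# Numbers are illustrative, not taken from the paper.
def precision_recall(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)  # of the values asserted, how many were right
    recall = tp / (tp + fn)     # of the true values, how many were found
    return precision, recall

# Conservative extractor: misses 10 values, invents none.
print(precision_recall(tp=90, fp=0, fn=10))   # (1.0, 0.9)
# Over-eager extractor: finds every value but invents 10 extra.
print(precision_recall(tp=100, fp=10, fn=0))  # (0.909..., 1.0)
```

Both make ten mistakes, but the conservative one leaves gaps a human can fill, while the over-eager one writes wrong values into the registry.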

The "Keyword Detective" (DeepPhe) struggled.

  • It fell apart when the reports weren't perfectly structured.
  • It made up data (false positives) more often, which is dangerous in a medical setting.
  • It couldn't adapt well when switching from one type of cancer to another.

The Speed

Both robots were fast. They could process a report in less than 5 seconds. The "Smart Intern" was slightly slower on complex reports, but both were thousands of times faster than a human.

The Bottom Line

This study suggests that the future of cancer data isn't just about having more computers; it's about having smarter computers that can read and understand context, not just search for keywords.

The Analogy for the Future:
Imagine a hospital where the "Smart Intern" reads every pathology report the moment it's written. It fills out 95% of the data forms automatically. Then, a human expert (the tumor registrar) just has to review the 5% of cases where the AI was unsure or the report was weird. This turns a job that takes hours into a job that takes minutes, allowing humans to focus on the complex cases rather than data entry.
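A minimal sketch of that triage workflow, assuming the extractor can attach a confidence score to each field (the 0.9 threshold and the field names are assumptions, not from the paper):

```python
# Route each extracted field: auto-accept confident values, queue the
# rest for a human registrar. Threshold and fields are illustrative.
def triage(extraction: dict, confidences: dict, threshold: float = 0.9):
    auto_filled, needs_review = {}, {}
    for field, value in extraction.items():
        if value is not None and confidences.get(field, 0.0) >= threshold:
            auto_filled[field] = value
        else:
            needs_review[field] = value  # a registrar checks these by hand
    return auto_filled, needs_review

auto, review = triage(
    {"t_stage": "pT2", "n_stage": "pN1", "tumor_size_mm": None},
    {"t_stage": 0.98, "n_stage": 0.95, "tumor_size_mm": 0.40},
)
print("auto:", auto)      # confident fields go straight to the registry
print("review:", review)  # uncertain or missing fields go to a human
```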

In short: AI is ready to help, but it needs to be the kind of AI that understands language, not just the kind that searches for words.
