OncoRAG: Graph-Based Retrieval Enabling Clinical Phenotyping from Oncology Notes Using Local Mid-Size Language Models

⚕️

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content. Read full disclaimer

Imagine you are a detective trying to solve a complex medical mystery. You have a massive library of patient files (the Electronic Health Records), but most of the clues aren't neatly typed into a spreadsheet. Instead, they are buried inside thousands of pages of handwritten notes, doctor's letters, and messy reports.

Finding a specific clue—like "Did this patient have high blood pressure?" or "What was the exact type of tumor?"—usually requires a human detective to read every single page. This is slow, expensive, and impossible to do for thousands of patients at once.

Enter OncoRAG. Think of OncoRAG as a super-smart, tireless robot assistant that can read these messy notes, understand the context, and pull out the exact facts you need in a matter of hours instead of weeks.

Here is how it works, broken down into simple concepts:

1. The Problem: The "Needle in a Haystack"

Doctors write notes in natural language. They might say, "Patient denies family history of breast cancer," or "Tumor was found in the upper outer quadrant." A standard computer search might just look for the word "cancer" and miss the fact that the patient doesn't have it, or it might get confused by the specific location.

Previous attempts to automate this required either:

Super-computers: Using massive, expensive AI models that need huge data centers (like trying to use a nuclear reactor to toast a slice of bread).
Heavy Training: Teaching the AI specific rules for every single disease, which takes years of work.

2. The Solution: The "Smart Librarian" (OncoRAG)

The researchers built a system called OncoRAG. Instead of just reading the text, it acts like a master librarian who builds a map of the patient's story before trying to find the answer.

Here is the 4-step process, using a Detective's Case File analogy:

Step 1: The Wishlist (Configuration)
The system is told what clues it needs to find (e.g., "Find the tumor size"). It then consults a medical dictionary (an ontology) to learn all the different ways a doctor might describe that clue (e.g., "mass," "lesion," "growth"). It's like the detective making a list of all possible aliases for the suspect.
Step 2: Drawing the Map (Knowledge Graph)
Instead of just reading the notes linearly, the system scans the text and builds a web of connections. It links "Patient" to "Tumor," "Tumor" to "Date," and "Tumor" to "Treatment."
- Analogy: Imagine the patient's notes are a messy pile of puzzle pieces. OncoRAG doesn't just look at the pile; it snaps the pieces together to see the picture. If the note says "No family history," the map connects "Family History" to "Negative."
Step 3: The Smart Search (Graph-Based Retrieval)
When the system needs to find a specific fact, it doesn't just search for keywords. It looks at its map. It follows the connections to find the most relevant sentences.
- The Magic Trick: It uses a technique called "Graph Diffusion." Imagine dropping a drop of ink in water; it spreads out to connected areas. The system lets the "search" spread through the map to find related clues that a simple keyword search would miss. It then ranks the best sentences to show the AI.
Step 4: The Final Answer (The Mid-Size AI)
Now, the system hands the top 5 most relevant sentences to a "Mid-Size" AI (a model that is smart but not massive). Because the AI only has to read the relevant parts (thanks to the map), it doesn't get confused or make up facts (hallucinate). It simply extracts the answer and puts it in a neat table.

3. Why This is a Big Deal

It's Local: You don't need to send sensitive patient data to the cloud or a giant server farm. This system can run on a standard computer in a hospital basement. It's like having a private detective in your own office rather than hiring a massive agency.
It's Fast: In the study, extracting data for 104 patients took about 2.5 hours with the robot. Doing it by hand took two weeks. That's a speed-up of roughly 100x.
It's Accurate: The robot was tested on patients with breast cancer and brain cancer (in both English and German). It got the facts right about 80% of the time, which is nearly as good as a human expert.
It Works for the Future: The researchers used the robot's data to predict patient survival rates. The predictions were just as accurate as those made by human experts. This means the robot isn't just good at finding facts; it's good at helping doctors make life-saving decisions.

The Bottom Line

OncoRAG is like giving a super-powered magnifying glass to medical researchers. It turns a chaotic mountain of paper notes into a clean, organized database without needing a supercomputer or sending private data to the internet. It proves that you don't need the biggest, most expensive AI to solve big problems; sometimes, you just need a smarter way to look at the clues.

1. Problem Statement

The extraction of structured clinical data from unstructured oncology notes (e.g., consultation reports, pathology findings) is a critical bottleneck in cancer research and personalized medicine.

Current Limitations: Manual chart abstraction is labor-intensive and impractical for large-scale studies. Existing automated solutions often rely on:
- Rule-based systems: Lack generalizability beyond predefined patterns.
- Supervised Machine Learning: Require large, manually labeled training datasets for every specific feature.
- Large Language Models (LLMs): Typically require massive models (70B+ parameters), dedicated cloud infrastructure, or task-specific fine-tuning, which increases computational costs and data privacy risks.
- Standard RAG (Retrieval-Augmented Generation): Traditional vector-based retrieval often fails to capture complex clinical relationships, temporal sequences, and entity co-occurrences necessary for accurate phenotyping.

2. Methodology: The OncoRAG Pipeline

The authors developed OncoRAG, a four-phase pipeline designed to extract clinical features using a locally deployed mid-size language model (14B parameters) without task-specific fine-tuning. The system operates on unstructured notes only.

Phase 1: Automated Feature Configuration & Ontology Enrichment

Features are defined by name, data type, and expected categories.
Ontology Enrichment: Non-demographic features are enriched using UMLS and BioPortal APIs to generate Concept Unique Identifiers (CUIs) and synonyms.
Keyword Expansion: WordNet and the LLM are used to expand keywords and synonyms for search queries.

Phase 2: Clinical Knowledge Graph Construction

Named Entity Recognition (NER): Biomedical NER models (supporting English and German) extract entities (diseases, meds, procedures, anatomy).
Context Detection: Tools like medspaCy detect modifiers (negation, temporality, family history).
Graph Construction: Entities are clustered using SapBERT embeddings and organized into a graph using NetworkX. Nodes represent patients, notes, dates, and entities; edges link patients to notes, notes to entities, and co-occurring entities to each other. This captures entity co-occurrence rather than just semantic similarity.

Phase 3: Entity Indexing and Context Retrieval

Multi-Stage Retrieval:
1. Query Expansion: Generates ~60 search terms.
2. Semantic Search: Retrieves top 30 matching entities via SapBERT embeddings.
3. Structural Expansion: Uses Breadth-First Search (2 hops) on the knowledge graph to capture related nodes.
Graph-Diffusion Reranking: A sophisticated scoring function ranks candidate sentences to select the top 5 for extraction. The score ( $S_{retrieval}$ $S_{r e t r i e v a l}$ ) combines:
- Semantic similarity (BioBERT-NLI).
- Lexical overlap with keywords/synonyms.
- Name/synonym matching.
- Graph-diffusion score: Smoothing embeddings across neighbors in the graph.
- Keyword boosts.
Filtering: Filters out hypothetical/planning mentions while retaining negated or historical evidence for interpretation.

Phase 4: LLM-Based Feature Extraction

Model: Microsoft Phi-3-medium-instruct (14B parameters) runs locally via Ollama.
Prompting: Structured prompts containing the task definition, feature description, interpretation rules, and the top 5 retrieved context sentences are submitted to the LLM.
Temporal Anchoring: For time-varying features, values are anchored to specific clinical events (e.g., diagnosis date).
Output: The model returns structured values (categorical letters or normalized numbers) and a self-assessed confidence level.

3. Key Contributions

Graph-Based RAG for Clinical Notes: Unlike standard RAG that retrieves text chunks, OncoRAG constructs a clinical knowledge graph to leverage entity co-occurrence and temporal structure, significantly improving context retrieval accuracy.
Local Mid-Size Model Efficiency: Demonstrates that a 14B parameter model (Phi-3-medium), deployed locally, can achieve research-grade accuracy comparable to manual curation, eliminating the need for massive 70B+ models or cloud-based APIs.
Cross-Lingual and Cross-Institutional Validation: The pipeline was validated on:
- TNBC (English): Triple-negative breast cancer notes from Mass General.
- RiCi (German): Recurrent high-grade glioma notes from Heidelberg.
- MIMIC-IV (English): External public ICU dataset.
No Fine-Tuning Required: The system achieves high performance using prompt engineering and ontology enrichment alone, avoiding the data privacy and resource costs of fine-tuning.

4. Results

Extraction Accuracy:

Performance: OncoRAG achieved mean F1 scores of 0.80 (TNBC), 0.79 (RiCi), and 0.84 (MIMIC-IV).
Comparisons:
- Outperformed Direct LLM Prompting (no retrieval) by +0.19 to +0.22 F1 points.
- Outperformed Naive Vector-Based RAG by +0.17 to +0.19 F1 points.
- The Graph-Diffusion Reranking step alone improved F1 by 0.04–0.10 over raw semantic retrieval.
Configuration: A "Hybrid" configuration (manual refinement of automatic settings) slightly improved F1 scores (up to 0.83) in institutional cohorts but showed no gain in MIMIC-IV, suggesting automatic ontology enrichment is robust for well-documented features.

Efficiency and Scalability:

Speed: Extraction took 1.7–1.9 seconds per feature with the 14B model.
Model Size Trade-off: Using a smaller 3.8B model reduced extraction time by 57% with a minimal F1 drop (0.03–0.10).
Time Savings: Reduced manual abstraction time for TNBC from ~2 weeks to under 2.5 hours.

Downstream Utility:

Prognostic Modeling: Survival models (3-year progression-free survival) built using automatically extracted features achieved a C-index of 0.77, statistically comparable to models built on manually curated data (C-index 0.76).

Error Analysis:

62% of errors were due to retrieval failures (relevant context not in the top 5 sentences), indicating that further gains depend on better retrieval strategies rather than larger LLMs.
38% were due to LLM interpretation errors despite correct context.

5. Significance

Democratization of Oncology Research: By enabling accurate extraction using locally deployed, mid-size models, OncoRAG lowers the barrier to entry for institutions without access to massive GPU clusters or cloud computing budgets.
Data Privacy: The local deployment ensures sensitive patient data never leaves the institution, addressing critical HIPAA/GDPR concerns.
Scalability: The approach transforms the labor-intensive bottleneck of chart abstraction into an automated process, enabling large-scale real-world evidence generation and multimodal outcome prediction (linking clinical notes with imaging/genomics).
Generalizability: The successful cross-lingual (English/German) and cross-institutional validation proves the pipeline's robustness for diverse clinical environments.

OncoRAG: Graph-Based Retrieval Enabling Clinical Phenotyping from Oncology Notes Using Local Mid-Size Language Models

1. The Problem: The "Needle in a Haystack"

2. The Solution: The "Smart Librarian" (OncoRAG)

3. Why This is a Big Deal

The Bottom Line

1. Problem Statement

2. Methodology: The OncoRAG Pipeline

3. Key Contributions

4. Results

5. Significance

More like this

A feasibility study on combining Ayurvedic dietary knowledge and modern nutrition to personalise diets for cancer patients

A Real-World Retrospective Study of Sintilimab in Combination with Neoadjuvant Chemotherapy for Triple-Negative Breast Cancer

Backfill Bayesian Ordered Lattice Design for Phase I Clinical Trials

Cell-free chromatin epigenomic profiling enables non-invasive pancreatic cancer cell-state identification

Clinical and pathological characteristics of thin cutaneous melanomas with rapid recurrence.