End-to-End PET/CT Interpretation and Quantification with an LLM-Orchestrated AI Agent: A Real-World Pilot Study

This pilot study demonstrates that an LLM-orchestrated AI agent can automate the end-to-end workflow of PET/CT interpretation, from raw DICOM data to structured reporting, in 170 lung cancer patients. The agent achieved perfect primary tumor detection but showed systematic limitations in nodal and metastatic assessment, so continued expert oversight remains necessary.

Choi, H., Bae, S., Na, K. J.

Published 2026-02-25

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a master chef (the radiologist) trying to cook a complex meal (diagnosing a patient) using a massive, chaotic pantry of ingredients (the raw medical images). Usually, you have to spend hours sorting through boxes, measuring spices, chopping vegetables, and then finally cooking the dish.

This paper introduces a new AI Sous-Chef that doesn't just chop vegetables; it manages the entire kitchen workflow from start to finish.

Here is the breakdown of their "AI Agent" in simple terms:

1. The Problem: The Chaotic Pantry

In a hospital, PET/CT scans (which show where cancer is glowing in the body) come in all different shapes and sizes. Some are labeled weirdly, some have missing data, and the machines that take the pictures are all different.

  • Old AI: Previous AI tools were like specialized robots that could only do one thing perfectly, like "slice the onions" (find a tumor) or "measure the salt" (calculate a number). But they couldn't talk to each other, and they couldn't handle the mess of the real world.
  • The Goal: The researchers wanted an AI that could walk into the messy pantry, figure out which ingredients are good, chop them, cook them, and write the recipe card (the medical report) all by itself.

2. The Solution: The "Brain" and the "Hands"

The researchers built a system with three layers, which they call an LLM-Orchestrated Agent. Think of it like a Conductor of an Orchestra:

  • The Conductor (The "Brain"): This is a Large Language Model (like a super-smart text AI). It doesn't look at the pictures directly. Instead, it reads the "sheet music" (the patient's data and the doctor's request) and tells the other musicians what to do. It decides: "Okay, we need to find the lung tumor first. Let's ask Tool A to slice the image, then ask Tool B to measure the glow, and finally ask Tool C to write the summary."
  • The Musicians (The "Hands"): These are specialized AI tools that are already good at specific jobs.
    • The Slicer: Finds the tumors.
    • The Measurer: Calculates how "hot" (active) the tumors are.
    • The Painter: Draws the outlines on the images.
  • The Process: The Conductor grabs the raw, messy data, tells the Slicer to work, checks if the result looks right, and if it fails, it says, "Okay, try a different method," and keeps going until it has a full report.
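The conductor-and-musicians pattern above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: every function name, tool, and value below is hypothetical, the "plan" is hard-coded where the real system would have the LLM choose each step, and real tools would operate on actual image volumes rather than toy dictionaries.

```python
from typing import Callable

# --- The Musicians: specialized single-task tools (all hypothetical) ------
def segment_lesions(scan: dict) -> dict:
    """The Slicer: outline candidate lesions on the PET/CT volume."""
    if "pet" not in scan:
        raise ValueError("missing PET series")  # messy real-world data
    return {**scan, "lesions": ["primary_tumor"]}

def measure_uptake(scan: dict) -> dict:
    """The Measurer: quantify how metabolically 'hot' each lesion is."""
    return {**scan, "suv_max": 8.4}  # illustrative value only

def draft_report(scan: dict) -> dict:
    """The Writer: turn measurements into a structured draft report."""
    return {**scan, "report": f"Lesions: {scan['lesions']}, SUVmax {scan['suv_max']}"}

TOOLS: dict[str, Callable[[dict], dict]] = {
    "segment": segment_lesions,
    "measure": measure_uptake,
    "report": draft_report,
}

# --- The Conductor: runs the plan and retries failed steps ----------------
def run_agent(scan: dict, plan: list[str], max_retries: int = 2) -> dict:
    state = scan
    for step in plan:
        for attempt in range(max_retries + 1):
            try:
                state = TOOLS[step](state)
                break  # step succeeded, move on to the next one
            except Exception:
                if attempt == max_retries:
                    raise  # give up and flag the case for human review
    return state

result = run_agent({"pet": "...", "ct": "..."}, ["segment", "measure", "report"])
print(result["report"])
```

The key design point is the separation of concerns: the conductor never touches pixels, it only sequences tools and handles failures, which is what lets the system cope with messy, heterogeneous inputs.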

3. The Test: The "Stress Test"

They tested this AI on 170 real patients with lung cancer. They gave the AI the raw, messy data and asked it to do the whole job: find the cancer, check if it spread to lymph nodes, check if it spread to other organs, and write a draft report.

The Results:

  • The Main Course (Primary Tumor): The AI was perfect. It found the main lung tumor in 100% of the cases. It was like a master chef who never misses the main ingredient.
  • The Side Dishes (Lymph Nodes): The AI was good at finding them (85% success) but got a bit paranoid. It often thought normal, healthy lymph nodes were cancerous (false alarms). It's like a sous-chef who thinks a speck of dust is a rock and tries to remove it.
  • The Dessert (Distant Metastasis): When checking if cancer spread to other organs (like the liver or bones), the AI was okay (about 70% success). It sometimes missed tiny, hidden spots (false negatives) or got confused by normal body processes (like digestion) that looked like cancer (false positives).

4. The Verdict: A Helpful Assistant, Not a Replacement

The researchers conclude that this AI is not ready to replace the doctor.

  • Why? Because while it's great at the heavy lifting (sorting data, measuring, drawing), it still gets confused by tricky, borderline cases.
  • The Real Value: It acts as a super-efficient assistant. It does all the boring, repetitive math and data sorting in seconds, giving the doctor a "draft report" to review. The doctor can then focus on the tricky decisions, like "Is this weird spot actually cancer, or just inflammation?"

The Big Picture Analogy

Think of this AI as a GPS for a road trip.

  • The GPS (the AI Agent) can instantly plot the route, calculate the fuel, and tell you the traffic conditions (the quantitative data and draft report).
  • However, if there is a sudden landslide or a confusing detour (a complex medical case), the Driver (the human doctor) still needs to take the wheel and make the final decision.

In short: This paper proves that we can build an AI that manages the whole hospital workflow, not just one tiny task. It's a huge step toward making medical imaging faster and more consistent, but for now, it works best when it sits next to a human expert, not in their place.
