Imagine you are a dentist looking at a panoramic X-ray of a patient's mouth. It's a huge image showing every single tooth, the jawbone, the sinuses, and the nerves. Traditionally, to diagnose problems, a dentist has to act like a detective with a magnifying glass, scanning the whole picture, then zooming in on specific teeth, checking for cavities, checking bone density, and counting how many teeth are missing.
Now, imagine trying to teach a computer to do this.
The Problem: The "Jack-of-All-Trades" vs. The "Specialist"
For a long time, we've had two types of AI trying to help:
- The Generalist (Vision Language Models): Think of this like a very smart, well-read student who has read every medical textbook. They can describe the X-ray in great detail and answer questions like, "Is there a cavity?" However, because they try to do everything at once, they often miss small details or get confused about exactly which tooth has the problem. They are good at conversation but bad at precise diagnosis.
- The Specialist (Traditional AI): These are like master craftsmen who only know how to do one thing perfectly, like counting teeth or spotting bone loss. But they can't talk to each other. If you want a full report, you have to run the "counting robot," then the "cavity robot," then the "bone robot" separately and try to stitch their answers together. It's slow and messy.
The Solution: OPGAgent (The "Orchestrator")
The authors of this paper built OPGAgent, which isn't just one smart brain; it's a super-efficient project manager that runs a team of specialists.
Think of OPGAgent as a construction site foreman looking at a blueprint (the X-ray). Instead of trying to lay every brick himself, he coordinates a team:
The "Scout" (Hierarchical Evidence Gathering):
- First, the foreman looks at the whole site (the full X-ray) to get the big picture.
- Then, he divides the site into four neighborhoods (the four quadrants of the mouth).
- Finally, he zooms in on individual houses (each tooth) to check for specific cracks or damage.
- Why this matters: It stops the AI from getting overwhelmed. It checks the whole mouth first, then the neighborhood, then the specific house, ensuring nothing is missed.
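The coarse-to-fine loop above can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual implementation: the `analyze` stub stands in for a vision-model call, and the region names and `Finding` type are assumptions.

```python
# Hypothetical sketch of coarse-to-fine evidence gathering:
# whole image -> quadrants -> individual teeth.
from dataclasses import dataclass

@dataclass
class Finding:
    region: str   # "full", a quadrant name, or a tooth label
    note: str     # what was observed at this level

def analyze(region: str) -> Finding:
    """Stand-in for a vision-model call on one cropped region."""
    return Finding(region=region, note=f"checked {region}")

def gather_evidence(teeth_by_quadrant: dict[str, list[int]]) -> list[Finding]:
    findings = [analyze("full")]                  # 1. the whole site
    for quadrant, teeth in teeth_by_quadrant.items():
        findings.append(analyze(quadrant))        # 2. each neighborhood
        for tooth in teeth:                       # 3. each house
            findings.append(analyze(f"tooth-{tooth}"))
    return findings

evidence = gather_evidence({"upper-left": [21, 22], "lower-left": [31, 32]})
print(len(evidence))  # 1 full view + 2 quadrants + 4 teeth = 7
```

Because every tooth is visited inside its quadrant, which is visited inside the full view, nothing can be skipped by accident.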
The "Specialist Toolbox" (The Team):
- The foreman has a toolbox filled with different experts.
- The Map-Maker: Knows exactly where every tooth is located, using the standard FDI numbering system dentists use (in which "#38" is the lower-left wisdom tooth).
- The Detector: Uses laser eyes to spot cavities or bone loss.
- The Mathematician: Measures distances (e.g., "Is this tooth root too close to the nerve?").
- The Panel of Experts: A group of different AI models (like a panel of doctors) who all look at the same spot and give their opinion.
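One common way to build such a toolbox is a registry the orchestrator dispatches into. The sketch below is an assumption about the pattern, not the paper's interfaces; the tool names and return strings are made up for illustration.

```python
# A minimal tool-registry sketch: the foreman looks up a specialist by
# name and delegates, rather than doing the work himself.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}

def tool(name: str):
    """Register a specialist under a name the orchestrator can call."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("locate")
def locate(tooth: str) -> str:            # the Map-Maker
    return f"{tooth} is in the lower-left quadrant"

@tool("detect")
def detect(tooth: str) -> str:            # the Detector
    return f"no caries found on {tooth}"

@tool("measure")
def measure(tooth: str) -> str:           # the Mathematician
    return f"root of {tooth} is 2.1 mm from the nerve"

def run(task: str, tooth: str) -> str:
    return TOOLS[task](tooth)             # the foreman only delegates

print(run("locate", "tooth #38"))
```

The payoff of this design is that adding a new specialist is one decorator away; the foreman's dispatch logic never changes.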
The "Judge" (Consensus Subagent):
- This is the most important part. Sometimes, the "Panel of Experts" might disagree. One might say, "That's a cavity," while another says, "No, that's just a shadow."
- The Judge doesn't just guess. It looks at the hard data from the "Map-Maker" and "Detector." If the Map-Maker says, "That tooth is #38," but the Experts argue about the number, the Judge uses the Map-Maker's coordinates to settle the argument.
- It only writes down a diagnosis if enough experts agree and the hard data supports it. This stops the AI from "hallucinating" (making things up).
The New Report Card: OPG-Bench
The authors also argue that the usual way of testing these AIs was broken. Usually, we ask an AI, "Do you see a cavity?" and it says "Yes" or "No."
- The Flaw: If the AI misses a cavity because you didn't ask about it, or if it invents a cavity where there isn't one, the old tests wouldn't catch it.
- The Fix: The authors created OPG-Bench. Instead of asking questions, they ask the AI to write a structured report, just like a real dentist does.
- Format: "Location (Tooth #38), Field (Caries), Value (Moderate)."
- This forces the AI to be precise. It can't just ramble; it has to prove exactly where the problem is and what it is.
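To make the "structured report" idea concrete, here is a minimal sketch of the Location/Field/Value triple from the example above, with an exact-match precision score. The field names and scoring rule are my assumptions about how such a benchmark could work, not OPG-Bench's actual metric.

```python
# One finding = one (location, field, value) triple, as in the example
# "Location (Tooth #38), Field (Caries), Value (Moderate)".
from dataclasses import dataclass

@dataclass(frozen=True)
class ReportEntry:
    location: str   # which tooth, e.g. "Tooth #38"
    field: str      # what kind of finding, e.g. "Caries"
    value: str      # the graded result, e.g. "Moderate"

def precision(predicted: list[ReportEntry],
              reference: list[ReportEntry]) -> float:
    """An entry counts only if all three parts match exactly,
    so vague or invented findings drag the score down."""
    hits = len(set(predicted) & set(reference))
    return hits / max(len(predicted), 1)

gold = [ReportEntry("Tooth #38", "Caries", "Moderate")]
pred = [ReportEntry("Tooth #38", "Caries", "Moderate"),
        ReportEntry("Tooth #11", "Caries", "Severe")]  # an invented finding
print(precision(pred, gold))  # 0.5: the invented entry is penalized
```

This is exactly the flaw the yes/no tests couldn't catch: under exact-match scoring, an invented cavity costs points instead of going unnoticed.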
The Result
When they tested OPGAgent:
- It was more accurate than the smartest "Generalist" AIs.
- It was more reliable than the "Specialist" AIs because it could handle the whole picture.
- It made fewer mistakes (hallucinations) because the "Judge" checked the work of the "Experts."
In short: OPGAgent is like replacing a single, overworked intern with a well-organized dental clinic. You have a manager who knows the layout, a team of specialists who each do their one job well, and a strict quality-control manager who double-checks the final report before it goes to the patient.