Imagine you are trying to hire a new doctor for your clinic. How would you test them?
In the past, you might have handed them a thick stack of multiple-choice questions (like a final exam in medical school). If they got 95% of the answers right, you'd hire them.
But here's the problem: Real life isn't a multiple-choice test. A real patient doesn't walk in and say, "I have a headache, a fever, and a rash. What is my diagnosis?" Instead, they say, "My head hurts, and I feel weird." It's up to the doctor to ask the right follow-up questions, look at the rash, check the fever, and figure out what's missing.
This paper introduces Doctorina MedBench, a new way to test AI doctors that is less like a written exam and more like a role-playing game or a simulator.
Here is the breakdown of how it works, using simple analogies:
1. The "Acting" Patient (The Simulation)
Instead of a static test, the system creates a virtual patient.
- The Old Way: The test gives the AI a full report card with all the symptoms listed.
- The Doctorina Way: The virtual patient is an actor. They are shy, forgetful, or maybe they think they know everything. They won't tell you they have a heart condition unless you specifically ask, "Do you have a history of heart problems?"
- The Goal: The AI doctor has to play the role of a detective. It must ask the right questions to get the information. If the AI just guesses without asking, it fails.
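The "acting patient" idea above can be sketched in a few lines of code. This is a hypothetical illustration, not the paper's actual implementation: the class name, the keyword-trigger mechanism, and all the dialogue strings are made up to show the core behavior, which is that hidden facts are only revealed when the doctor asks a matching question.

```python
# Hypothetical sketch of a simulated patient that withholds facts
# until the doctor asks a matching question. All names, keywords,
# and dialogue here are illustrative, not from the paper.

class SimulatedPatient:
    def __init__(self, opening_line, hidden_facts):
        self.opening_line = opening_line  # vague complaint, volunteered for free
        self.hidden_facts = hidden_facts  # {trigger_keyword: fact}, revealed only on request

    def respond(self, doctor_question):
        q = doctor_question.lower()
        for keyword, fact in self.hidden_facts.items():
            if keyword in q:
                return fact  # the right question unlocks the fact
        return "I'm not sure, my head just hurts."  # default evasive answer

patient = SimulatedPatient(
    opening_line="My head hurts, and I feel weird.",
    hidden_facts={
        "heart": "Actually, yes, I had a heart attack two years ago.",
        "rash": "Now that you mention it, there is a red rash on my arm.",
        "fever": "I measured 39.2 C this morning.",
    },
)

print(patient.respond("Do you have a history of heart problems?"))
print(patient.respond("What day is it?"))
```

An AI that never asks about the heart never learns about the heart attack, which is exactly the failure mode a static multiple-choice exam can't expose.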
2. The Scorecard: D.O.T.S.
How do we grade the AI? The authors created a scorecard called D.O.T.S. Think of it like grading a restaurant on four separate categories rather than giving it one overall star rating:
- D (Diagnosis): Did the AI figure out what's wrong? (Did it guess "Flu" or "Broken Leg" correctly?)
- O (Observations): Did the AI order the right tests? (Did it ask for an X-ray when it was needed, or did it waste money on a test that wasn't necessary?)
- T (Treatment): Did the AI give safe advice? (This is the most important one. If the patient is allergic to penicillin, the AI must not prescribe penicillin. If it does, it gets a zero.)
- S (Step Count): Was the conversation efficient? Did the AI ask 50 unnecessary questions before giving an answer, or did it get to the point quickly?
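The four categories can be turned into a toy scoring function. To be clear, the weights, formulas, and parameter names below are assumptions for illustration; the paper's actual rubric is not reproduced here. The sketch does encode the one hard rule stated above: a safety violation (prescribing a drug the patient is allergic to) zeroes the Treatment score.

```python
# Illustrative D.O.T.S. scorer. Formulas and names are assumptions;
# only the "safety violation = zero" rule comes from the text above.

def dots_score(diagnosis_correct, tests_ordered, tests_needed,
               prescribed, allergies, steps_taken, step_budget):
    # D: did it get the diagnosis right?
    d = 1.0 if diagnosis_correct else 0.0
    # O: reward needed tests, penalize wasted ones (Jaccard overlap)
    needed, ordered = set(tests_needed), set(tests_ordered)
    o = len(needed & ordered) / max(len(needed | ordered), 1)
    # T: hard zero on a safety violation, full marks otherwise
    t = 0.0 if set(prescribed) & set(allergies) else 1.0
    # S: efficiency relative to a step budget, capped at 1.0
    s = min(1.0, step_budget / max(steps_taken, 1))
    return {"D": d, "O": o, "T": t, "S": s}

print(dots_score(
    diagnosis_correct=True,
    tests_ordered=["x-ray"], tests_needed=["x-ray"],
    prescribed=["penicillin"], allergies=["penicillin"],
    steps_taken=12, step_budget=10,
))
```

Note that in this example the AI nails the diagnosis and the tests but still scores T = 0, because it prescribed penicillin to a penicillin-allergic patient. Keeping the categories separate, instead of averaging them into one number, is what makes that kind of dangerous-but-smart behavior visible.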
3. The "Trap" Doors (Safety Checks)
The system includes "trap cases." These are scenarios designed to trick the AI.
- Example: A patient says, "I can't be pregnant because I had my tubes tied," but they are actually showing classic pregnancy symptoms.
- A smart AI doctor will say, "Let's run a test just to be sure." A dumb AI might say, "Okay, you're not pregnant," and miss a life-threatening situation. The system catches these errors instantly.
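Grading a trap case can be as simple as checking for one required action. This tiny sketch is an assumption about how such a check might work, using the pregnancy example from above; the function and field names are invented for illustration.

```python
# Hypothetical trap-case grader: the patient asserts something
# ("I can't be pregnant"), and the AI passes only if it ordered the
# confirmatory test anyway. Names are illustrative, not the paper's.

def passes_trap(tests_ordered, required_confirmatory_test):
    # The trap is sprung the moment the required test is missing.
    return required_confirmatory_test in tests_ordered

# Smart AI: runs the test "just to be sure" despite the claim.
assert passes_trap(["pregnancy test", "blood count"], "pregnancy test")
# Careless AI: takes the patient at their word and skips it.
assert not passes_trap(["blood count"], "pregnancy test")
```

Because the check is mechanical, it can flag the error the instant the transcript ends, which is what "catches these errors instantly" amounts to in practice.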
4. The "Stress Test" (Continuous Monitoring)
Imagine you have a car. You don't just test it once when you buy it; you check the oil, brakes, and tires regularly.
Doctorina does this for AI. It runs thousands of these "role-play" scenarios every day, even while the AI is being used by real people. If the AI starts making mistakes or gets "lazy" (skipping questions), the system sounds an alarm and stops the AI from being used until it's fixed.
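The monitoring loop described above behaves like a circuit breaker: keep re-running scripted role-plays and pull the model from service if scores drift. A minimal sketch, assuming a made-up threshold and stubbed scenario runner (neither comes from the paper):

```python
# Sketch of the continuous "stress test": score a batch of role-play
# scenarios and trip an alarm if the average drops. The threshold
# and scenario names are assumptions for illustration.

FAIL_THRESHOLD = 0.8  # assumed minimum acceptable average score

def monitor(run_scenario, scenarios):
    scores = [run_scenario(s) for s in scenarios]
    avg = sum(scores) / len(scores)
    if avg < FAIL_THRESHOLD:
        return "ALARM: model pulled from service"
    return "OK"

# A stub standing in for the real simulator: a healthy model
# scoring 0.95 on every scenario passes the check.
print(monitor(lambda s: 0.95, ["chest pain", "headache", "rash"]))
```

In a real deployment the scenario runner would be the full patient simulation plus the D.O.T.S. scorer, and the "alarm" would gate traffic to the model rather than just return a string, but the shape of the loop is the same.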
5. The Big Surprise: AI vs. Humans vs. "Raw" AI
The paper compared three groups:
- Real Doctors: The experts.
- "Raw" AI (like a standard GPT-5): An AI that just reads a prompt like "Pretend you are a doctor."
- Doctorina: The AI trained specifically with this new simulation method.
The Results:
- On the "Multiple Choice" tests: The "Raw" AI and Real Doctors both did great.
- On the "Role-Play" simulation: The "Raw" AI crashed. It was too lazy to ask questions and often missed the diagnosis.
- Doctorina: It performed almost as well as the real doctors and much better than the "Raw" AI.
The Bottom Line
The paper argues that passing a written test doesn't mean you can practice medicine.
Just because an AI can memorize a textbook doesn't mean it can talk to a confused patient, figure out what they really mean, and keep them safe. Doctorina MedBench is a new "driving simulator" for AI doctors. It proves that to build a safe, helpful AI doctor, we need to test them on how they talk and think, not just how well they can memorize answers.