Monte Carlo Committee Simulation with Large Language… — Plain-Language Explanation

Original authors: Janoudi, G., Rada (Uzun), M., Yasinov, E., Richter, T.

Published 2026-03-03

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a pharmaceutical company trying to get a new, life-saving drug approved for sale in Canada. You have to send a massive, complex dossier of medical and economic data to a government committee (the CDA-AMC). This committee acts like a high-stakes jury. They review your evidence and decide:

Do we pay for this drug?
If yes, what strict rules must we attach? (e.g., "Only for patients who failed other treatments," or "Only if the price is cut by 20%").

For years, predicting this jury's decision has been like trying to guess the weather by looking at a single cloud. It's hard, the data is messy, and the rules change.

This paper introduces a new, clever tool called Monte Carlo Committee Simulation. Here is how it works, explained simply:

1. The Problem: One Brain vs. A Jury

Traditional computer programs try to predict the outcome by looking for simple patterns (like "if the drug is cheap, they say yes"). But these committees are complex. They have doctors, economists, and patient advocates who argue, debate, and look at things from different angles. A single computer program can't capture that human drama.

Also, big AI models (like the ones used here) are like students who might have memorized the answers to old exams. If you ask them about a drug they've seen before, they might just be "reciting" the answer rather than actually thinking about it.

2. The Solution: The "Virtual Jury"

The authors built a system that doesn't just ask one AI for an answer. Instead, it creates a virtual committee of 14 AI panelists.

The Characters: Imagine a room with 14 different people. Some are "Patient Advocates," some are "Hard-nosed Economists," some are "Senior Doctors," and some are "Policy Experts."
The Personas: Each AI panelist is given a specific "persona" (a personality and a specific way of thinking). The "Economist" looks at the price tag; the "Doctor" looks at the side effects; the "Patient" looks at quality of life.
The Simulation: The system feeds the drug's evidence to all 14 of them. They all cast a vote.
The Twist: The system runs this voting process many times (like rolling dice up to 50 times). Sometimes the "Economist" is grumpy and votes "No," other times they are happy and vote "Yes." This captures the natural uncertainty of human debate.

3. The "Neurosymbolic" Magic: Brain + Rules

The paper calls this a "Neurosymbolic" approach. Think of it as a Brain (Neural) working inside a Rulebook (Symbolic).

The Brain (LLMs): The AI panelists use their "brain" to read the messy, long documents and understand the nuance, just like a human expert.
The Rulebook: The system doesn't just take a simple average. It uses strict math rules to count the votes, weigh them (some experts count for more), and decide if the group has reached a solid agreement or if they are still fighting.

4. Knowing When to Shut Up (Uncertainty)

This is the most important part. A normal AI might confidently say, "This drug will be approved!" even if it's guessing.

This system has a confidence meter.

High Confidence: If all 14 virtual panelists agree strongly, the system says, "I'm 96% sure this will pass with these specific conditions."
Low Confidence: If the panelists are split 50/50 and arguing, the system says, "I don't know. This is a tough case. Don't trust my guess; a human needs to look at this."

In the study, when the system said "I'm not sure," it was usually right that the case was difficult. When it said "I'm sure," it was right 93% of the time.

5. The "Fresh Data" Test

To prove the AI wasn't just cheating by memorizing old answers, the researchers tested it on brand new drug cases that were released after the AI had finished its training.

Analogy: It's like giving a student a test on a topic they learned in 2024, but the test questions are from 2025. If they get it right, they actually understood the material; they didn't just memorize the textbook.
Result: The system held up on this test, which suggests it can reason about new, unseen medical evidence rather than recite memorized answers.

6. Predicting the "Fine Print"

Most systems just guess "Yes" or "No." This system also predicts the conditions.

Instead of just saying "Yes, we'll pay for it," it says: "Yes, but only if you restrict it to patients with a specific gene, and you lower the price by 15%."
It got the exact combination of these rules right about 49% of the time. While that sounds low, each of the 5 condition types can be present or absent, which gives 32 possible combinations, so a random guess would be right only about 3% of the time. The system does far better than that.

The Bottom Line

This paper shows that we can use AI to simulate a complex human committee to predict drug approval outcomes.

For Drug Companies: It's like having a crystal ball that tells you, "You will likely get approved, but expect to cut your price and limit who can take the drug." This helps them prepare their strategy early.
For the System: It doesn't replace the human committee. Instead, it acts as a warning system. It tells the humans, "Hey, this case is tricky and the AI is unsure; you should spend extra time reviewing this one."

It turns a reactive process (waiting for the decision) into a proactive one (preparing for the likely outcome).

1. Problem Statement

Health Technology Assessment (HTA) agencies, such as Canada's Drug Agency (CDA-AMC), determine reimbursement recommendations for new therapies. These decisions are complex, relying on unstructured clinical, economic, and patient data.

Limitations of Traditional ML: Existing machine learning approaches require extensive manual feature engineering, struggle with small datasets (dozens to hundreds of cases annually), and typically predict only categorical outcomes (e.g., "Reimburse" vs. "Do Not Reimburse") without capturing specific conditions (e.g., price reductions, population restrictions).
Limitations of Standard LLMs: While Large Language Models (LLMs) can process unstructured text, they suffer from epistemic opacity (inability to quantify uncertainty) and data contamination risks (memorizing historical outcomes rather than reasoning). A single LLM prompt cannot capture the stochastic variability inherent in human committee deliberation.
The Gap: There is no prospective method to predict not just if a drug will be reimbursed, but under what specific conditions, with calibrated confidence intervals that distinguish between high-confidence predictions and uncertain guesses.

2. Methodology: Monte Carlo Committee Simulation

The authors propose a Neurosymbolic AI framework that simulates a multi-panelist committee to generate probabilistic predictions.

A. System Architecture (Neurosymbolic)

Neural Component: 14 persona-conditioned LLM panelists act as "experts."
- Panelist Types: 7 distinct personas (Patient/Public, Health Economics, Policy, Clinical, ITC Specialist, Senior Clinical, General) modeled on CDA-AMC committee structures.
- Model Configuration: A mixed-model ensemble using GPT-5 (for complex, structured reasoning) and GPT-5-mini (for simplified role-focused reasoning). Each persona has two instances (one per model) to introduce model diversity and reduce correlated errors.
Symbolic Component: Formal voting rules and statistical aggregation.
- Weighted Voting: Structured-prompt panelists (GPT-5) have a weight of 2.0; simplified-prompt panelists (GPT-5-mini) have a weight of 1.0.
- Aggregation: Predictions are aggregated via weighted plurality voting.
- Convergence: The system runs a Monte Carlo simulation (multiple rounds) until the probability distribution stabilizes (checked every 5 rounds) or a maximum of 50 rounds is reached.
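The aggregation loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `stub_vote` is a hypothetical stand-in for an LLM panelist call, the convergence test (largest change in win shares below `tol` between two consecutive checks) is an assumed criterion, and `tol` and `seed` are made-up parameters. Only the 14-panelist layout, the 2.0/1.0 weights, the check every 5 rounds, and the 50-round cap come from the summary.

```python
import random
from collections import Counter

LABELS = ("R", "RWC", "DNR")
PERSONAS = ("Patient/Public", "Health Economics", "Policy", "Clinical",
            "ITC Specialist", "Senior Clinical", "General")

# 7 personas x 2 models = 14 panelists; weights as stated in the summary.
panelists = [
    {"persona": persona, "model": model, "weight": weight}
    for persona in PERSONAS
    for model, weight in (("GPT-5", 2.0), ("GPT-5-mini", 1.0))
]

def weighted_plurality(votes):
    """votes: list of (label, weight). Returns (winner, weighted share per label)."""
    totals = Counter()
    for label, weight in votes:
        totals[label] += weight
    total_weight = sum(totals.values())
    shares = {label: totals[label] / total_weight for label in LABELS}
    winner = max(LABELS, key=lambda label: shares[label])
    return winner, shares

def simulate(panelists, cast_vote, max_rounds=50, check_every=5, tol=0.02, seed=0):
    """Repeat committee votes until the win distribution stabilizes (assumed rule)."""
    rng = random.Random(seed)
    wins = Counter()
    previous = None
    for round_no in range(1, max_rounds + 1):
        votes = [(cast_vote(p, rng), p["weight"]) for p in panelists]
        winner, _ = weighted_plurality(votes)
        wins[winner] += 1
        if round_no % check_every == 0:
            current = {label: wins[label] / round_no for label in LABELS}
            if previous is not None:
                drift = max(abs(current[l] - previous[l]) for l in LABELS)
                if drift < tol:
                    return current, round_no
            previous = current
    return {label: wins[label] / max_rounds for label in LABELS}, max_rounds

def stub_vote(panelist, rng):
    # Hypothetical placeholder for an LLM call: a random draw leaning toward RWC.
    return rng.choices(LABELS, weights=(0.15, 0.70, 0.15))[0]

distribution, rounds_used = simulate(panelists, stub_vote)
print(distribution, rounds_used)
```

Because the check only compares two consecutive checkpoints, the earliest possible stop in this sketch is round 10. A stricter rule (for example, requiring several stable checks in a row) would trade more LLM calls for more stable estimates.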

B. Prediction Framework

The system operates on two levels:

Recommendation Prediction: Predicts the category: Reimburse (R), Reimburse with Conditions (RWC), or Do Not Reimburse (DNR).
Condition Prediction: For RWC cases, it predicts the presence of 5 specific condition categories:
- Population Restrictions
- Prescriber/Setting Requirements
- Continuation Conditions
- Economic Conditions
- Evidence Conditions

C. Uncertainty Quantification & Abstention

The system employs a Two-Axis Uncertainty Model to determine confidence:

Stability (Inter-round): Measures if the winner flips across simulation rounds.
Contestation (Intra-round): Measures how close the vote is within a single round.
Strength of Mandate: Classifies predictions into High, Contested, or Weak based on metrics like final_support and vote_margin.
Selective Prediction: The system abstains (refuses to predict) if uncertainty exceeds pre-specified thresholds (e.g., final_support < 0.60), allowing users to trade coverage for accuracy.
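One way the two axes and the abstention rule might be combined is sketched below. The definitions here are assumptions made for illustration: `stability` is taken as the share of rounds won by the overall winner, `final_support` as the winner's mean weighted vote share, and `vote_margin` as its mean lead over the best alternative. Only the `final_support < 0.60` threshold appears in the summary; `margin_threshold`, `stability_threshold`, and the mapping of "Weak" to abstention are hypothetical.

```python
from collections import Counter
from statistics import mean

def assess_mandate(rounds, support_threshold=0.60,
                   margin_threshold=0.20, stability_threshold=0.80):
    """rounds: list of dicts mapping label -> weighted vote share in that round."""
    round_winners = [max(shares, key=shares.get) for shares in rounds]
    winner, win_count = Counter(round_winners).most_common(1)[0]

    # Inter-round axis: how often the overall winner also wins a single round.
    stability = win_count / len(rounds)
    # Intra-round axis: how large and how clear the winner's share is per round.
    final_support = mean(shares[winner] for shares in rounds)
    vote_margin = mean(
        shares[winner] - max(v for k, v in shares.items() if k != winner)
        for shares in rounds
    )

    if final_support < support_threshold:
        level = "Weak"          # assumed to trigger abstention
    elif vote_margin < margin_threshold or stability < stability_threshold:
        level = "Contested"
    else:
        level = "High"

    return {
        "winner": winner,
        "stability": stability,
        "final_support": final_support,
        "vote_margin": vote_margin,
        "mandate": level,
        "abstain": level == "Weak",
    }

clear_case = [{"R": 0.1, "RWC": 0.8, "DNR": 0.1}] * 10
split_case = ([{"R": 0.05, "RWC": 0.50, "DNR": 0.45}] * 6
              + [{"R": 0.05, "RWC": 0.45, "DNR": 0.50}] * 4)
print(assess_mandate(clear_case)["mandate"])
print(assess_mandate(split_case)["mandate"])
```

Raising `support_threshold` makes the system abstain more often, which is the coverage-for-accuracy trade described above.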

D. Study Design & Validation

Temporal External Validation: The study used a strict temporal split to prevent data contamination.
- Training/Calibration: Models were trained on data prior to their knowledge cutoffs (GPT-5: Sept 30, 2024; GPT-5-mini: May 31, 2024).
- Test Set: Recommendations published by CDA-AMC between October 2024 and December 2025 (n=67). This ensures the models could not have memorized the outcomes, forcing genuine reasoning.
Ground Truth: Derived from official CDA-AMC structured data exports.
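The temporal split can be illustrated with a small filter. This is a sketch under assumptions: the case records and their `id` and `published` fields are hypothetical, and the window boundaries (1 October 2024 and 31 December 2025) are one reading of "between October 2024 and December 2025". The cutoff dates are the ones given above.

```python
from datetime import date

# Knowledge cutoffs as reported in the summary.
MODEL_CUTOFFS = {
    "GPT-5": date(2024, 9, 30),
    "GPT-5-mini": date(2024, 5, 31),
}

def is_post_cutoff(published, cutoffs=MODEL_CUTOFFS):
    """True only if the date is later than every model's knowledge cutoff."""
    return published > max(cutoffs.values())

def temporal_test_set(cases, start=date(2024, 10, 1), end=date(2025, 12, 31)):
    """Keep cases inside the evaluation window that no model could have seen."""
    return [
        case for case in cases
        if start <= case["published"] <= end and is_post_cutoff(case["published"])
    ]

# Hypothetical records, for illustration only.
cases = [
    {"id": "A", "published": date(2024, 9, 30)},   # on the GPT-5 cutoff: excluded
    {"id": "B", "published": date(2024, 10, 1)},   # first eligible day
    {"id": "C", "published": date(2025, 12, 31)},  # last eligible day
    {"id": "D", "published": date(2026, 1, 1)},    # outside the window
]
print([case["id"] for case in temporal_test_set(cases)])
```

Comparing against the latest cutoff across models matters because the ensemble mixes two models: a case that postdates only the GPT-5-mini cutoff could still have been seen by GPT-5.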

3. Key Results

Recommendation Prediction

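Accuracy: On the 44 cases where the system expressed confidence (non-abstained), accuracy was 93.2% (95% CI: 84.1–100.0%), slightly above the majority-class baseline (91.8%). Because the confidence interval contains the baseline, accuracy alone does not separate the system from always guessing the most common outcome; the discrimination and calibration results below carry more of the evidence.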
Discrimination: The system achieved an AUROC of 0.817, demonstrating genuine ability to distinguish between classes, unlike the baseline (AUROC 0.50).
Calibration: Expected Calibration Error (ECE) was low (0.091), indicating reliable probability estimates.
Abstention Efficacy: The system abstained on 10.2% of cases (5 submissions). Analysis showed these abstained cases had only 40.0% accuracy, confirming the system correctly identified its own uncertainty.
Mandate Stratification:
- High Mandate: 96.8% accuracy.
- Contested: 84.6% accuracy.
- Weak Mandate: 40.0% accuracy.
- 83.3% of all errors occurred in cases flagged as uncertain.
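The selective-prediction and calibration figures above can be reproduced in form with a few lines of code. Two caveats apply. First, the per-case counts below (41 of 44 retained cases correct, 2 of 5 abstained cases correct) are not reported in the summary; they are reconstructed from the stated percentages and should be read as an assumption. Second, the ECE function uses the common equal-width binning with 10 bins, which may differ from the authors' choice.

```python
def selective_metrics(records):
    """records: list of (correct, abstained) boolean pairs."""
    kept = [correct for correct, abstained in records if not abstained]
    dropped = [correct for correct, abstained in records if abstained]
    return {
        "coverage": len(kept) / len(records),
        "accuracy_kept": sum(kept) / len(kept) if kept else None,
        "accuracy_abstained": sum(dropped) / len(dropped) if dropped else None,
    }

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(conf for conf, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / total * abs(accuracy - mean_conf)
    return ece

# Counts reconstructed from the reported percentages (assumption, see lead-in).
records = (
    [(True, False)] * 41 + [(False, False)] * 3    # retained: 41 of 44 correct
    + [(True, True)] * 2 + [(False, True)] * 3     # abstained: 2 of 5 correct
)
metrics = selective_metrics(records)
print({name: round(value, 3) for name, value in metrics.items()})

# Toy calibration example: 90% stated confidence, 75% observed accuracy.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Under these reconstructed counts, 5 of the 6 total errors would need to fall in the Contested and Weak groups to match the 83.3% figure, which is consistent with the stratified accuracies listed above.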

Condition Prediction

Subset Accuracy: The system correctly predicted the exact combination of all 5 condition categories in 48.8% of cases, well above the uniform random baseline of 3.1% (1 in $2^5 = 32$ possible combinations).
Hamming Accuracy: 86.3% (proportion of correctly predicted individual condition labels).
Per-Category Performance:
- Economic Conditions: 97.6% accuracy.
- Population Restrictions: 90.2% accuracy.
- Continuation Conditions: 68.3% accuracy (but highest discriminative ability with AUROC 0.896).
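The difference between subset accuracy and Hamming accuracy is easiest to see on a toy example. The sketch below uses made-up label vectors (not study data) and hypothetical category identifiers. Each case is a tuple of five 0/1 flags, one per condition category.

```python
CATEGORIES = ("population", "prescriber_setting", "continuation",
              "economic", "evidence")

def subset_accuracy(y_true, y_pred):
    """Share of cases where all five flags match exactly."""
    exact = sum(truth == pred for truth, pred in zip(y_true, y_pred))
    return exact / len(y_true)

def hamming_accuracy(y_true, y_pred):
    """Share of individual flags that match, pooled over cases and categories."""
    matches = sum(
        t_flag == p_flag
        for truth, pred in zip(y_true, y_pred)
        for t_flag, p_flag in zip(truth, pred)
    )
    return matches / (len(y_true) * len(CATEGORIES))

# Uniform guessing over all present/absent combinations: 1 in 2^5.
random_baseline = 1 / 2 ** len(CATEGORIES)

# Made-up example: the second prediction misses one flag.
y_true = [(1, 0, 1, 1, 0), (1, 1, 0, 1, 0)]
y_pred = [(1, 0, 1, 1, 0), (1, 0, 0, 1, 0)]

print(subset_accuracy(y_true, y_pred))
print(hamming_accuracy(y_true, y_pred))
print(random_baseline)
```

A single wrong flag costs the whole case under subset accuracy but only one fifth of a case under Hamming accuracy, which is why the two reported figures (48.8% and 86.3%) can sit so far apart. The uniform baseline also assumes all 32 combinations are equally likely, which real recommendations probably are not.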

4. Key Contributions

First Prospective Condition Prediction: This is the first study to prospectively predict the specific conditions attached to HTA recommendations, moving beyond simple binary classification.
Neurosymbolic Committee Simulation: Introduces a novel architecture combining neural reasoning (LLM personas) with symbolic aggregation (weighted voting, convergence criteria) to simulate human deliberation and quantify uncertainty.
Robust Temporal Validation: Addresses the critical issue of LLM data contamination by validating on data strictly post-dating the models' knowledge cutoffs, which supports the claim that the system reasons rather than memorizes.
Actionable Uncertainty: Provides a "Strength of Mandate" metric that allows stakeholders to filter predictions by confidence, enabling a human-in-the-loop workflow where low-confidence cases are flagged for manual review.

5. Significance and Implications

For Pharmaceutical Sponsors: Enables proactive market access strategies. Sponsors can anticipate specific negotiation hurdles (e.g., "price reduction required" or "prior therapy restriction") and allocate resources to address them before submission.
For HTA Agencies: Serves as a decision-support tool for workload planning and consistency checking, identifying cases likely to require complex deliberation.
For AI in Healthcare: Demonstrates that LLMs can be deployed in high-stakes, low-data domains if paired with uncertainty quantification and rigorous temporal validation. It shifts the paradigm from "black box" prediction to "calibrated forecasting aid."

Conclusion: The Monte Carlo Committee Simulation successfully bridges the gap between unstructured evidence and structured policy outcomes. By simulating a diverse committee and rigorously quantifying uncertainty, the system offers a reliable, interpretable, and actionable forecasting tool for drug reimbursement, validated on data the models could not have previously seen.

Monte Carlo Committee Simulation with Large Language Models for Predicting Drug Reimbursement Recommendations and Conditions: A Novel Neurosymbolic AI Approach