
Pneumonia Detection in Paediatric Chest X-Rays using Ensembled Large Language Models

This study demonstrates that a soft voting ensemble of MedGemma-4B-it large language models significantly improves diagnostic accuracy and discriminatory performance for paediatric pneumonia detection in chest X-rays compared to individual agents, offering a promising privacy-preserving tool for clinical triage and decision support.

Original authors: Tan, J., Tang, P. H.

Published 2026-04-12


CC BY 4.0


Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine a busy paediatric emergency room where doctors are stretched thin. They have thousands of chest X-rays to check for pneumonia in children, but there aren't enough specialist radiologists to read them all quickly. This delay can be dangerous.

This paper introduces a new "digital team" designed to help speed things up. Here is the story of how they built it, using some simple analogies:

The Problem: One Doctor vs. A Crowd

Usually, we rely on a single, highly trained AI (called a Multimodal Large Language Model or MLLM) to look at an X-ray and say, "Yes, that's pneumonia," or "No, it's clear."

Think of this single AI like one expert detective. Even the best detective can miss a tiny clue or misjudge a case. On their own, these language-based AIs still lag behind purpose-built image classifiers in diagnostic accuracy. The researchers wanted to know: What if we didn't just hire one detective, but hired a whole team?

The Experiment: The "Council of 15"

The researchers took 2,300 chest X-rays from two different hospitals. Instead of asking one AI to make the call, they asked 15 independent AI detectives (all copies of the same model, MedGemma-4B-it) to look at each X-ray separately.

Each detective had five options to choose from, ranging from "Definitely Pneumonia" to "Definitely Clear."

The Strategy: How to Listen to the Team

Once the 15 detectives gave their opinions, the researchers tried three different ways to decide the final answer, and compared each one with how an average single detective performed:

The "Majority Vote": Whatever answer the most detectives pick wins. (Like a class vote where the side with the most hands raised wins.)
The "Soft Vote" (The Winner): This is the cleverest method. Instead of just counting votes, it averages how confident the detectives are in each answer.
- Analogy: Imagine 8 detectives say "It's probably clear," but with only 51% confidence, while 7 detectives say "It's definitely pneumonia" with 99% confidence. A simple majority vote would call the X-ray clear, 8 to 7. The "Soft Vote" weights strong convictions more heavily than weak ones, so it would flag pneumonia.
The "Senior Detective": A larger AI model (GPTOSS-20B) reads all 15 opinions and synthesizes them into a final verdict.

The Results: The Team Wins

The "Soft Vote" strategy was the clear champion. It was significantly better at:

Getting it right: It classified X-rays correctly more often than the average single detective did.
Avoiding false alarms: It was very good at saying "No pneumonia" when the lungs were actually clear (high specificity).
Consistency: The team's answers agreed with the true diagnosis more reliably than a single agent's did.

Statistically, this wasn't just a lucky fluke. On the study's main measure, a gap this large would be expected less than 1 time in 1,000 if the team were really no better than a single detective.

Why This Matters for You

This isn't just about math; it's about real-world impact.

Speed: This system can work in "near real-time," meaning a doctor in a busy ER could get a second opinion within moments.
Privacy: The system is designed to keep patient data safe.
Communication: Because these are "Language Models," they don't just give a "Yes/No" answer. They can explain why they think it's pneumonia in plain English, helping both doctors and worried parents understand the situation.

The Bottom Line:
By turning a single AI into a "committee" of 15 and using a smart way to tally their votes (Soft Voting), the researchers created a more reliable assistant. This tool acts like a safety net, helping doctors catch dangerous pneumonia cases quickly while avoiding unnecessary panic for clear cases. It's a step toward a future where every child gets a top-tier diagnosis, no matter how busy the hospital is.

1. Problem Statement

Paediatric pneumonia remains a primary cause of morbidity and mortality globally. While Chest X-rays (CXR) are the standard diagnostic tool, a critical bottleneck exists: a shortage of specialist radiologists leads to significant delays in reporting. Although Multimodal Large Language Models (MLLMs) offer the unique advantage of not only analyzing images but also communicating findings to both clinicians and laypersons, they currently lag behind state-of-the-art deep learning classifiers in diagnostic accuracy. The core challenge addressed is how to bridge this performance gap to enable reliable, real-time clinical decision support.

2. Methodology

The study employed a retrospective cohort design utilizing 2,300 paediatric CXRs sourced from two independent hospital datasets. The technical approach focused on ensemble learning strategies applied to MLLMs:

Base Model: The study utilized MedGemma-4B-it, a specialized medical large language model.
Agent Configuration: Fifteen independent MedGemma-4B-it agents were deployed to classify each CXR.
Classification Task: Agents categorized images into five pneumonia likelihood categories.
Ensemble Strategies: Three aggregation methods were tested and compared against the performance of a single "average agent":
1. Majority Voting: Hard voting based on the most frequent class prediction.
2. Soft Voting: Aggregation based on the average probability scores across agents.
3. GPTOSS-20B Aggregation: A larger model used to synthesize the outputs of the smaller agents.
Evaluation Metrics:
- Primary: One-vs-Rest (OvR) Area Under the Receiver Operating Characteristic Curve (AUROC).
- Secondary: Accuracy, Sensitivity, Specificity, F1-score, Cohen's Kappa, and One-vs-One (OvO) AUROC.
- Validation: Statistical significance was assessed across both a "balanced" dataset and a "real-world" dataset.
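The difference between the majority and soft voting strategies listed above can be illustrated with a minimal sketch. It assumes each agent returns a probability distribution over the five likelihood categories; the category names, the tie-breaking rule, and the toy numbers are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

# Hypothetical names for the five likelihood categories (not taken from the paper).
CATEGORIES = [
    "definitely clear",
    "probably clear",
    "uncertain",
    "probably pneumonia",
    "definitely pneumonia",
]


def majority_vote(agent_probs):
    """Hard voting: each agent votes for its top category; the most frequent wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in agent_probs]
    counts = Counter(votes)
    best = max(counts.values())
    # Ties go to the lowest category index (an arbitrary choice for this sketch).
    return min(c for c, n in counts.items() if n == best)


def soft_vote(agent_probs):
    """Soft voting: average the probabilities per category, then pick the top one."""
    n_agents = len(agent_probs)
    n_classes = len(agent_probs[0])
    mean_probs = [sum(p[k] for p in agent_probs) / n_agents for k in range(n_classes)]
    winner = max(range(n_classes), key=mean_probs.__getitem__)
    return winner, mean_probs


# Toy input: 8 hesitant agents lean "probably clear",
# 7 confident agents say "definitely pneumonia".
hesitant = [0.05, 0.45, 0.30, 0.15, 0.05]
confident = [0.00, 0.02, 0.03, 0.10, 0.85]
agents = [hesitant] * 8 + [confident] * 7

hard = majority_vote(agents)
soft, mean_probs = soft_vote(agents)
print("majority vote:", CATEGORIES[hard])
print("soft vote:    ", CATEGORIES[soft])
```

In this toy case the majority vote follows the eight hesitant agents, while the soft vote follows the seven confident ones, because averaging preserves how strongly each agent holds its opinion. The averaged probabilities can also feed directly into threshold-free metrics such as AUROC.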

3. Key Contributions

Ensemble MLLM Framework: The paper demonstrates that ensembling multiple smaller, specialized MLLM agents can outperform individual agents and potentially rival traditional deep learning classifiers in medical imaging tasks.
Optimization of Aggregation: It identifies Soft Voting as the superior aggregation strategy for this specific task, outperforming both Majority Voting and larger model aggregation (GPTOSS-20B).
Dual-Functionality: The proposed system addresses the dual need for high diagnostic accuracy and natural language explainability, allowing for direct communication of findings to diverse stakeholders (clinicians and patients).
Privacy-Preserving Architecture: The system is designed to support near real-time decision-making while maintaining privacy, making it suitable for integration into emergency department workflows.

4. Results

The experimental results indicated that the Soft Voting ensemble strategy significantly outperformed the baseline average agent across both datasets:

OvR AUROC: Showed statistically significant improvements ($p_{\text{balanced}} = 0.0002$, $p_{\text{real-world}} = 0.0003$).
Accuracy: Significant gains were observed ($p_{\text{balanced}} = 0.0008$, $p_{\text{real-world}} < 0.0001$).
Cohen's Kappa: Improved agreement with ground truth ($p_{\text{balanced}} = 0.0006$, $p_{\text{real-world}} = 0.0054$).
OvO AUROC: Enhanced multi-class discrimination ($p_{\text{balanced}} < 0.0001$, $p_{\text{real-world}} = 0.0011$).
F1-Score: Improved significantly on the balanced dataset ($p = 0.0028$).

Notably, the system demonstrated high specificity, which is crucial for triage applications to minimize false positives while effectively flagging high-risk cases.
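For readers less familiar with the metrics reported above, the following sketch shows how Cohen's Kappa and a macro-averaged One-vs-Rest AUROC can be computed from scratch. It is an illustration under stated assumptions: the three-class toy data and the unweighted averaging are choices made here, and the summary does not say which tooling or averaging scheme the authors used.

```python
def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between predictions and labels, corrected for chance."""
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    expected = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (observed - expected) / (1 - expected)


def binary_auroc(is_positive, scores):
    """AUROC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def ovr_auroc(y_true, probs, n_classes):
    """One-vs-Rest AUROC: score each class against all the others, then take the unweighted mean."""
    per_class = [
        binary_auroc([t == k for t in y_true], [p[k] for p in probs])
        for k in range(n_classes)
    ]
    return sum(per_class) / n_classes


# Toy data with three classes for brevity (the study used five categories).
y_true = [0, 0, 1, 1, 2, 2]
probs = [
    [0.70, 0.20, 0.10],
    [0.50, 0.30, 0.20],
    [0.20, 0.60, 0.20],
    [0.40, 0.25, 0.35],  # a class-1 case that the ensemble gets wrong
    [0.10, 0.20, 0.70],
    [0.20, 0.30, 0.50],
]
y_pred = [max(range(3), key=p.__getitem__) for p in probs]

kappa = cohen_kappa(y_true, y_pred)
auc = ovr_auroc(y_true, probs, n_classes=3)
print(f"Cohen's kappa: {kappa:.3f}")
print(f"OvR AUROC:     {auc:.3f}")
```

In practice, scikit-learn's `cohen_kappa_score` and `roc_auc_score` (with `multi_class="ovr"` or `multi_class="ovo"`) provide equivalent, well-tested implementations.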

5. Significance

This research suggests that ensemble strategies can unlock the latent potential of MLLMs in medical imaging, narrowing their historical performance gap with traditional deep learning models. By combining improved diagnostic accuracy with explainable outputs, the system offers a possible response to radiologist shortages. It has the potential to be integrated into emergency departments to accelerate triage, reduce reporting delays, and improve patient outcomes through early detection of paediatric pneumonia, all while preserving data privacy.