This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
Imagine you are trying to teach a robot how to draw a perfect outline around a liver in a medical scan. The big question the researchers asked was: Is it better to show the robot a few perfect examples, or thousands of "okay" examples?
Here is the breakdown of their experiment using a simple analogy:
The Setup: The Art Class
Think of the AI model as a student in an art class, and the medical scans as the subjects they need to draw.
- The "Highly Curated" Group: This is like a student who only gets to study 244 drawings, but every single one was drawn by a master artist. The lines are perfect, the shading is correct, and there are no mistakes.
- The "Mixed-Curation" Group: This is like a student who gets to study 2,840 drawings. Most are good, but some have shaky lines, some are a bit messy, and a few are just "okay." It's a huge pile of work, but the quality varies.
The Test: The Final Exam
After the students studied their respective piles of drawings, they took a test. The researchers gave them new, unseen scans to outline and measured how close their drawings were to the "perfect" answer.
They used a scoring system (like a "closeness score") to see who did better.
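The paper's exact metric isn't named in this summary, but the standard "closeness score" in medical image segmentation is the Dice similarity coefficient, which measures how much the model's outline overlaps the expert's outline (1.0 = perfect match). A minimal sketch, assuming binary masks and using Dice purely as an illustrative stand-in:

```python
import numpy as np

def dice_score(prediction: np.ndarray, ground_truth: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks (1.0 = perfect overlap)."""
    pred = prediction.astype(bool)
    truth = ground_truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return 2.0 * intersection / total

# Toy example: a 4x4 "scan" where the model's outline mostly matches the expert's.
truth = np.array([[0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
pred  = np.array([[0, 1, 1, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
print(round(dice_score(pred, truth), 3))  # 2*3 / (3+4) ≈ 0.857
```

Because Dice rewards overlap relative to the combined size of both outlines, a model that draws a slightly too-small or too-large liver is penalized proportionally, which is why it is the usual yardstick for comparisons like the one in this experiment.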
The Results: What Happened?
The Main Score (3D Performance):
Surprisingly, the student who studied the small pile of perfect drawings did just as well as the student who studied the huge pile of messy drawings.
- The Analogy: It's like saying a student who memorized 244 perfect recipes can cook a meal just as deliciously as a student who tasted 2,840 different recipes, even if some of those 2,840 were slightly burnt or undercooked. In terms of the final dish, they were tied.
The "Real World" Test (Generalizability):
However, when the researchers tested the students on a completely different type of exam (using data from a different hospital), the student with the huge pile of messy drawings actually won by a small margin.
- The Analogy: The student who saw thousands of different, slightly imperfect examples learned to handle "weird" situations better. They were more flexible and adaptable when the test looked different from what they studied.
The Big Takeaway
The paper concludes that quality and quantity are a balancing act, not a simple "one is better than the other" rule.
- If you need a model that performs perfectly on standard, clean data, you might not need millions of images; a smaller, high-quality set is enough.
- But if you want the AI to be a "chameleon" that can handle weird, messy, or different real-world data, having a massive amount of data—even if it's not perfect—gives it an edge.
In short: You don't always need a library of a million books to learn a subject, but if you want to be an expert who can handle any question thrown at you, having a massive (even if slightly messy) library helps you see the bigger picture.