Standardised Human Phenotype Ontology Annotation… — Plain-Language Explanation


Published 2026-04-29


Preprint on medRxiv

CC BY 4.0

Original authors: Campos, L. C., Favreau, E., Greene, D., Blach, J., Thomas, M., Alsehaim, K., Mutlu, L., Elhadari, S., Herwadkar, A., Payne, J., Lever, C., Mahmoud, D., Moreira, F., O'Sullivan, M., Berry, M., Twigg, G., Hart, A. C. J., Joshi, N., Fuller, S., INTREPID Consortium, Smith, K. G. C., Turro, E., Cook, M. C., Wallace, C., Burns, S. O.

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to describe a complex recipe to a friend, but everyone uses their own slang. One person says "a pinch of spice," another says "a dash of heat," and a third says "something spicy." If you try to compare their notes later, it's a mess. You can't tell if they actually made the same dish or if they just used different words for the same thing.

This is exactly the problem doctors faced with Common Variable Immunodeficiency (CVID). It's a condition where the immune system is weak, but it looks different in every single patient. Some get lots of infections, while others get infections plus autoimmune problems, swollen organs, or lung issues. Because doctors described these symptoms in different ways, it was hard to see the big picture or group patients who were actually similar.

The Solution: A Universal "Medical Dictionary"
The researchers in this paper decided to fix this by using a tool called the Human Phenotype Ontology (HPO). Think of HPO as a massive, standardized dictionary of the signs and symptoms seen in human disease. Instead of writing "my tummy hurts," a doctor using HPO would select the exact term "Abdominal pain" from a list. Instead of "bad lungs," they'd pick "Bronchiectasis."

The team, led by the INTREPID consortium, built a special digital tool (a "Phenotype Capture Tool") to help doctors use this dictionary. But they knew that just giving doctors a dictionary isn't enough; they needed to teach them how to use it properly. So, they trained 28 clinicians across 11 UK hospitals on how to describe patients using these specific terms.

The Experiment: Training Makes Perfect
The researchers tested their idea in two ways:

The Test Drive: They gave 10 doctors the same fake patient case and asked them to describe it using the HPO dictionary. Before training, their descriptions were all over the place. After training, they were almost identical. It was like teaching a group of chefs to measure ingredients with the same cup instead of guessing with their hands.
The Real World: They looked at 526 real CVID patients. They compared the notes doctors wrote before the training to the notes written after.
- Before: The notes were sparse. Doctors typically listed about 7 HPO terms per patient (the median).
- After: The notes were rich and detailed. The median rose to 19 terms per patient.
- The Result: The doctors didn't just write more; they wrote better. They stopped using vague terms and started using precise ones, capturing the full complexity of the disease.

What They Discovered: Sorting the "Infection" vs. "Complex" Groups
With this high-quality data, the researchers could finally sort the patients into two distinct groups, like sorting a mixed bag of marbles by color:

Group A (Infection-Only): Patients who mostly just struggled with getting sick.
Group B (Complex): Patients who had infections plus other messy complications like autoimmune attacks, enlarged spleens, or liver issues.

They found that 58% of the patients fell into the "Complex" group.

Connecting the Dots: Genes and Symptoms
Because the data was so clean, they could finally draw clear lines between what was happening inside the patients' bodies (their genes and blood cells) and what was happening on the outside (their symptoms).

The "Complex" Clue: Patients in the "Complex" group were more likely to carry disease-causing genetic variants (notably in a gene called NFKB1) and to show abnormalities in their immune cells (such as very few "switched memory" B cells).
Specific Matches:
- If a patient had a mutation in the NFKB1 gene, they were highly likely to have autoimmune neutropenia (where the body attacks its own infection-fighting white blood cells).
- If a patient had a particular kind of variant in the TNFRSF13B gene (which encodes the TACI protein), they were more likely to get repeated yeast (Candida) infections.
- Patients with an expanded population of a particular B-cell subset (CD21low B cells) were more likely to have autoimmune thrombocytopenia (where the immune system destroys platelets, leading to low platelet counts).

The Bottom Line
This study suggests that if you teach doctors to speak the same "language" (HPO) and give them the right tools, you can turn a messy, confusing pile of patient notes into a clear, organized map.

By doing this, they didn't just count symptoms; they found evidence that the "Complex" type of CVID is biologically different from the "Infection-Only" type. They found that certain genetic variants are statistically associated with specific, severe complications. This means that in the future, looking at a patient's genetic code alongside a detailed symptom list could help doctors understand what kind of CVID a patient has, rather than treating all patients the same way.

In short: They built a better filing system, taught the staff how to use it, and in doing so, found hidden patterns that explain why some patients get sicker than others.

1. Problem Statement

Common Variable Immunodeficiency (CVID) is the most prevalent symptomatic primary immunodeficiency (PID) and is characterized by significant clinical heterogeneity. Patients present with diverse manifestations ranging from recurrent infections to complex immune dysregulation (autoimmunity, lymphoproliferation, granulomas).

The Challenge: Current classification systems (e.g., EUROClass) rely on B-cell phenotyping but often lack granular, standardized clinical data. Existing registries suffer from inconsistent data entry, variable coding practices, and a lack of structured phenotypic depth, which hinders the identification of genotype-phenotype correlations and the stratification of patients for targeted therapies.
The Gap: While the Human Phenotype Ontology (HPO) offers a structured framework for rare diseases, its application in large, real-world CVID cohorts has been limited by barriers such as clinician unfamiliarity, time pressures, and the lack of standardized tools for data capture.

2. Methodology

The study leveraged the INTREPID (Integrative Translational Research in Primary Immunodeficiency) project, a UK-wide whole-genome sequencing (WGS) initiative.

Cohort: 526 adult CVID patients recruited from 11 UK centers.
Tool Development: The authors developed a web-based Phenotype Capture Tool (PCT).
- Features an autocompletion search box for HPO terms.
- Allows entry of numerical laboratory values which are automatically mapped to corresponding HPO terms.
- Includes a specific "Yes/No" field for Granulomatous Lymphocytic Interstitial Lung Disease (GLILD), as it was not initially a standard HPO term.
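The laboratory-value mapping described for the PCT can be pictured as a small lookup. The sketch below is not the tool's actual code: the `RULES` table, the `map_lab_value` function, and the numeric ranges are hypothetical placeholders chosen for illustration (they are not clinical reference ranges), and the term labels are simply written in the style of HPO names.

```python
# Hypothetical rules: analyte -> (low cutoff, high cutoff, term if low, term if high).
# The cutoffs are illustrative placeholders in g/L, NOT clinical reference ranges.
RULES = {
    "IgG": (6.0, 16.0, "Decreased circulating IgG level", "Increased circulating IgG level"),
    "IgM": (0.4, 2.5, "Decreased circulating IgM level", "Increased circulating IgM level"),
}


def map_lab_value(analyte, value):
    """Return a phenotype label for an out-of-range value, or None if in range."""
    low, high, low_term, high_term = RULES[analyte]
    if value < low:
        return low_term
    if value > high:
        return high_term
    return None


if __name__ == "__main__":
    print(map_lab_value("IgG", 3.2))
    print(map_lab_value("IgM", 1.0))
```

Keeping the cutoffs in a single table means a clinician enters only the number, and the coded term follows from one shared rule rather than from individual judgment.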
Clinician Training: A structured training program was implemented where clinicians practiced on mock cases.
- Pre- vs. Post-Training Analysis: The study compared HPO annotations made by clinicians before training (at initial WGS recruitment) against annotations made after training for the same patients.
- Consistency Metrics: Pairwise Lin's semantic similarity scores were calculated to measure inter-clinician agreement.
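To make the consistency metric concrete, here is a minimal sketch of Lin's similarity, which scores two terms as twice the information content (IC) of their most informative common ancestor divided by the sum of their own ICs. The toy ontology, the term frequencies, and the function names are all hypothetical. The best-match average used to compare two clinicians' term sets is also an assumption, since this summary does not say how term-level scores were aggregated.

```python
import math

# Hypothetical toy ontology (child -> parents); not the real HPO structure.
PARENTS = {
    "root": set(),
    "immune_abnormality": {"root"},
    "autoimmunity": {"immune_abnormality"},
    "autoimmune_neutropenia": {"autoimmunity"},
    "autoimmune_thrombocytopenia": {"autoimmunity"},
    "recurrent_infections": {"immune_abnormality"},
}

# Hypothetical fraction of patients annotated with each term or a descendant.
FREQ = {
    "root": 1.0,
    "immune_abnormality": 0.8,
    "autoimmunity": 0.2,
    "autoimmune_neutropenia": 0.05,
    "autoimmune_thrombocytopenia": 0.1,
    "recurrent_infections": 0.6,
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: rarer terms carry more information."""
    return -math.log(FREQ[term])


def lin(a, b):
    """Lin similarity: 2 * IC(most informative common ancestor) / (IC(a) + IC(b))."""
    common = ancestors(a) & ancestors(b)
    mica_ic = max(ic(t) for t in common)
    denom = ic(a) + ic(b)
    return 2 * mica_ic / denom if denom > 0 else 1.0


def set_similarity(terms_a, terms_b):
    """Best-match average between two annotation sets (an assumed aggregation)."""
    def one_way(xs, ys):
        return sum(max(lin(x, y) for y in ys) for x in xs) / len(xs)
    return (one_way(terms_a, terms_b) + one_way(terms_b, terms_a)) / 2


if __name__ == "__main__":
    print(lin("autoimmune_neutropenia", "autoimmune_thrombocytopenia"))
    print(lin("autoimmune_neutropenia", "recurrent_infections"))
```

In this toy example the two autoimmune terms score higher with each other than either does with recurrent infections, because they share a more specific ancestor.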
Data Analysis:
- Categorization: Patients were stratified into two groups based on HPO terms: Infection-only (recurrent infections/bronchiectasis) vs. Complex (infections + autoimmunity, splenomegaly, lymphoproliferation, GLILD, etc.).
- Statistical Testing: Associations between phenotypes, B-cell subsets (EUROClass), immunoglobulin levels, and genetic variants were tested using Cochran-Mantel-Haenszel tests (stratified by center) with Benjamini-Hochberg (BH) correction for multiple testing.
- Dimensionality Reduction: Principal Component Analysis (PCA) using HPO embeddings (node2vec) was used to visualize phenotypic clustering.
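The Benjamini-Hochberg step mentioned under Statistical Testing can be sketched in a few lines. This is a generic textbook version of the procedure, not the authors' pipeline, and the P-values in the example are made up for illustration rather than taken from the paper.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Illustrative p-values only; not results from the study.
    raw = [0.001, 0.008, 0.039, 0.041, 0.6]
    print(benjamini_hochberg(raw))
```

Each raw P-value is scaled by the number of tests divided by its rank, so an association must be stronger to survive when many phenotype-by-marker pairs are tested at once.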

3. Key Contributions

First Large-Scale HPO Application in CVID: This is the first study to systematically apply granular HPO annotation to a large, real-world CVID cohort (n=526).
Validation of Training Workflows: Demonstrated that structured clinician training significantly improves both the granularity (number of terms per patient) and consistency (inter-rater reliability) of phenotypic data.
Refined Phenotypic Stratification: Successfully defined a "Complex CVID" subgroup distinct from "Infection-only" patients using a logic-based set of HPO terms.
Ontology Curation: Identified 41 missing disease-specific HPO terms (e.g., specific granuloma types, recurrent infections) and proposed new hierarchical relationships, submitting them to the HPO database.
Open Science: Released the Phenotype Capture Tool, the analysis pipeline (GitHub), and a comprehensive dataset of 883 unique HPO terms mapped to the cohort.

4. Key Results

Impact of Training:
- Granularity: The median number of non-redundant HPO terms per patient increased from 7 (IQR 8) pre-training to 19 (IQR 11) post-training ($P < 2.2 \times 10^{-16}$).
- Consistency: Pairwise semantic similarity between clinicians annotating the same case was high (median 0.80), indicating standardized coding practices were achieved.
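The phrase "non-redundant HPO terms" matters for the term counts above. One common reading, assumed here because this summary does not give the paper's exact definition, is that a term is dropped when a more specific descendant is already recorded for the same patient. The sketch below uses a hypothetical toy ontology and hypothetical function names.

```python
# Hypothetical toy ontology (child -> parents); not the real HPO structure.
PARENTS = {
    "root": set(),
    "immune_abnormality": {"root"},
    "autoimmunity": {"immune_abnormality"},
    "autoimmune_neutropenia": {"autoimmunity"},
    "recurrent_infections": {"immune_abnormality"},
}


def strict_ancestors(term):
    """Return every ancestor of a term, excluding the term itself."""
    found = set()
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def non_redundant(terms):
    """Drop any term that is already implied by a more specific term in the set."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= strict_ancestors(term)
    return terms - implied


if __name__ == "__main__":
    recorded = {"autoimmunity", "autoimmune_neutropenia", "recurrent_infections"}
    print(sorted(non_redundant(recorded)))
```

Counting only the surviving terms avoids inflating a patient's total simply because both a broad term and its specific child were ticked.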
Cohort Stratification:
- Infection-only: 42% (220 patients).
- Complex CVID: 58% (306 patients).
- PCA confirmed that the first principal component effectively separated these two groups.
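The Infection-only versus Complex split can be sketched as a simple set rule, followed by a check that the reported counts match the reported percentages. The `COMPLEX_FEATURES` set lists only the complications named in this summary (the paper's full term list is not reproduced here), and the feature labels and `classify` function are hypothetical.

```python
# Complications named in this summary; the paper's full HPO term list is longer.
COMPLEX_FEATURES = {"autoimmunity", "splenomegaly", "lymphoproliferation", "GLILD"}


def classify(patient_features):
    """Label a patient Complex if any complication feature is present."""
    if COMPLEX_FEATURES & set(patient_features):
        return "Complex"
    return "Infection-only"


# Counts reported in the results above.
counts = {"Infection-only": 220, "Complex": 306}
total = sum(counts.values())
shares = {group: round(100 * n / total) for group, n in counts.items()}

if __name__ == "__main__":
    print(classify({"recurrent infections", "bronchiectasis"}))
    print(classify({"recurrent infections", "splenomegaly"}))
    print(total, shares)
```

In practice a patient's specific terms (for example, autoimmune neutropenia) would first need to be rolled up to broader ancestors such as autoimmunity before this kind of matching works.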
Genotype-Phenotype Associations:
- Complex Phenotype: Significantly associated with very low switched memory B cells (<2%) and expanded CD21low B cells (>10%).
- Genetic Drivers: Patients with complex disease were significantly more likely to carry pathogenic variants in IUIS-listed genes overall ($P = 0.002$) and specifically NFKB1 variants ($P = 0.0035$).
- Specific Associations:
  - NFKB1 variants $\leftrightarrow$ Autoimmune neutropenia ($P = 7.4 \times 10^{-6}$).
  - Pathogenic IUIS variants (non-NFKB1/TACI) $\leftrightarrow$ Autoimmune hemolytic anemia ($P = 0.009$).
  - Canonical TNFRSF13B (TACI) variants $\leftrightarrow$ Recurrent candida infections ($P = 0.046$).
  - Elevated IgM $\leftrightarrow$ Pulmonary interstitial lymphocytic infiltration ($P = 4.5 \times 10^{-7}$).
B-Cell Correlates:
- Very low switched memory B cells were associated with splenomegaly.
- Expanded CD21low B cells were linked to autoimmunity, specifically autoimmune thrombocytopenia.

5. Significance and Clinical Implications

Standardization: The study indicates that HPO-based phenotyping is feasible in routine clinical practice when supported by user-friendly tools and training. It reduces inter-clinician variability, enabling robust multi-center comparisons.
Precision Medicine: By distinguishing "Complex" from "Infection-only" CVID, the study facilitates better patient stratification. This is crucial for identifying patients who may benefit from immunosuppressive therapies (for complex disease) versus those requiring only immunoglobulin replacement.
Genetic Discovery: The high-quality phenotypic data strengthens the power of WGS to identify monogenic causes (like NFKB1) and link them to specific clinical sub-phenotypes, moving beyond the "diagnosis-based" framework to a "phenotype-driven" approach.
Future Directions: The authors highlight that while manual annotation is effective, future work should integrate automated extraction from unstructured electronic health records (EHR) to capture longitudinal data and further deepen phenotypic resolution.

In conclusion, this paper establishes a new standard for CVID research, providing a validated, high-quality, HPO-coded dataset that enables the dissection of disease heterogeneity and supports the development of targeted therapeutic strategies.

Standardised Human Phenotype Ontology Annotation Enables High Quality Phenotypic Data Capture in a Real-World Common Variable Immunodeficiency Cohort
