Biologically informed genetic data transformations improve multi-omic comorbidity prediction in people with HIV

This study shows that, in people with HIV, biologically informed genetic data transformations (specifically polygenic risk scores and AlphaGenome-derived gene-level impact scores) yield more accurate multi-omic prediction of coronary artery disease and chronic kidney disease than raw SNP genotypes or principal components.

Original authors: Ryan, B., Thorball, C. W., Ait Oumelloul, M., Kouyos, R., Tarr, P. E., Fellay, J.

Published 2026-03-10

CC BY 4.0

Original paper licensed under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). ⚕️ This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine your body is a massive, complex orchestra. To understand why some people develop specific health issues like heart disease or kidney trouble, scientists try to listen to several sections of the orchestra at once: DNA, proteins, metabolites, and more. This is called multi-omics.

In this study, the researchers focused on people living with HIV. Even though modern antiretroviral therapy (ART) keeps them alive and healthy, they are still at higher risk for "comorbidities" like Coronary Artery Disease (CAD) and Chronic Kidney Disease (CKD).

The researchers wanted to build a "crystal ball" (a computer model) to predict who might get sick. They knew two things were important:

The Genetic Scorecard (Genomics): Your DNA, which is like the sheet music written before the concert even starts.
The Current Performance (Other Omics): Your blood proteins or metabolites, which are like the actual sound the orchestra is making right now.

The Problem: Too Much Sheet Music

The researchers ran into a huge problem. Your DNA sheet music is enormous—it has millions of notes (called SNPs). If you try to feed all those millions of raw notes into a computer along with the current performance data, the computer gets confused. It's like trying to read a library's worth of books while simultaneously listening to a symphony; the computer just can't find the signal in the noise.

Usually, scientists try to fix this by:

Throwing away most of the notes: Keeping only a random handful.
Grouping notes together: Using a technique called PCA (Principal Component Analysis) to summarize the music.

But the researchers suspected these methods were too blunt. They wanted to see if smarter ways of organizing the genetic data would help the computer predict the future better.

The Experiment: Four Ways to Read the Sheet Music

The team tested four different ways to translate the raw genetic data into something the computer could understand:

Raw Notes (Raw SNPs): Feeding the computer the raw genetic letters directly (about 25,000 of them, after filtering out redundant ones).
The Summary (PCA): Grouping the notes into broad themes.
The "Risk Score" (PRS): Instead of looking at every note, they used a pre-made "Risk Score" based on what we already know about heart and kidney disease from huge global studies. It's like having a cheat sheet that says, "These specific notes usually mean trouble."
The "AI Translator" (AlphaGenome): They used a super-advanced AI (a "foundational DNA model") that reads the DNA and translates it into a summary of how specific genes might be impacted. It's like having a genius conductor who looks at the sheet music and instantly tells you, "This section is likely to be loud and chaotic."

The Results: Quality Over Quantity

Here is what they found, using a simple analogy:

The "Raw Notes" and "Summary" approaches failed. When they tried to mix the raw genetic data or the simple summaries with the current blood data, the computer actually got worse at predicting who would get sick. It was like trying to mix a messy pile of sheet music with the live audio; the noise drowned out the useful signal.
The "Smart Translations" won. When they used the Risk Scores (PRS) or the AI Translator (AlphaGenome), prediction accuracy held steady or improved slightly instead of dropping.
- For Kidney Disease, the best combined model paired the "AI Translator" genetic data with the blood metabolites. It was about as accurate as the metabolites alone, with steadier results.
- For Heart Disease, the "Risk Score" (PRS) was the star player.

The Big Takeaway

The main lesson of this paper is: Don't just dump raw data into a computer.

If you want to predict disease using a person's DNA and their current blood work, you need to translate the DNA first. You need to turn those millions of confusing genetic letters into a few meaningful, biologically smart summaries (like a Risk Score or an AI-generated impact score).

In everyday terms:
Imagine you are trying to guess the weather.

Method A (Raw Data): You give the computer 10 million raw temperature readings from every single leaf on every tree in the world. The computer crashes.
Method B (Smart Translation): You give the computer a simple report that says, "The air pressure is dropping, and the humidity is high." The computer predicts rain perfectly.

This study shows that for people with HIV, using these "smart translations" of their DNA helps prediction models anticipate heart and kidney problems better than mixing in raw genetic data. It's a step toward personalized medicine that can work with hundreds of participants rather than millions; it just requires the right way of looking at the data we already have.

1. Problem Statement

People with HIV (PWH) face increased risks of age-associated comorbidities, specifically Coronary Artery Disease (CAD) and Chronic Kidney Disease (CKD), due to systemic inflammation and premature aging. While these conditions have genetic components, integrating genomic data with other omics layers (e.g., proteomics, metabolomics) for prediction remains challenging.

Key Challenges Identified:

Data Scale and Sparsity: Genomic data involves millions of Single Nucleotide Polymorphisms (SNPs), creating a high-dimensional, sparse, and categorical dataset (each genotype coded as 0, 1, or 2 copies of an allele) that is difficult to integrate with continuous omics data (like transcriptomics or proteomics).
Lack of Standardization: Current multi-omics methods often rely on aggressive dimensionality reduction (e.g., PCA) or arbitrary SNP selection without biological rationale, which can degrade predictive performance.
Sample Size Limitations: Large-scale genomic studies typically require tens of thousands of participants, but multi-omics cohorts are often smaller, making direct integration of raw SNP matrices inefficient and prone to overfitting.

2. Methodology

Datasets:
The study utilized two subsets from the Swiss HIV Cohort Study (SHCS):

CAD Cohort: 436 cases and 436 controls (matched) with proteomic profiles.
CKD Cohort: 166 cases and 166 controls (matched) with metabolomic profiles.
Note: Samples were taken prior to diagnosis, making the task prognostic.

Genomic Data Transformations:
The authors evaluated four distinct ways to represent genotype data:

Raw SNP Matrices: Pruned SNPs (after LD pruning, ~25k SNPs) used directly.
Principal Component Analysis (PCA): Dimensionality reduction of the pruned SNPs.
Polygenic Risk Scores (PRS): Aggregated scores derived from GWAS summary statistics (using the PGS Catalog) for all available phenotypes.
AlphaGenome Scores: Gene-level impact scores derived from a foundational DNA model (AlphaGenome) predicting variant effects on gene expression in specific tissues (coronary artery for CAD, kidney for CKD).
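As a rough illustration of the PRS transformation listed above: a polygenic risk score is conventionally a weighted sum of a person's allele dosages, with the weights taken from GWAS effect sizes. The sketch below is not the authors' pipeline. The function name, the SNP weights, and the genotypes are hypothetical values chosen only to show how many SNP columns collapse into one number per person.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2) for one person."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must cover the same SNPs")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-SNP effect sizes (in practice these would come from
# published GWAS summary statistics, e.g. a PGS Catalog scoring file).
weights = [0.5, -0.25, 1.0, 0.125]

# Hypothetical genotypes: one dosage per SNP, in the same order as `weights`.
cohort = {
    "person_a": [0, 1, 2, 0],
    "person_b": [2, 2, 0, 1],
}

# Each person's four SNP values are reduced to a single score.
scores = {pid: polygenic_risk_score(g, weights) for pid, g in cohort.items()}
print(scores)
```

The design point is the reduction in dimensionality: the model downstream sees one feature per score rather than one feature per SNP.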

Model Architecture & Integration:

Classifiers: Linear (Lasso Logistic Regression) and Deep Learning (Two-layer Perceptron with ReLU, Dropout, Adam optimizer).
Integration Strategies:
1. Feature Concatenation: Appending genomic features directly to omics features.
2. Multimodal Encoder: A deep-learning architecture where each modality is compressed into latent embeddings via separate perceptrons, then merged via mean pooling.
Evaluation: Five-fold nested cross-validation (80/20 train/test split in the outer loop; 5-fold inner loop for hyperparameter tuning), with the same patient splits reused across all configurations so that results are directly comparable.
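The evaluation scheme described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the splits are not stratified and do not keep matched case/control pairs together, and a fixed seed is one simple way to make every model configuration see identical patient splits.

```python
import random


def k_fold_indices(indices, k, seed):
    """Shuffle the indices once and deal them into k near-equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv_splits(n_samples, outer_k=5, inner_k=5, seed=0):
    """Yield (train, test, inner_splits) for each outer fold.

    inner_splits is a list of (fit, validation) index lists drawn only
    from the outer training set, so test samples never influence tuning.
    """
    outer_folds = k_fold_indices(range(n_samples), outer_k, seed)
    for i, test in enumerate(outer_folds):
        train = [idx for j, fold in enumerate(outer_folds) if j != i for idx in fold]
        inner_folds = k_fold_indices(train, inner_k, seed + 1 + i)
        inner_splits = []
        for v, validation in enumerate(inner_folds):
            fit = [idx for u, fold in enumerate(inner_folds) if u != v for idx in fold]
            inner_splits.append((fit, validation))
        yield train, test, inner_splits


# Hypothetical cohort of 100 participants: each outer fold is an 80/20 split.
splits = list(nested_cv_splits(n_samples=100))
for train, test, inner_splits in splits:
    print(len(train), len(test), len(inner_splits))
```

Because the splits depend only on the seed and the sample count, calling the function again for a different feature set returns the same partitions.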

3. Key Results

Single-Omic Performance:

CKD: Metabolomics was the strongest predictor (Accuracy $\approx$ 0.68). No genomic modality alone exceeded the 0.50 chance-level accuracy expected for balanced cases and controls.
CAD: PRS was the strongest genomic predictor (Accuracy $\approx$ 0.60), outperforming raw SNPs and PCA. Proteomics also showed predictive value (Accuracy $\approx$ 0.57).

Multi-Omic Integration Performance:

Failure of Raw/PCA Integration: Integrating raw SNPs or PCA embeddings with other omics reduced performance compared to the best single-omic models.
- Example (CKD): Metabolomics alone = 0.68; Metabolomics + Raw SNPs = 0.63–0.64.
Success of Biologically Informed Transformations: Integrating PRS or AlphaGenome scores maintained or slightly improved performance relative to single-omic baselines.
- CKD Best Result: Metabolomics + AlphaGenome (Logistic Regression) achieved 0.67 ± 0.02 accuracy (comparable to metabolomics alone but with reduced standard error).
- CAD Best Result: Proteomics + PRS (Logistic Regression) achieved 0.61 ± 0.04 accuracy, a slight improvement over PRS alone (0.60) and significantly better than Proteomics + Raw SNPs (0.55).
Model Complexity: Linear models (Logistic Regression) generally performed as well as or better than deep learning multimodal encoders, suggesting that for these specific datasets, linear weighting of features was sufficient and higher-order cross-modal interactions were not the primary driver of signal.
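Figures such as "0.67 ± 0.02" summarize accuracy across cross-validation folds. The sketch below shows one common way to produce such a summary: the mean of the per-fold accuracies with the standard error of that mean. Whether the authors computed their intervals exactly this way is an assumption, and the fold accuracies used here are hypothetical placeholders, not values from the paper.

```python
import math
import statistics


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_and_standard_error(scores):
    """Mean of the scores and the standard error of that mean."""
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, standard_error


# Hypothetical accuracies from five outer folds (illustrative only).
fold_accuracies = [0.5, 0.75, 0.5, 0.75, 0.5]
mean_acc, se_acc = mean_and_standard_error(fold_accuracies)
print(f"{mean_acc:.2f} ± {se_acc:.2f}")
```

With balanced cases and controls, a model that guesses at random is expected to score about 0.50, which is why that value serves as the baseline in the results above.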

4. Key Contributions

Demonstration of Transformation Necessity: The study provides empirical evidence that directly integrating raw genomic data (SNPs/PCA) into multi-omics models often degrades performance.
Validation of Biological Transformations: It highlights that transforming genomics into biologically meaningful, lower-dimensional features (PRS and AlphaGenome scores) preserves predictive signal and enables effective integration with other omics layers.
Framework for Small Cohorts: The approach offers a viable pipeline for multi-omics prediction in cohorts with limited sample sizes (hundreds rather than tens of thousands) by leveraging external GWAS summary statistics and foundation models.
Benchmarking Integration Strategies: It compares concatenation vs. latent space integration, finding that simple concatenation with linear models is often optimal when signals are weak and independent.
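To make the two integration strategies concrete, here is a minimal sketch for a single participant. It is illustrative only: the feature values and encoder weights are hypothetical and fixed by hand, whereas in the study the encoders are trained, and the one-layer ReLU encoder is a simplification of the perceptrons described in the methodology.

```python
def concatenate(omics, genomics):
    """Strategy 1: append genomic features directly to the omics features."""
    return list(omics) + list(genomics)


def encode(features, weights, bias):
    """One-layer perceptron encoder: ReLU(W x + b)."""
    pre_activation = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return [max(0.0, value) for value in pre_activation]


def mean_pool(embeddings):
    """Strategy 2 merge step: element-wise mean of same-length embeddings."""
    n = len(embeddings)
    return [sum(values) / n for values in zip(*embeddings)]


# Hypothetical inputs for one participant.
metabolites = [1.0, 2.0, 3.0]   # continuous omics features
genetic_scores = [0.5, -1.0]    # e.g. PRS or gene-level impact scores

# Strategy 1: a single 5-dimensional feature vector for a linear model.
concat_features = concatenate(metabolites, genetic_scores)

# Strategy 2: each modality gets its own encoder into a shared 2-d latent space.
omics_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]]   # hypothetical weights
genomic_weights = [[4.0, 0.0], [0.0, -1.0]]           # hypothetical weights
z_omics = encode(metabolites, omics_weights, [0.0, 0.0])
z_genomic = encode(genetic_scores, genomic_weights, [0.0, 0.0])
fused = mean_pool([z_omics, z_genomic])

print(concat_features)
print(fused)
```

Concatenation keeps every input feature visible to the classifier, while mean pooling gives each modality an equal share of the fused representation regardless of how many raw features it started with.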

5. Significance and Implications

Clinical Relevance: As PWH live longer due to ART, predicting comorbidities like CAD and CKD is critical. This study suggests that incorporating "biologically informed" genetic scores can enhance prognostic models without requiring massive sample sizes.
Methodological Shift: The findings argue against the "brute force" approach of feeding raw SNPs into multi-omics models. Instead, researchers should pre-process genomic data using domain knowledge (GWAS) or foundation models (AlphaGenome) to extract relevant biological signals before integration.
Future Directions: The study suggests that foundational DNA models like AlphaGenome hold promise for capturing regulatory signatures across tissues. Future work should explore varying window sizes for these models and integrating multi-tissue/multi-track data to further refine predictions.

Conclusion:
The paper concludes that biologically informed genomic transformations (specifically PRS and AlphaGenome scores) are superior to raw SNP matrices or PCA for multi-omics integration. These methods allow for the effective combination of genomics with proteomics or metabolomics to predict comorbidities in people with HIV, even in cohorts where sample sizes are too small for traditional large-scale genomic integration.