VICatMix: variational Bayesian clustering and variable selection for discrete biomedical data

Imagine you are a detective trying to solve a massive mystery. You have a room full of thousands of people (patients), and you have a giant stack of clues for each person (genetic data, protein levels, mutation history). Your goal is to sort these people into groups based on who they are most similar to, so you can figure out which group needs which specific medicine. This is called clustering.

However, there are two big problems with this detective work:

The Noise: Most of the clues are useless. Out of 1,000 genetic markers, maybe only 10 actually tell you anything about the disease. The other 990 are just background chatter (noise) that confuses the detective.
The Speed: The room is so big, and the clues so complex, that if you try to sort everyone out by hand (or using old, slow computer methods), you'll be there until the sun burns out.

Enter VICatMix. Think of it as a super-smart, high-speed sorting robot designed specifically for this messy, noisy room.

Here is how it works, broken down into simple concepts:

1. The "Over-Prepared" Detective (The Model)

Usually, when you try to sort people, you have to guess how many groups there are beforehand. "Okay, I think there are 3 types of cancer." But what if there are actually 5? Or 7?
VICatMix is like a detective who says, "I'm going to prepare for way more groups than I think there are." It sets up 30 or 40 empty rooms (clusters) just in case.

The Magic Trick: As the robot starts sorting people, it realizes, "Hey, nobody is going into Room 37 or Room 39." So, it naturally closes those empty rooms. It figures out the true number of groups on its own, without you having to guess.

2. The "Noise Filter" (Variable Selection)

Remember that stack of 1,000 clues where 990 are useless?
Old methods try to use all the clues, getting confused by the noise. VICatMix has a special "Noise Filter." It asks every single clue: "Are you actually important for sorting these people?"

If a clue says "I'm just random noise," VICatMix ignores it.
If a clue says "I'm a key driver of the disease," VICatMix highlights it.
This is crucial for finding the "smoking gun" genes that cause cancer, rather than getting lost in the weeds.

3. The "Speedy Brain" (Variational Inference)

Traditionally, to find the perfect groups, computers use a method called MCMC. Imagine this as a hiker trying to find the highest peak in a foggy mountain range. They have to wander around randomly, checking every single spot to make sure they aren't missing a higher peak. It's accurate, but it takes forever.

VICatMix uses Variational Inference (VI). Instead of wandering randomly, it's like a drone that flies straight up, using a map to estimate the highest peak instantly.

The Trade-off: It's an approximation, not a perfect walk-through. But it's so much faster that it can handle huge datasets (like thousands of patients) in minutes or hours, whereas the old method might take days or weeks.

4. The "Group Consensus" (Model Averaging)

Because VICatMix is so fast, it can run the sorting process 30 times in the time it takes the old method to run once.

The Problem: Sometimes, the robot gets stuck in a "local optimum": it finds a good solution, but not the best one, simply because of where it happened to start.
The Solution: VICatMix runs the sort 30 times with different starting points. Then, it looks at all 30 results and asks, for every pair of people, "In how many of the 30 runs did Patient A and Patient B end up in the same group?"
The Result: It creates a "Super-Group" based on the consensus. This smooths out the mistakes and gives a much more reliable answer than any single run could.

Why Does This Matter? (Real World Examples)

The paper tests this robot on real medical data:

Yeast: It successfully grouped yeast genes by their function, matching what scientists already knew.
Leukemia (AML): It looked at mutation data from 185 patients. Without the noise filter, it would have failed. But with it, it found 6 specific genes that were the real culprits. These are genes doctors already know are dangerous, proving the robot works.
Pan-Cancer: It took data from 12 different types of cancer (breast, lung, colon, etc.) and sorted them. It didn't just group them by cancer type; it found sub-groups within those types (like "Basal-like" breast cancer), which is vital for giving patients the right treatment.

The Bottom Line

VICatMix is a new tool for doctors and scientists. It takes messy, high-dimensional biological data, filters out the junk, figures out how many distinct groups of patients exist, and does it all far faster than the older sampling-based methods. It turns a mountain of confusing data into a clear map, helping us move closer to precision medicine, where treatment is tailored to the specific group a patient belongs to, rather than a "one size fits all" approach.

1. Problem Statement

The paper addresses the challenge of clustering high-dimensional discrete (categorical/binary) biomedical data, such as genomic mutation profiles, gene expression states, and 'omics' data.

The Core Challenge: Traditional clustering methods (e.g., $k$-means, hierarchical clustering) lack statistical rigor and do not provide probabilistic interpretations. Model-based approaches using Finite Mixture Models (FMM) are statistically sound but face significant hurdles:
- Unknown Cluster Count ($K$): The true number of clusters is rarely known a priori.
- Computational Cost: Bayesian inference using Markov Chain Monte Carlo (MCMC) is computationally expensive and struggles with large datasets. MCMC also suffers from convergence issues, label switching, and getting stuck in local optima.
- Variable Selection: In high-dimensional 'omics' data, only a subset of features drives the clustering structure. Noise from irrelevant variables degrades performance.
- Local Optima: Variational Inference (VI), while faster than MCMC, is deterministic and prone to converging to poor local optima depending on initialization.

2. Methodology: VICatMix

The authors propose VICatMix, a Variational Bayesian Finite Mixture Model specifically designed for categorical data with integrated variable selection.

A. Model Specification

Likelihood: The data is modeled as a mixture of $K$ components, where each component is a product of independent categorical distributions, one per variable (reducing to Bernoulli distributions for binary data).
Variable Selection: The model introduces binary indicator variables $\gamma_j$ for each covariate $j$.
- If $\gamma_j = 1$, the variable contributes to the cluster structure.
- If $\gamma_j = 0$, the variable follows a "null" distribution (independent of cluster membership), effectively removing it from the clustering signal.
Priors:
- Mixing Weights: A symmetric Dirichlet prior with $\alpha_0 < 1$ is used. This creates an overfitted sparse mixture, allowing the model to start with $K > K_{true}$ and asymptotically "empty" superfluous clusters as data increases, thereby inferring the true number of clusters.
- Selection Indicators: A hierarchical Bernoulli-Beta prior is placed on $\gamma_j$ to allow the probability of a variable being relevant to be inferred.
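The specification above can be written compactly as follows. This is a sketch under stated assumptions rather than the authors' exact notation: the null parameters $\phi_{0j}$, the Beta hyperparameters $a, b$, and the per-variable selection probability $\delta_j$ are notational choices made here (the summary names $\delta$ only alongside $\gamma$ in the variational factorisation), and the prior on $\Phi$ is omitted because it is not specified in this summary.

$$
\begin{aligned}
\pi &\sim \mathrm{Dirichlet}(\alpha_0, \dots, \alpha_0), \qquad \alpha_0 < 1, \\
z_n \mid \pi &\sim \mathrm{Categorical}(\pi), \qquad n = 1, \dots, N, \\
\delta_j &\sim \mathrm{Beta}(a, b), \qquad \gamma_j \mid \delta_j \sim \mathrm{Bernoulli}(\delta_j), \\
p(x_{nj} \mid z_n = k, \gamma_j, \Phi) &= \left[\phi_{k j x_{nj}}\right]^{\gamma_j} \left[\phi_{0 j x_{nj}}\right]^{1 - \gamma_j}.
\end{aligned}
$$

When $\gamma_j = 0$ the last line no longer depends on $k$, which is exactly why such a variable carries no clustering signal.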

B. Variational Inference (VI)

Instead of MCMC, VICatMix uses Variational Inference to approximate the posterior distribution.

Optimization: The method maximizes the Evidence Lower Bound (ELBO), converting the inference problem into an optimization problem.
Mean-Field Approximation: The posterior is approximated as a product of independent distributions for latent variables ($Z$), mixing weights ($\pi$), parameters ($\Phi$), and selection indicators ($\gamma, \delta$).
Efficiency: This approach is significantly faster than MCMC, scaling linearly with the number of observations and variables, making it feasible for large 'omics' datasets.
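For reference, the standard form of the ELBO and of the mean-field factorisation described above is sketched below. Here $X$ denotes the data and $\theta = (Z, \pi, \Phi, \gamma, \delta)$ collects all unknowns; both symbols are introduced here for illustration. Whether $q(\gamma, \delta)$ factorises further is not stated in this summary, so it is left as a joint term.

$$
\log p(X) \;\geq\; \mathcal{L}(q) = \mathbb{E}_{q}\left[\log p(X, \theta)\right] - \mathbb{E}_{q}\left[\log q(\theta)\right],
\qquad
q(\theta) = q(Z)\, q(\pi)\, q(\Phi)\, q(\gamma, \delta).
$$

The gap in the bound is $\log p(X) - \mathcal{L}(q) = \mathrm{KL}\left(q(\theta) \,\|\, p(\theta \mid X)\right)$, so maximising the ELBO is equivalent to minimising the KL divergence from the approximation to the true posterior.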

C. Mitigating Local Optima: Model Averaging (VICatMix-Avg)

To address the sensitivity of VI to initialization and the risk of local optima, the authors introduce a co-clustering matrix approach:

Multiple Runs: The model is run $M$ times with different random initializations.
Co-clustering Matrix ($P$): An $N \times N$ matrix is constructed where $P_{ij}$ represents the empirical probability that observations $i$ and $j$ are assigned to the same cluster across the $M$ runs.
Summarization: A single, robust clustering solution ($Z^*$) is derived from $P$ using:
- Medvedovic Clustering: Agglomerative hierarchical clustering on the distance matrix $(1-P)$.
- Variation of Information (VoI): An information-theoretic loss function optimized via hierarchical clustering (average or complete linkage) to find the clustering closest to the "consensus" of the runs.
Variable Selection Aggregation: A variable is retained if it is among the selected variables in a high proportion (e.g., $\geq 95\%$) of the $M$ runs.
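The averaging steps above can be sketched in a few lines of pure Python. This is a minimal illustration, not the VICatMix package's implementation: the function names, the toy label vectors, and the choice to stop merging at a user-supplied `n_clusters` are all assumptions made here (the summary does not say how the final number of clusters is chosen). Average linkage on $1 - P$ stands in for the Medvedovic-style summarization.

```python
from itertools import combinations


def coclustering_matrix(runs):
    """P[i][j] = fraction of runs in which observations i and j share a label."""
    n, m = len(runs[0]), len(runs)
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def average_linkage(P, n_clusters):
    """Agglomerative clustering on the distance 1 - P (hypothetical helper).

    Repeatedly merges the two clusters with the smallest average pairwise
    distance until n_clusters remain.
    """
    clusters = [[i] for i in range(len(P))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(1.0 - P[i][j] for i in clusters[a] for j in clusters[b])
            dist = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or dist < best[0]:
                best = (dist, a, b)
        _, a, b = best  # a < b, so deleting b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(P)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


def stable_variables(selected_per_run, threshold=0.95):
    """Indices of variables selected (1) in at least `threshold` of the runs."""
    m = len(selected_per_run)
    n_vars = len(selected_per_run[0])
    return [j for j in range(n_vars)
            if sum(run[j] for run in selected_per_run) / m >= threshold]


# Toy example: three runs over five observations (made-up labels).
runs = [[0, 0, 1, 1, 1],
        [1, 1, 0, 0, 0],   # same partition as run 1, labels switched
        [0, 0, 0, 1, 1]]   # observation 2 lands in the other cluster
P = coclustering_matrix(runs)
consensus = average_linkage(P, n_clusters=2)
print(consensus)
```

Because $P$ depends only on whether two observations share a label, it is unaffected by label switching between runs, which is what makes it a convenient object to average over.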

3. Key Contributions

VICatMix Algorithm: A novel variational Bayesian framework for categorical data that simultaneously performs clustering and variable selection.
Efficiency: By replacing MCMC with Variational Inference, the method achieves orders-of-magnitude speedups, enabling analysis of datasets with thousands of samples and variables.
Robustness via Averaging: The introduction of the co-clustering matrix and model averaging (VICatMix-Avg) effectively mitigates the local optima problem inherent in VI, providing stable estimates of the true number of clusters and feature saliency.
Handling Sparsity: The sparse Dirichlet prior allows for automatic determination of the number of clusters without pre-specifying $K$.
Open Source Implementation: The method is implemented as an R package (VICatMix) with C++ acceleration, making it accessible to the biomedical community.

4. Results

A. Simulation Studies

Accuracy: VICatMix-Avg consistently outperformed competitors (PReMiuM, BayesBinMix, FlexMix, BHC) in terms of Adjusted Rand Index (ARI), often achieving ARI > 0.9.
Cluster Count: The model accurately identified the true number of clusters ($K_{true}$), whereas methods like BIC-based selection often underestimated $K$, and MCMC methods struggled with label switching.
Variable Selection: In noisy datasets (where 25-50% of variables were irrelevant), VICatMix with variable selection (VICatMixVarSel) maintained high accuracy, while models without selection degraded significantly. F1 scores for variable selection were high (0.9+).
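Scalability: Run-times scaled linearly with the number of observations $N$ and the number of variables $P$ (not to be confused with the co-clustering matrix $P$ in the methodology). VICatMix was significantly faster than MCMC-based competitors (e.g., PReMiuM, BayesBinMix), handling datasets with 20,000+ samples in hours rather than days.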
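The Adjusted Rand Index used for the accuracy comparison above measures agreement between two partitions, corrected for chance, and can be computed from the contingency counts of two label vectors. A minimal pure-Python sketch follows; the function name and the convention of returning 1.0 in degenerate cases are choices made here, not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same observations."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # convention chosen here for trivial inputs
    # Number of co-clustered pairs in both partitions, and in each separately.
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # both partitions are trivial (e.g. a single cluster each)
    return (sum_pairs - expected) / (max_index - expected)


# Identical partitions score 1 even when the label names differ.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

An ARI near 0 indicates agreement no better than chance, and negative values are possible when two partitions disagree more than chance would predict.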

B. Real-World Applications

Yeast Galactose Data:
- Successfully clustered genes into functional groups consistent with Gene Ontology (GO) categories.
- When $K$ was set high, the model naturally subdivided broad GO categories into biologically meaningful sub-clusters.
Acute Myeloid Leukaemia (AML) Mutation Data:
- Applied to TCGA mutation data (185 patients, 151 genes).
- Without variable selection: All samples collapsed into a single cluster (demonstrating the necessity of feature selection for sparse mutation data).
- With variable selection: Identified 6 key driver genes (including DNMT3A, NPM1, FLT3, IDH2, RUNX1, TP53) known to be clinically relevant for AML prognosis.
- Over-representation analysis confirmed these genes were significantly associated with AML.
Pan-Cancer Integrative Analysis:
- Applied to a "Matrix of Clusters" derived from 12 cancer types and 5 'omics' platforms (TCGA).
- Successfully separated samples by tissue of origin (e.g., LAML, OV, BRCA).
- Identified clinically relevant subtypes within breast cancer (BRCA), separating most Basal-like samples (132/141) from other subtypes, in line with PAM50 classifications.
- Demonstrated the ability to detect hierarchical structures (e.g., separating colorectal adenocarcinomas COAD/READ).

5. Significance

Precision Medicine: VICatMix provides a computationally efficient tool for stratifying patients into molecular subtypes, which is critical for tailoring treatments in precision medicine.
Driver Gene Discovery: The integrated variable selection capability allows researchers to filter out noise and identify the specific genomic features driving disease subtypes, as demonstrated in the AML application.
Integrative Analysis: The method facilitates the integration of diverse 'omics' data types (e.g., combining mutation, methylation, and expression data) to discover novel disease subtypes that might be missed by single-platform analyses.
Methodological Advancement: It bridges the gap between the statistical rigor of Bayesian mixture modelling and the computational feasibility required for modern high-throughput biological data, offering a robust alternative to slow MCMC methods.

In conclusion, VICatMix represents a significant step forward in the analysis of discrete biomedical data, offering a fast, accurate, and interpretable solution for clustering and feature selection in high-dimensional settings.