GraphHDBSCAN*: Graph-based Hierarchical Clustering on High Dimensional Single-cell RNA Sequencing Data

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a librarian trying to organize a massive, chaotic library containing millions of books. But these aren't normal books; they are tiny, fragile scrolls written in a language so complex and sparse that most of the pages are blank. This is what scientists face when they try to organize Single-Cell RNA sequencing (scRNA-seq) data. Each "book" is a single cell from your body, and the "words" are the genes it is using.

The goal is to group these cells into families (like "muscle cells," "immune cells," or "nerve cells") to understand how the body works. But there's a catch: these families aren't just flat lists. They have a family tree. A broad category like "Immune Cell" splits into "White Blood Cells," which then splits into "Monocytes," which further splits into specific subtypes.

The Problem: The "Flat Map" vs. The "Tree"

Currently, most scientists use tools like Louvain or Leiden to organize these cells. Think of these tools as a flat map. They are great at drawing borders between countries (cell types), but they treat every country as a separate, isolated island. They ignore the fact that some countries are neighbors, or that a region is actually a province of a larger nation. They also struggle when the library is so huge and the books are so similar that it's hard to tell them apart (the "curse of dimensionality").

Other tools, like HDBSCAN, try to draw a family tree based on how crowded different areas of the library are. However, in a massive, high-dimensional library, these tools often get confused. They might think two completely different books are the same because the "distance" between them looks weird in such a vast space. Worse, they often throw away thousands of books as "garbage" (noise) because they can't figure out where they belong.

The Solution: GraphHDBSCAN*

The authors of this paper introduce a new tool called GraphHDBSCAN*. You can think of this as a smart, 3D holographic map that understands both the flat layout and the family tree simultaneously.

Here is how it works, using simple analogies:

1. The "Friendship Network" (Graph Construction)

Instead of trying to measure the distance between every single book (which is impossible in a huge library), GraphHDBSCAN* first builds a friendship network.

It asks: "Who are the top 10 neighbors of this book?"
Then, it looks deeper: "Do these neighbors also know each other?"
This creates a Weighted Structural Similarity (WSS) graph. Imagine a web where strong lines connect books that share many mutual friends. This web is much more stable and reliable than just measuring raw distance in a foggy room.

2. The "Crowd Density" Detective (Hierarchical Clustering)

Once the web is built, the tool acts like a detective looking for crowds.

It doesn't just look for one big crowd; it looks for crowds inside crowds.
It can see a massive crowd of "Immune Cells," then zoom in to see a smaller, denser crowd of "Monocytes" inside that, and even smaller groups of specific subtypes.
Because it uses the "friendship web" instead of raw distance, it doesn't get confused by the size of the library. It finds the structure naturally.

3. The "Rescue Team" (Label Propagation)

One of the biggest headaches in cell biology is the "noise." These are cells that the computer thinks are garbage or outliers. In real life, these might just be rare cells or cells in a weird state, not garbage.

Old methods would just throw these cells in the trash bin.
GraphHDBSCAN* has a Rescue Team. It looks at the "noise" cells and asks, "Who are your closest friends in the friendship web?" It then gently assigns them to the most likely family.
In the paper's experiments, this team successfully "rescued" thousands of cells that were previously discarded, reassigning them to the correct cell types with high accuracy.

Why This Matters

The authors tested their new tool against the current industry standards (Louvain and Leiden) on real biological data.

The Result: GraphHDBSCAN* didn't just draw a better flat map; it drew a better family tree.
The Discovery: It found hidden subtypes of cells (like specific types of Monocytes) that other methods missed. It revealed the "hierarchy" of life that was previously invisible.
The Efficiency: It does all this without needing the user to tweak a million settings (it's "hyperparameter-free" in practice), making it easy for biologists to use.

The Big Picture

If traditional methods are like sorting a deck of cards into piles of suits (Hearts, Spades, etc.), GraphHDBSCAN* is like sorting them into suits and then noticing that the Hearts are a family with a King, Queen, and Jack, and that the Jacks have their own distinct personalities.

It turns a flat, confusing list of millions of cells into a clear, navigable family tree of life, helping scientists understand not just what cells exist, but how they are related and how they evolve. This is a huge step forward in decoding the complexity of human biology.

1. Problem Statement

Single-cell RNA sequencing (scRNA-seq) generates massive, high-dimensional, and sparse datasets to study cellular heterogeneity. A primary challenge in analyzing this data is clustering: grouping cells with similar expression patterns to identify biologically meaningful types and states.

Limitations of Current Methods:
- Modularity-based methods (e.g., Louvain, Leiden): These are the current standards (used in Seurat and SCANPY) but produce only flat partitions. They ignore the natural hierarchical organization of cell types (e.g., broad types splitting into subtypes) and are sensitive to hyperparameters (specifically the resolution parameter) and stochasticity.
- Density-based methods (e.g., HDBSCAN*): While capable of handling varying densities and producing hierarchical structures, standard HDBSCAN* relies on pairwise distances. In high-dimensional spaces, the "curse of dimensionality" renders Euclidean or cosine distances less informative, leading to poor cluster formation and excessive noise detection (discarding valid cells as outliers).
The Gap: There is a need for a method that combines the hierarchical, density-based strengths of HDBSCAN* with the robustness of graph-based representations to handle high-dimensional scRNA-seq data effectively without requiring prior dimensionality reduction (like PCA or UMAP).

2. Methodology: GraphHDBSCAN*

The authors propose GraphHDBSCAN*, a graph-based, hyperparameter-free extension of HDBSCAN* designed for high-dimensional data. The workflow consists of four main stages:

A. Graph Construction and Transformation

Instead of operating directly on raw feature space distances, the method constructs a graph representation:

k-NN Graph: A k-nearest neighbor graph is built from the raw gene expression data.
Weighted Structural Similarity (WSS): The edges of the k-NN graph are re-weighted using a Weighted Structural Similarity metric. This generalizes the Shared Nearest Neighbor (SNN) concept to weighted graphs.
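- Formula: $\sigma(u,v) = \frac{\sum_{x \in \Gamma_u \cap \Gamma_v} w(u,x)w(v,x)}{\sqrt{\sum_{x \in \Gamma_u} w^2(u,x)} \sqrt{\sum_{x \in \Gamma_v} w^2(v,x)}}$, where $\Gamma_u$ is the neighbor set of node $u$ in the k-NN graph and $w(u,x)$ is the weight of the edge between $u$ and $x$.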
- This captures both neighborhood overlap and edge strength, providing stability in high dimensions.
Dissimilarity Conversion: The similarity weights are transformed into dissimilarities ( $d = 1 - \sigma$ ) to serve as input for density-based clustering concepts.
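
To make the WSS formula and the $d = 1 - \sigma$ conversion concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes the k-NN graph is stored as a symmetric dictionary of dictionaries, and it assumes closed neighborhoods (each node counts as its own neighbor with a hypothetical `self_weight` of 1.0), which the summary above does not specify. The function names are illustrative.

```python
import math


def weighted_structural_similarity(adj, u, v, self_weight=1.0):
    """Cosine-style similarity between the weighted neighborhoods of u and v.

    adj maps node -> {neighbor: weight} and is assumed to be symmetric.
    """
    # Assumption: closed neighborhoods, so each node is its own neighbor.
    nu = dict(adj[u])
    nu[u] = self_weight
    nv = dict(adj[v])
    nv[v] = self_weight

    shared = nu.keys() & nv.keys()
    numerator = sum(nu[x] * nv[x] for x in shared)
    norm_u = math.sqrt(sum(w * w for w in nu.values()))
    norm_v = math.sqrt(sum(w * w for w in nv.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    # Clamp to guard against floating-point values slightly above 1.
    return min(1.0, numerator / (norm_u * norm_v))


def wss_dissimilarity_graph(adj):
    """Re-weight every existing edge with d = 1 - sigma."""
    return {
        u: {v: 1.0 - weighted_structural_similarity(adj, u, v) for v in nbrs}
        for u, nbrs in adj.items()
    }


# Toy graph: a tight triangle (a, b, c) with a weakly attached node d.
toy = {
    "a": {"b": 1.0, "c": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0, "d": 0.5},
    "d": {"c": 0.5},
}
dissimilarity = wss_dissimilarity_graph(toy)
```

In this toy graph the edge between `a` and `b` gets a dissimilarity near zero because the two nodes share their entire neighborhood, while the edge between `c` and `d` stays comparatively large.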

B. Hierarchical Density Clustering

The method applies HDBSCAN* principles directly to the WSS graph:

Core Distances & Mutual Reachability: Core distances and Mutual Reachability Distances (MRD) are computed based on the graph edges rather than a complete distance matrix.
CORE-SG Optimization: To efficiently explore the hierarchy across a range of minPts (density smoothing) values without re-computing Minimum Spanning Trees (MSTs) for every parameter, the authors utilize CORE-SG (Core-distance based Spanning Graph). This allows the derivation of multiple MSTs from a single compact graph, making the algorithm effectively hyperparameter-free regarding the density scale.
Output: A condensed tree representing the full density-based hierarchy of the data.
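
The core-distance and mutual-reachability step can be sketched on a sparse graph as follows. This is a simplified illustration for a single `min_pts` value and does not implement CORE-SG. It assumes that a node's core distance is its `min_pts`-th smallest incident dissimilarity (falling back to the largest one when the node has fewer neighbors), which is one possible convention rather than the paper's confirmed definition. The spanning tree is built with a plain Kruskal pass, and node ids are assumed to be comparable (integers here).

```python
def core_distances(dis, min_pts):
    """dis maps node -> {neighbor: dissimilarity}; returns node -> core distance."""
    core = {}
    for u, nbrs in dis.items():
        ranked = sorted(nbrs.values())
        if not ranked:
            core[u] = float("inf")  # isolated node: no density estimate
        else:
            # Assumption: min_pts-th smallest edge, or the largest if too few.
            core[u] = ranked[min(min_pts, len(ranked)) - 1]
    return core


def mutual_reachability_edges(dis, core):
    """MRD(u, v) = max(core(u), core(v), d(u, v)), for existing edges only."""
    edges = []
    for u, nbrs in dis.items():
        for v, d in nbrs.items():
            if u < v:  # visit each undirected edge once
                edges.append((max(core[u], core[v], d), u, v))
    return edges


def minimum_spanning_forest(nodes, edges):
    """Kruskal's algorithm with a small union-find structure."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


# Toy dissimilarity graph: nodes 0-2 are close, node 3 hangs off node 2.
dis = {
    0: {1: 0.1, 2: 0.2},
    1: {0: 0.1, 2: 0.15},
    2: {0: 0.2, 1: 0.15, 3: 0.8},
    3: {2: 0.8},
}
core = core_distances(dis, min_pts=2)
mst = minimum_spanning_forest(dis.keys(), mutual_reachability_edges(dis, core))
```

Removing the heaviest edges of such a tree first is what produces the nested, density-based hierarchy: here the edge to node 3 would be cut long before the dense triangle breaks apart.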

C. Flat Partitioning (FOSC)

To extract a flat clustering solution from the hierarchy, the method uses the Framework for Optimal Selection of Clusters (FOSC). This relies on the Excess of Mass (EOM) criterion to select clusters that persist (are stable) across the widest range of density levels.
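
The Excess of Mass selection can be illustrated with a short bottom-up pass over a cluster tree. The sketch below assumes that each candidate cluster already has a precomputed stability score and that the root (the whole dataset) is never selected. Both are assumptions made for illustration, and the tree encoding (`children`, `stability`) is hypothetical rather than taken from the paper.

```python
def select_clusters_eom(children, stability, root):
    """Pick non-overlapping clusters that maximize total stability.

    children maps a cluster id to the list of its child cluster ids;
    stability maps a cluster id to its stability score.
    """
    selected = set()

    def descendants(node):
        stack = list(children.get(node, []))
        while stack:
            n = stack.pop()
            yield n
            stack.extend(children.get(n, []))

    def visit(node):
        kids = children.get(node, [])
        if not kids:
            selected.add(node)
            return stability[node]
        best_below = sum(visit(k) for k in kids)
        # Keep the parent only if it is at least as stable as the best
        # combination of clusters found in its subtree.
        if node != root and stability[node] >= best_below:
            selected.difference_update(descendants(node))
            selected.add(node)
            return stability[node]
        return best_below

    visit(root)
    return selected


# Toy hierarchy: root 0 splits into 1 and 2; cluster 1 splits into 3 and 4.
children = {0: [1, 2], 1: [3, 4]}
stability = {0: 10.0, 1: 5.0, 2: 3.0, 3: 1.0, 4: 2.0}
flat = select_clusters_eom(children, stability, root=0)
```

With these scores, cluster 1 (stability 5.0) is kept in preference to its children (combined stability 3.0), so the flat partition consists of clusters 1 and 2.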

D. Label Propagation for Noise

Standard HDBSCAN* labels sparse points as noise. In scRNA-seq, discarding cells is often undesirable. GraphHDBSCAN* employs a semi-supervised label propagation strategy (based on HDBSCAN*(cd,–)):

Non-noise points are treated as labeled data.
Labels are propagated along the MST to noise points, assigning them to the densest accessible region.
This ensures a complete partition of the dataset while maintaining the statistical principles of the density-based approach.
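
One simple way to picture propagation along the MST is a multi-source expansion that always crosses the cheapest available tree edge from an already labeled point. The sketch below is a simplified stand-in and is not claimed to match the HDBSCAN*(cd,–) procedure used in the paper. The edge format `(weight, u, v)` and the use of `None` for noise are assumptions for illustration.

```python
import heapq


def propagate_labels(mst_edges, labels):
    """Assign each noise point (label None) the label that reaches it first.

    mst_edges is a list of (weight, u, v) tuples; labels maps node -> label.
    """
    adj = {}
    for w, u, v in mst_edges:
        adj.setdefault(u, []).append((w, v))
        adj.setdefault(v, []).append((w, u))

    out = dict(labels)
    frontier = []
    # Seed the frontier with tree edges leaving already labeled points.
    for u, lab in labels.items():
        if lab is not None:
            for w, v in adj.get(u, []):
                if out.get(v) is None:
                    heapq.heappush(frontier, (w, v, lab))

    while frontier:
        w, v, lab = heapq.heappop(frontier)
        if out.get(v) is not None:
            continue  # already claimed through a cheaper edge
        out[v] = lab
        for w2, x in adj.get(v, []):
            if out.get(x) is None:
                heapq.heappush(frontier, (w2, x, lab))
    return out


# Toy path 0-1-2-3-4: the two ends are labeled, the middle three are noise.
mst_edges = [(0.1, 0, 1), (0.3, 1, 2), (0.2, 2, 3), (0.1, 3, 4)]
labels = {0: "A", 1: None, 2: None, 3: None, 4: "B"}
completed = propagate_labels(mst_edges, labels)
```

In the toy path, point 2 joins cluster "B" because the edge linking it to the "B" side (0.2) is cheaper than the one linking it to the "A" side (0.3).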

3. Key Contributions

Graph-Based HDBSCAN*: A novel adaptation of HDBSCAN* that operates on a Weighted Structural Similarity graph, overcoming the limitations of distance-based metrics in high-dimensional spaces.
Hierarchical Recovery: Unlike Louvain/Leiden, it recovers an interpretable, density-based hierarchy that reveals how cell populations split and merge, aligning with biological ontologies.
Hyperparameter-Free Operation: By leveraging CORE-SG, the method computes a family of hierarchies across minPts values automatically, removing the need for manual tuning of the density parameter.
Robust Noise Handling: Introduces a density-aware label propagation mechanism that rescues "noise" cells (often doublets or rare states) by assigning them to appropriate clusters, rather than discarding them.
No Dimensionality Reduction Required: The method can operate directly on high-dimensional data after standard feature selection (HVGs), avoiding the information loss and distortion associated with PCA/t-SNE/UMAP embeddings.

4. Results and Evaluation

The authors evaluated GraphHDBSCAN* on multiple scRNA-seq datasets (including PBMC, Zheng, and CITE-seq data) against state-of-the-art methods (Louvain, Leiden, and original HDBSCAN*).

Biological Insight:
- On CITE-seq data, the hierarchy correctly separated Monocytes from T-cells and further resolved Monocyte subtypes (Classical vs. Non-classical) based on markers like CD11c and CD36, which were missed by flat clustering in the original study.
- On the Zheng dataset, it identified previously unreported CD34+ progenitor subpopulations and correctly separated NK cells and B-cells.
Benchmarking (Flat Partitioning):
- Metrics: Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI).
- Performance: GraphHDBSCAN* consistently outperformed or matched Louvain and Leiden across datasets. It showed lower variability and higher stability even when run with its default hyperparameters, whereas the Louvain/Leiden defaults are often already tuned for scRNA-seq.
- Comparison: It significantly outperformed the original HDBSCAN*, demonstrating the necessity of the graph-based transformation.
Label Propagation:
- In the PBMC3k dataset, the method initially identified ~62% of cells as noise. After label propagation, these were reassigned to biologically consistent clusters (e.g., T-cells, cDC2).
- Correlation analysis showed that label-propagated cells had high gene expression correlation (0.98) with their assigned clusters, confirming the accuracy of the reassignment.
Scalability: Runtime analysis showed GraphHDBSCAN* scales smoothly with dataset size, with only a modest overhead compared to Louvain/Leiden due to the hierarchical computation.
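
For readers unfamiliar with the benchmarking metrics above, the Adjusted Rand Index can be computed from a contingency count of label pairs. The following is a minimal standard-library sketch of the usual ARI formula, included only to show what the score measures. AMI is not shown, and the label vectors are made up for illustration rather than taken from the paper. In practice, scikit-learn's `adjusted_rand_score` and `adjusted_mutual_info_score` would typically be used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """ARI between two flat labelings of the same items."""
    if len(truth) != len(pred):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0

    # Pairs of items grouped together in both, in truth, and in pred.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())

    expected = together_truth * together_pred / total_pairs
    maximum = (together_truth + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (maximum - expected)


# Same grouping under different label names scores 1.0.
same_grouping = adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])
# A grouping that splits every true pair scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score is corrected for chance, it is invariant to how clusters are named and can become negative when two partitions agree less than random labelings would.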

5. Significance

GraphHDBSCAN* represents a significant advancement in single-cell analysis by bridging the gap between structural graph clustering and density-based hierarchical clustering.

Biological Relevance: It provides a more natural representation of cell differentiation and lineage, capturing the "tree of life" structure inherent in biology rather than forcing flat partitions.
Methodological Shift: It challenges the reliance on dimensionality reduction (UMAP/PCA) as a prerequisite for clustering, suggesting that graph-based structural similarity is a more robust feature space for high-dimensional biological data.
Practical Utility: By offering both a hierarchical view for exploration and a high-quality flat partition for downstream analysis (with optional noise rescue), it serves as a comprehensive tool for resolving complex cellular heterogeneity.