Finding stable clusterings of single-cell RNA-seq data

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Picture: The "Cell Party" Problem

Imagine you walk into a massive, noisy party with 100,000 people. Your goal is to figure out who belongs to which group. Are they all just "people," or are there distinct groups like "Dancers," "Talkers," "Eaters," and "Sleepers"?

In the world of biology, these "people" are cells, and the "groups" are cell types (like immune cells, skin cells, or cancer cells). Scientists use a tool called single-cell RNA sequencing to listen to what genes each cell is "saying" (expressing) to figure out who belongs where.

The problem? The data is messy. It's like trying to sort the party guests based on a blurry, static-filled recording. Sometimes, the computer groups people together that don't belong, or splits a single group into two. How do we know if the groups we found are real, or just a fluke of the noise?

The Core Idea: The "Half-Party" Test

The author, Victor Klebanoff, asks a simple but profound question: "If we had twice as many guests at the party, would our groups change?"

Since we can't magically summon more guests, he turns the question around and halves the party instead:

Take the whole party (all the data).
Split the guests into two random halves (Sample A and Sample B).
Try to sort Sample A into groups.
Try to sort Sample B into groups.
The Test: Do the groups in Sample A look like the groups in Sample B? Do they match the groups found in the full party?

If yes: The groups are stable. They are real, solid groups that exist regardless of who you happen to pick.
If no: The groups are unstable. They are likely just random noise or artifacts of the specific people you happened to pick.

The Method: Building a Family Tree

To do this sorting, the paper uses a specific algorithm that works like building a family tree: a hierarchy of repeated splits.

The Map: First, they turn the complex gene data into a map where similar cells are close together and different cells are far apart.
The Tree: They start with the whole group and split it in two. Then they split those two groups in two again, and so on. This creates a tree structure.
The Branches: The "length" of the branches in this tree represents how hard it was to split the groups. A short branch means the split was easy and clear. A long, shaky branch means the split was messy.
The Pruning: They look at this tree and say, "Okay, if we stop cutting here, we get 10 groups. If we stop there, we get 15." They test every possible number of groups to see which one is the most stable.

The "Outlier" Problem: The Loudmouths and the Ghosts

In any dataset, there are troublemakers.

The Loudmouths (Outliers): These are cells that are so weird or noisy that they mess up the whole map. They might be dead cells or cells that got damaged during the experiment.
The Ghosts: These are cells that don't fit anywhere.

The paper introduces a way to find and kick these troublemakers out before sorting begins. They look for cells that are "too far away" from their neighbors in the map. If a cell is an outlier, it's like a person at the party shouting over everyone else; if you remove them, the real groups become much clearer.

The Results: What Did They Find?

The author tested this method on seven different "parties" (datasets), ranging from a few thousand cells to roughly 100,000 cells.

The Success Stories:
- The "Zhengmix" Data: These were practice parties where the organizers knew exactly who belonged to which group. On the four-type version, the method recovered the groups perfectly. It was like sorting a deck of cards and getting all the suits right.
- The Lung Data: This was a huge, complex party. The method found a grouping of 16 clusters that was incredibly stable. It was so stable that even when they shuffled the guests, the groups stayed the same. This suggests these 16 groups are biologically real.
- The Retina Data: They found a stable way to sort eye cells, even though the data was tricky.

The Struggles:
- The Breast Cancer Data: This was a very messy party. No matter how they tried to sort the guests, the groups kept changing. The method concluded that the data was too noisy or the groups were too similar to be separated reliably. This is actually a good result! It tells scientists, "Hey, don't trust the groups you found in this data; they aren't stable."
- The Monocytes: This group of cells was so uniform (everyone looked the same) that no grouping the computer tried turned out to be stable. This makes sense biologically; they really are just one big group.

Why Does This Matter?

In science, reproducibility is king. If a scientist says, "I found a new type of cell," but another scientist tries to find it with a slightly different set of data and fails, the first discovery is shaky.

This paper provides a stability test. It's like a quality control check for data analysis.

Before: Scientists might just pick a number (e.g., "Let's find 10 groups") and hope for the best.
After: Scientists can now ask, "Is this grouping stable?" If the answer is yes, they can trust it. If the answer is no, they know to dig deeper or try a different method.

The Takeaway

Think of this paper as a detective's guide to sorting a chaotic crime scene. Instead of just guessing who the suspects are, the detective (the algorithm) checks if the clues hold up when you look at them from different angles (different samples).

If the clues point to the same suspects every time, you have a solid case. If the clues change every time you look, you know you're chasing ghosts. This method helps biologists stop chasing ghosts and start finding the real, stable groups of cells that make up our bodies.

1. Problem Statement

The clustering of single-cell RNA sequencing (scRNA-seq) data, typically represented as Unique Molecular Identifier (UMI) count matrices, is a fundamental step in identifying cell types. However, a major challenge in the field is the lack of consensus on how to determine if a clustering is stable or replicable.

The Core Question: If data for twice as many cells were available, would the clustering results change?
The Challenge: Traditional stability metrics often rely on arbitrary parameters or lack rigorous definitions of what constitutes a "stable" cluster versus an unstable one. Furthermore, a single clustering solution may contain a mix of highly stable and extremely unstable clusters, making it difficult to determine if the entire result is suitable for downstream analysis.
Goal: To develop a pipeline that not only generates clusterings but rigorously evaluates their stability and the stability of individual clusters within those solutions.

2. Methodology

The author proposes a comprehensive pipeline that integrates data preprocessing, spectral clustering, and a novel stability assessment framework based on subsampling.

A. Data Preprocessing and Transformation

Filtering: Genes with non-zero counts in fewer than 50 cells are removed. Cells with high mitochondrial gene content (in specific datasets) are excluded.
Variability Calculation: The variability of each gene is calculated using the Sum of Squares (SSQ) of Pearson residuals derived from a Poisson model. This metric ($S_g$) is computed for the full dataset and for multiple subsamples.
Dimensionality Reduction:
- Only genes that are highly variable in the full dataset and in every subsample are retained (Analysis Genes).
- A Pearson residuals matrix is constructed and treated as a low-rank matrix perturbed by noise.
- The rank of the unperturbed matrix is estimated using Erichson's optht program (implementing Gavish and Donoho's algorithm).
- Singular Value Decomposition (SVD) is applied to generate a low-rank Euclidean representation of the cells.
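
The preprocessing steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's code: the Poisson null model is taken to be (cell total × gene total) / grand total, every cell and gene is assumed to have at least one count, and the rank is estimated with Gavish and Donoho's published polynomial approximation for an unknown noise level rather than by calling the optht program. All function names are hypothetical.

```python
import numpy as np

def pearson_residuals(counts):
    """Poisson Pearson residuals for a cells x genes UMI count matrix."""
    counts = np.asarray(counts, dtype=float)
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    # Assumed null model: expected count = cell total * gene total / grand total.
    mu = cell_totals * gene_totals / counts.sum()
    return (counts - mu) / np.sqrt(mu)

def gene_ssq(residuals):
    """Per-gene sum of squared residuals (the variability score S_g)."""
    return (residuals ** 2).sum(axis=0)

def estimate_rank(matrix):
    """Count singular values above the approximate Gavish-Donoho threshold."""
    m, n = sorted(matrix.shape)          # m <= n
    beta = m / n                         # aspect ratio of the matrix
    # Approximation for unknown noise level: threshold = omega(beta) * median.
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    return int((singular_values > omega * np.median(singular_values)).sum())

def low_rank_embedding(matrix, rank):
    """Coordinates of each row (cell) along the leading singular directions."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :rank] * s[:rank]
```

In use, one would compute `pearson_residuals` on the filtered count matrix, keep the columns for the analysis genes, and pass the result to `estimate_rank` and `low_rank_embedding`.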

B. Outlier Detection

Euclidean Outliers: Before clustering, the distribution of $k$-nearest neighbor ($k$NN) distances in the Euclidean space is analyzed. Points with distances significantly larger than the mean (specifically, $> \text{mean} + 3\sigma$) are identified as outliers and excluded.
Iterative Filtering: The pipeline performs up to three iterations. In subsequent iterations, cells and genes that disproportionately contribute to the SSQ of Pearson residuals (identified via a closed-form expression in Appendix B) are removed to ensure the remaining data distribution is consistent across samples.
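
The Euclidean outlier rule can be illustrated as follows. This sketch assumes the per-cell statistic is the distance to the $k$-th nearest neighbor, which the summary does not spell out, and it uses brute-force pairwise distances that would be far too slow for 100k cells. The function name and defaults are hypothetical.

```python
import numpy as np

def knn_distance_outliers(points, k=64, n_sigma=3.0):
    """Flag points whose k-th nearest-neighbor distance exceeds mean + n_sigma * std."""
    points = np.asarray(points, dtype=float)
    sq = (points ** 2).sum(axis=1)
    # Squared pairwise distances; clip tiny negatives caused by rounding.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * points @ points.T, 0.0)
    dist = np.sqrt(d2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    kth = np.sort(dist, axis=1)[:, k - 1]     # distance to the k-th neighbor
    threshold = kth.mean() + n_sigma * kth.std()
    return kth > threshold                    # boolean mask, True = outlier
```

Cells flagged by the mask would be dropped before clustering, for example with `points[~knn_distance_outliers(points)]`.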

C. Clustering Algorithm

Divisive Hierarchical Spectral Clustering: Instead of using the Leiden algorithm (which requires tuning resolution parameters), the author uses a divisive approach based on Ng, Jordan, and Weiss' spectral clustering algorithm.
- Affinity: Defined as the inverse of the Euclidean distance between $k$-nearest neighbors (specifically $k=64$).
- Hierarchy: The algorithm recursively splits the data into two clusters until stopping conditions are met (maximum tree depth or minimum cluster size).
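
A single two-way split of this kind might look like the sketch below. It is a simplified stand-in, not the paper's procedure: the affinity matrix is symmetrized with an elementwise maximum (an assumption), and the split is taken from the sign of the second eigenvector of the normalized affinity matrix instead of the k-means step that Ng, Jordan, and Weiss apply to the leading eigenvectors. All names are hypothetical.

```python
import numpy as np

def knn_affinity(points, k):
    """Inverse-distance affinity between each point and its k nearest neighbors."""
    points = np.asarray(points, dtype=float)
    sq = (points ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * points @ points.T, 0.0)
    dist = np.sqrt(d2)
    np.fill_diagonal(dist, np.inf)                 # exclude self-matches
    n = len(points)
    rows = np.repeat(np.arange(n), k)
    cols = np.argsort(dist, axis=1)[:, :k].ravel()
    affinity = np.zeros((n, n))
    affinity[rows, cols] = 1.0 / np.maximum(dist[rows, cols], 1e-12)
    return np.maximum(affinity, affinity.T)        # symmetrize (assumption)

def spectral_bipartition(affinity):
    """Split a connected graph in two using the second normalized eigenvector."""
    inv_sqrt_degree = 1.0 / np.sqrt(affinity.sum(axis=1))
    normalized = affinity * inv_sqrt_degree[:, None] * inv_sqrt_degree[None, :]
    _, vectors = np.linalg.eigh(normalized)        # eigenvalues in ascending order
    return vectors[:, -2] > 0                      # sign split (simplification)

def normalized_cut(affinity, mask):
    """Normalized Cut value of the split described by a boolean mask."""
    cut = affinity[mask][:, ~mask].sum()
    return cut / affinity[mask].sum() + cut / affinity[~mask].sum()
```

A divisive hierarchy would apply `spectral_bipartition` recursively to each side until a depth or size limit is reached, recording `normalized_cut` for every split. The sign-based split is only meaningful when the k-nearest-neighbor graph is connected.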
Tree Mapping: The resulting hierarchy is mapped to a set of nested clusterings. The "length" of a branch in the tree is defined by the Normalized Cut value. By sorting nodes by their distance from the root, the tree is converted into a sequence of clusterings of increasing size (e.g., 2-cluster, 3-cluster, etc.).
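
One possible reading of the tree-mapping step is sketched below. It assumes each internal node stores the Normalized Cut value of its own split (`split_cost`, a hypothetical field) and that a node's distance from the root is the sum of these values along the path down to it. Applying the splits in order of increasing distance then yields clusterings with 1, 2, 3, ... clusters. The summary does not confirm these details.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    cells: List[int]
    split_cost: float = 0.0            # Ncut of this node's split (assumed meaning)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def nested_clusterings(root):
    """Return clusterings of increasing size, one per split, as lists of cell lists."""
    splits = []
    stack = [(root, 0.0)]
    while stack:
        node, distance_above = stack.pop()
        if node.left is None or node.right is None:
            continue                   # leaves are never split
        distance = distance_above + node.split_cost
        splits.append((distance, node))
        stack.append((node.left, distance))
        stack.append((node.right, distance))
    # Stable sort: a parent always precedes its children, even on ties.
    splits.sort(key=lambda item: item[0])

    clusters = [root]
    sequence = [[sorted(root.cells)]]
    for _, node in splits:
        index = next(i for i, c in enumerate(clusters) if c is node)
        clusters[index:index + 1] = [node.left, node.right]
        sequence.append([sorted(c.cells) for c in clusters])
    return sequence
```

Under this reading, the cleanest splits (lowest cumulative cut value) are applied first, so small clusterings reflect the most clearly separated groups.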

D. Stability Assessment Framework

The core innovation is the method for evaluating stability:

Subsampling: The full dataset is randomly split into complementary pairs of samples (e.g., 20 pairs, 40 samples total).
Comparison:
- A clustering is generated for the full dataset ($C$) and for each subsample ($C_s$).
- $C$ is restricted to the cells of the subsample and compared against $C_s$, the clustering generated solely from that subsample.
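
The subsampling and restriction steps can be sketched with the standard library alone. This is an illustration under assumptions: cells are identified by integer index, each pair is produced by an independent random shuffle, and the function names are hypothetical.

```python
import random

def complementary_pairs(n_cells, n_pairs=20, seed=0):
    """Return n_pairs of (half_a, half_b) index lists that together cover all cells."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        order = list(range(n_cells))
        rng.shuffle(order)
        half = n_cells // 2
        pairs.append((sorted(order[:half]), sorted(order[half:])))
    return pairs

def restrict(labels_full, sample):
    """Labels of the full-data clustering for the cells in one sample."""
    return [labels_full[i] for i in sample]
```

With 20 pairs this yields 40 samples; each sample is clustered on its own, and the result is compared with `restrict(labels_full, sample)`.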
Metrics:
- Misclassification Error Distance (MED): Measures the global disagreement between the full-set clustering and the sample clustering. It is normalized by the error expected under random shuffling of the labels.
- Cluster Misclassification Error Rate (CMER): Measures the error rate for specific clusters within a sample.
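
As an illustration of the global metric, the sketch below computes a misclassification error distance as one minus the largest fraction of cells on which two labelings agree under the best one-to-one matching of clusters. The matching is found by brute force over permutations, which is only practical for a handful of clusters, and the normalization by a shuffled-label baseline is an assumed reading of the description above. Function names and defaults are hypothetical.

```python
import random
from collections import Counter
from itertools import permutations

def med(labels_a, labels_b):
    """Misclassification error distance between two labelings of the same cells."""
    overlap = Counter(zip(labels_a, labels_b))
    names_a, names_b = sorted(set(labels_a)), sorted(set(labels_b))
    if len(names_a) > len(names_b):
        # Make side "a" the one with fewer clusters so matchings are injective.
        names_a, names_b = names_b, names_a
        overlap = Counter({(b, a): c for (a, b), c in overlap.items()})
    best = max(
        sum(overlap[(a, b)] for a, b in zip(names_a, match))
        for match in permutations(names_b, len(names_a))
    )
    return 1.0 - best / len(labels_a)

def normalized_med(labels_full, labels_sample, n_shuffles=200, seed=0):
    """MED divided by its average value when the sample labels are shuffled."""
    rng = random.Random(seed)
    shuffled = list(labels_sample)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += med(labels_full, shuffled)
    baseline = total / n_shuffles
    if baseline == 0.0:
        return 0.0                     # degenerate case: nothing to disagree on
    return med(labels_full, labels_sample) / baseline
```

For larger numbers of clusters, an assignment solver such as the Hungarian algorithm would replace the permutation search.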
Stability Criteria:
- A clustering is considered stable if the 90th percentile of normalized MED $\le 0.10$.
- A cluster is considered stable if the 90th percentile of normalized CMER $\le 0.50$.
- A clustering is deemed admissible for downstream analysis if its unstable clusters contain fewer than 500 cells.
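
The three criteria can be checked mechanically once the per-sample metrics are available. The sketch below assumes that "admissible" means a stable clustering whose unstable clusters together hold fewer than 500 cells, which is one reading of the rule above, and that percentiles use linear interpolation. The function and key names are hypothetical.

```python
from statistics import quantiles

MED_LIMIT = 0.10           # 90th percentile of normalized MED
CMER_LIMIT = 0.50          # 90th percentile of normalized CMER
MAX_UNSTABLE_CELLS = 500   # cell budget for unstable clusters

def p90(values):
    """90th percentile with linear interpolation (needs at least two values)."""
    return quantiles(values, n=10, method="inclusive")[-1]

def assess(med_per_sample, cmer_per_cluster, cluster_sizes):
    """Apply the stability and admissibility thresholds to one clustering."""
    clustering_stable = p90(med_per_sample) <= MED_LIMIT
    unstable = sorted(
        name for name, rates in cmer_per_cluster.items() if p90(rates) > CMER_LIMIT
    )
    unstable_cells = sum(cluster_sizes[name] for name in unstable)
    return {
        "clustering_stable": clustering_stable,
        "unstable_clusters": unstable,
        "unstable_cells": unstable_cells,
        # Assumed reading: stable overall and few cells in unstable clusters.
        "admissible": clustering_stable and unstable_cells < MAX_UNSTABLE_CELLS,
    }
```

Here `med_per_sample` is a list with one normalized MED value per subsample, and `cmer_per_cluster` maps each cluster name to its list of normalized CMER values.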

3. Key Contributions

Novel Stability Definition: Proposes a rigorous, quantitative definition of stability based on the consistency of clusterings across random half-samples, moving beyond qualitative assessments.
Hierarchical Spectral Clustering with Tree Mapping: Introduces a method to generate a full hierarchy of clusterings from a single spectral clustering run, avoiding the need to tune resolution parameters for different cluster sizes.
Iterative Outlier Removal: Develops a specific metric based on Pearson residuals to identify and remove "recalcitrant" cells and genes that destabilize the clustering, improving the robustness of the results.
Cluster-Level Stability: Distinguishes between the stability of the entire clustering solution and the stability of individual clusters, acknowledging that a solution can be partially stable.

4. Results

The pipeline was tested on seven public datasets (ranging from ~4k to ~100k cells):

Zhengmix4eq (4 cell types): The method identified a 4-cluster solution that was extremely stable (CMER $\le 0.02$) and perfectly matched the ground truth labels.
Zhengmix8eq (8 cell types): A 7-cluster solution was found to be very stable. The 8-cluster solution was stable but showed lower agreement with ground truth for T-cell subtypes, consistent with known difficulties in separating T-cell variants.
CD14 Monocytes: No stable clusterings were found (high MED for all sizes), correctly suggesting the data represents a homogeneous population where forced clustering is spurious.
68k PBMC:
- An admissible 12-cluster solution was found (stable global MED, with unstable clusters being small).
- A 9-cluster solution (compatible with some published k-means results) was found to be unstable (90th percentile of normalized MED = 0.21), with several totally unstable clusters.
25k Retinal: An admissible 11-cluster solution was identified. It showed good compatibility with published clusters, though it split some known cell types (rods and cones), suggesting potential biological substructure or instability in those specific splits.
65k Lung:
- A 19-cluster solution was admissible but contained totally unstable clusters.
- A 16-cluster solution was found to be exceptionally stable (90th percentile of normalized MED $\approx 0.01$), with nearly all clusters being extremely stable. This solution aligned well with 56 reported cell types (Adjusted Rand Index = 0.81).
100k Breast Cancer: No clustering was found to be fully stable (MED > 0.10). The best solution (9 clusters) had unstable clusters, particularly one representing myeloid cells. The iterative filtering disproportionately affected plasmablasts, highlighting challenges with highly variable cell populations in large datasets.

5. Significance

Reliability in Downstream Analysis: The paper provides a practical framework to filter out unstable clusterings before performing differential expression or biological interpretation, reducing the risk of false discoveries.
Handling Heterogeneity: The method successfully identifies when data is too homogeneous to cluster (monocytes) or when specific clusters are inherently unstable (T-cells, plasmablasts), offering a more nuanced view than standard "best-fit" clustering.
Reproducibility: By defining stability through subsampling consistency, the approach directly addresses the replicability crisis in scRNA-seq analysis.
Future Directions: The author notes that while the current thresholds (e.g., 500 cells for unstable clusters) are arbitrary, the framework allows for the systematic exploration of stability criteria. The work also highlights the need for faster distance calculation methods for high-dimensional data and better understanding of why certain biological groups (like T-cells) remain difficult to separate.

In summary, Klebanoff presents a robust, mathematically grounded pipeline that shifts the focus from simply finding a clustering to finding stable and admissible clusterings, thereby enhancing the reliability of single-cell genomics research.