GraphBG: Fast Bayesian Domain Detection via Spectral Graph Convolutions for Multi-slice and Multi-modal Spatial Transcriptomics

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are trying to understand a massive, bustling city. You have a map where every single building (a cell) has a list of its activities (gene expression). Crucially, you also have the exact GPS coordinates for every building.

Your goal is to figure out which buildings belong to which neighborhood (spatial domains). Is that building part of the "downtown financial district," the "quiet residential suburb," or the "industrial factory zone"?

The Problem:
Existing tools for mapping these cities have three big flaws:

They are slow: If the city has 300,000 buildings, old tools might take days to figure out the neighborhoods.
They get lost: If you have maps of the same city from different days (multiple tissue slices), old tools can't easily stitch them together into one big picture.
They are one-dimensional: They only look at the building's activities. They ignore other clues, like the color of the paint on the roof (protein data) or the type of foundation (chromatin data), which could help identify the neighborhood better.

The Solution: GraphBG
The authors of this paper built a new tool called GraphBG (Graph-based Bayesian Gaussian Mixture). Think of it as a super-smart, high-speed urban planner that uses a "connect-the-dots" approach to map the city.

Here is how it works, using simple analogies:

1. The "Neighborhood Watch" (Spectral Graph Convolutions)

Instead of looking at a building in isolation, GraphBG looks at its immediate neighbors. It draws a web connecting every building to the 4 closest ones.

The Analogy: Imagine a game of "telephone." If a building says, "I'm a factory," GraphBG checks its neighbors. If the neighbors are also factories, it confirms the label. If a factory is suddenly surrounded by houses, the tool realizes something is weird and adjusts. It uses a mathematical shortcut (approximate spectral graph convolution) to do this "neighbor check" incredibly fast, even for huge cities.

2. The "Uncertainty Detective" (Variational Bayesian Model)

Once the tool has checked the neighborhoods, it needs to group the buildings. Many older tools commit each building to a single group with no measure of confidence. GraphBG is a "probabilistic detective."

The Analogy: Instead of saying, "This building is a house," it says, "I am 90% sure this is a house, but there's a 10% chance it's a mixed-use building." This "uncertainty awareness" prevents the tool from making rigid mistakes and helps it handle messy data where boundaries aren't clear.

3. The "City Planner" for Big Data (Metacells & Multi-Slice)

When the city is too big (hundreds of thousands of buildings), GraphBG doesn't try to analyze every single brick.

The Analogy: It groups 50 nearby buildings into a "Super-Block" (called a Metacell). It analyzes the Super-Block instead of the individual buildings. This makes the math 100x faster.
The Multi-Slice Trick: If you have 31 different maps of the same city taken at different times, GraphBG uses a "batch correction" tool (like a translator) to ensure that a "Super-Block" on Map A means the same thing as a "Super-Block" on Map B. It then stitches them all together into one giant, coherent map.

4. The "Multi-Sensory Detective" (Multi-Modal)

Sometimes, gene expression (the building's activities) isn't enough. You might also have protein data (the building's paint color) or DNA accessibility data (the building's foundation).

The Analogy: GraphBG listens to all these different "languages" at once. It uses a technique called Kernel CCA to translate the "paint color" language and the "foundation" language into a common dialect. Now, it can use all the clues to decide if a building is a factory, rather than just guessing based on one clue.

Why is this a big deal?

The paper tested GraphBG on real biological data, including a massive dataset of 370,000 cells from 31 slices of mouse tissue.

Speed: While other tools took hours or crashed due to memory limits, GraphBG finished the job in 5 minutes.
Accuracy: It correctly identified biological "neighborhoods" (like liver zones) that other tools missed or got wrong.
Discovery: When applied to a diseased liver, it didn't just find the damage; it showed how the disease spread from the liver cells to the surrounding tissue, revealing a story of inflammation and scarring that other tools couldn't see.

In Summary:
GraphBG is like upgrading from a hand-drawn sketch to a real-time, AI-powered satellite map. It's fast enough to handle the biggest cities, smart enough to stitch together different maps, and sensitive enough to use every clue available to tell you exactly where you are in the tissue.

1. Problem Statement

Spatial Transcriptomics (ST) technologies allow for gene expression profiling with spatial context, enabling the reconstruction of tissue architecture. However, current analysis methods face three critical bottlenecks as datasets scale:

Scalability: Existing tools struggle with modern datasets that contain hundreds of thousands of spots (e.g., from MERFISH or Slide-seqV2) and span multiple tissue slices, often failing due to memory constraints or excessive runtime.
Multi-slice Integration: Most methods analyze single slices independently. This leads to inconsistent domain labels and granularities across slices, hindering the construction of unified spatial atlases.
Multi-modal Limitations: Current tools generally assume unimodal input (gene expression only). They lack robust frameworks to integrate emerging multi-omics data (e.g., gene expression + protein abundance, or chromatin accessibility) to improve domain resolution.

2. Methodology: The GraphBG Framework

The authors propose GraphBG, a unified framework that combines approximate spectral graph convolutions with a Variational Bayesian Gaussian Mixture Model (VB-GMM). The framework is extended into two specific variants: GraphBG-MS (Multi-Slice) and GraphBG-MM (Multi-Modal).

A. Core Components (Unimodal GraphBG)

Preprocessing: Standard normalization, log-transformation, and selection of Highly Variable Genes (HVGs), followed by PCA for dimensionality reduction.
Spatial Graph Construction: An undirected graph is built where nodes are spots/cells and edges connect spatially adjacent neighbors (using Euclidean distance).
Spectral Graph Convolutions:
- Instead of deep Graph Neural Networks (GNNs), which are computationally heavy, GraphBG uses a first-order Chebyshev approximation of spectral graph convolutions.
- This acts as a normalized graph smoothing operator ($D^{-1/2} A D^{-1/2}$, where $A$ is the adjacency matrix of the spatial graph and $D$ is its diagonal degree matrix), efficiently encoding local spatial dependencies into the embeddings without iterative training.
Variational Bayesian Clustering (VB-GMM):
- The smoothed embeddings are clustered using a VB-GMM.
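- Unlike a standard GMM, which is fitted by maximum likelihood with Expectation-Maximization (EM), the Bayesian formulation places priors on the parameters (Dirichlet for the mixing coefficients, Gaussian-Wishart for the means and precisions) and infers approximate posteriors over them.
- Advantage: This provides uncertainty-aware clustering, mitigates overfitting, and regularizes model complexity, since the Dirichlet prior can shrink the weights of unneeded components toward zero.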
Post-Processing: A refinement step reassigns spot labels based on the majority vote of their 50 nearest spatial neighbors to ensure spatial coherence.
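
The core steps listed above (spatial graph, one-pass smoothing, VB-GMM, majority-vote refinement) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`knn_adjacency`, `smooth`, `refine`), the toy data, the added self-loops, and the use of scikit-learn's `BayesianGaussianMixture` as the VB-GMM are all choices made for this sketch. The 4-neighbour graph and the 50-neighbour vote follow the numbers quoted in this summary.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from sklearn.neighbors import NearestNeighbors


def knn_adjacency(coords, k=4):
    """Symmetric 0/1 adjacency linking each spot to its k nearest neighbours."""
    n = coords.shape[0]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(coords).kneighbors(coords)
    A = np.zeros((n, n))                      # dense for clarity; use sparse at scale
    A[np.repeat(np.arange(n), k), idx[:, 1:].ravel()] = 1.0  # skip column 0 (self)
    return np.maximum(A, A.T)                 # symmetrise -> undirected graph


def smooth(X, A):
    """One pass of D^{-1/2} A D^{-1/2} X, with self-loops added (an assumption)."""
    A = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return (d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]) @ X


def refine(labels, coords, k=50):
    """Relabel each spot by majority vote among its k nearest spatial neighbours."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(coords).kneighbors(coords)
    votes = labels[idx[:, 1:]]
    counts = np.apply_along_axis(np.bincount, 1, votes, minlength=labels.max() + 1)
    return counts.argmax(axis=1)


# Toy tissue: two domains split at x = 5, with noisy 5-dimensional "PCA" features.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(400, 2))
truth = (coords[:, 0] > 5).astype(int)
X = rng.normal(size=(400, 5)) + truth[:, None]

H = smooth(X, knn_adjacency(coords))          # spatially smoothed embeddings

vbgmm = BayesianGaussianMixture(
    n_components=4,                           # upper bound, not the final count
    weight_concentration_prior_type="dirichlet_distribution",
    max_iter=500,
    random_state=0,
).fit(H)
proba = vbgmm.predict_proba(H)                # per-spot membership probabilities
labels = refine(proba.argmax(axis=1), coords)
```

Each row of `proba` is the soft assignment that the "uncertainty-aware" description refers to; taking the `argmax` discards that information, so a real analysis might keep `proba` to flag ambiguous boundary spots.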

B. GraphBG-MS (Multi-Slice Analysis)

Designed to handle large datasets spanning multiple tissue sections:

Metacell Aggregation: To reduce complexity, spots within each slice are aggregated into "metacells" using MiniBatch k-Means on graph embeddings.
Batch Correction: Metacell embeddings are harmonized across slices using ComBat to remove technical batch effects while preserving biological structure.
Joint Clustering: A single VB-GMM is applied to the batch-corrected metacells from all slices simultaneously.
Label Propagation: Cluster labels are projected back to individual spots, followed by spatial refinement.
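
The four multi-slice steps above can be illustrated with a short Python sketch. It is a simplified stand-in rather than the GraphBG-MS code: the helper names, toy data, and metacell count are hypothetical, and `center_scale` is only a crude per-slice location and scale adjustment used here in place of ComBat, which additionally applies empirical Bayes shrinkage. The final spatial refinement step is omitted.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.mixture import BayesianGaussianMixture


def to_metacells(H, n_metacells, seed=0):
    """Group one slice's spots into metacells; return centroids and memberships."""
    km = MiniBatchKMeans(n_clusters=n_metacells, n_init=3, random_state=seed)
    member = km.fit_predict(H)
    return km.cluster_centers_, member


def center_scale(block):
    """Crude stand-in for ComBat: remove this slice's own mean and scale."""
    return (block - block.mean(axis=0)) / (block.std(axis=0) + 1e-8)


# Toy data: 3 slices of 300 spots, 2 domains, plus a slice-specific offset.
rng = np.random.default_rng(0)
slices = []
for s in range(3):
    domain = rng.integers(0, 2, size=300)
    H = rng.normal(scale=0.3, size=(300, 4)) + 2.0 * domain[:, None] + 1.5 * s
    slices.append(H)

# Step 1: metacells per slice (30 per slice is an arbitrary choice here).
centroids, members = zip(*(to_metacells(H, n_metacells=30) for H in slices))

# Step 2: harmonise metacell embeddings across slices.
corrected = np.vstack([center_scale(c) for c in centroids])   # shape (90, 4)

# Step 3: one VB-GMM over the metacells of all slices.
meta_labels = BayesianGaussianMixture(
    n_components=4, max_iter=500, random_state=0
).fit_predict(corrected)

# Step 4: project metacell labels back to spots (30 metacells per slice).
spot_labels = [meta_labels[30 * s + m] for s, m in enumerate(members)]
```

Because the mixture model only sees 90 metacells instead of 900 spots, the clustering cost depends on the number of metacells rather than the number of cells, which is the source of the speed-up described above.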

C. GraphBG-MM (Multi-Modal Analysis)

Designed to integrate diverse molecular modalities (e.g., RNA + Protein):

Modality-Specific Encoding: Each modality is processed independently through spectral graph convolutions to generate modality-specific embeddings.
Kernel Canonical Correlation Analysis (KCCA):
- Embeddings from different modalities are aligned into a shared latent space using Kernel Multi-view CCA.
- This maximizes the correlation between modalities in a high-dimensional Reproducing Kernel Hilbert Space (RKHS), effectively fusing complementary signals.
Unified Clustering: The concatenated joint embedding is clustered using VB-GMM.
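
For orientation, the standard regularized two-view kernel CCA objective is shown below. It is the textbook form, given here as an assumption about what the multi-view variant generalizes, not as the exact objective used in the paper. The symbols are introduced for this illustration: $K_1$ and $K_2$ are centered $n \times n$ kernel matrices computed from the two modality-specific embeddings, $\alpha$ and $\beta$ are dual coefficient vectors, and $\kappa > 0$ is a regularization strength.

```latex
\rho \;=\; \max_{\alpha,\,\beta}\;
\frac{\alpha^{\top} K_1 K_2\, \beta}
     {\sqrt{\left(\alpha^{\top} K_1^{2}\,\alpha + \kappa\,\alpha^{\top} K_1\,\alpha\right)
            \left(\beta^{\top} K_2^{2}\,\beta + \kappa\,\beta^{\top} K_2\,\beta\right)}}
```

The projections $K_1 \alpha$ and $K_2 \beta$ give each spot's coordinate along one shared canonical direction; stacking several such directions from both modalities yields a joint embedding of the kind passed to the VB-GMM.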

3. Key Contributions

Unified Scalable Framework: GraphBG addresses the scalability bottleneck by replacing deep learning training with efficient spectral approximations and probabilistic modeling, enabling the processing of >370,000 cells in minutes.
Probabilistic Rigor: The integration of VB-GMM provides a principled approach to clustering that accounts for uncertainty and avoids overfitting, outperforming deterministic GNNs.
Novel Extensions:
- GraphBG-MS: Solves the multi-slice alignment problem via metacell abstraction and joint probabilistic modeling, outperforming divide-and-conquer strategies.
- GraphBG-MM: Introduces KCCA-based fusion for multi-omics, demonstrating that integrating protein and RNA data yields superior spatial domain resolution compared to single-modality approaches.
Open Source: The software is released under an open-source BSD license with code for reproducibility.

4. Results and Performance

The authors benchmarked GraphBG against state-of-the-art methods (Louvain, BayesSpace, GraphST, SpaceFlow, SpatialGlue, scNiche) across diverse datasets.

Single-Slice Accuracy: On the gold-standard 10x Visium DLPFC dataset, GraphBG achieved the highest Normalized Mutual Information (NMI: 0.692) and Homogeneity (HOM: 0.711), outperforming GNN-based baselines like GraphST and SpaceFlow.
Scalability & Runtime:
- On a MERFISH dataset with 378,000 cells, GraphBG completed clustering in 40.66 minutes.
- Competitors fared worse: GraphST crashed at 100k cells due to memory limits (315 GB), and SpaceFlow took 106 minutes.
- On a 31-slice MERFISH dataset (>300k cells), GraphBG-MS finished in 5 minutes, compared to 133 minutes for SpaceFlow-DC and 221 minutes for scNiche.
Multi-Modal Performance:
- On simulated and real CITE-seq (RNA + Protein) data, GraphBG-MM consistently outperformed SpatialGlue.
- It achieved higher Moran's I scores (spatial autocorrelation), indicating better preservation of spatial continuity. For example, in mouse thymus data, GraphBG-MM scored 0.745 vs. 0.161 for SpatialGlue.
Biological Validity (Mouse Liver):
- Applied to mouse liver data, GraphBG-MS successfully recapitulated canonical lobular zonation (periportal vs. pericentral gene expression gradients).
- It identified a broader set of metabolic markers (e.g., Cytochrome P450 enzymes) compared to scNiche.
- In disease models (mTORC1-driven liver failure), it revealed distinct spatial remodeling patterns, capturing systemic and fibrotic responses that other methods missed.
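
The evaluation scores quoted in this section (NMI, homogeneity, Moran's I) can be computed as sketched below. This is an illustrative example on toy data, not the authors' benchmarking code: `morans_i` is a hypothetical helper that uses binary k-nearest-neighbour weights with `k=6`, and treating the labels as numeric values is only reasonable here because the toy example has two domains. How the paper defines its spatial weights is not stated in this summary.

```python
import numpy as np
from sklearn.metrics import homogeneity_score, normalized_mutual_info_score
from sklearn.neighbors import NearestNeighbors


def morans_i(values, coords, k=6):
    """Moran's I with binary weights: w_ij = 1 if j is one of i's k neighbours."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(coords).kneighbors(coords)
    z = values - values.mean()
    # Sum of z_i * z_j over every spot i and its k neighbours j.
    cross = (z[:, None] * z[idx[:, 1:]]).sum()
    total_weight = n * k                      # sum of all w_ij
    return (n / total_weight) * cross / (z ** 2).sum()


# Toy tissue with two domains split at x = 5.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(500, 2))
truth = (coords[:, 0] > 5).astype(int)

# A hypothetical prediction: the truth with 5% of the labels flipped.
pred = truth.copy()
flip = rng.choice(500, size=25, replace=False)
pred[flip] = 1 - pred[flip]

nmi = normalized_mutual_info_score(truth, pred)   # agreement with reference labels
hom = homogeneity_score(truth, pred)              # does each cluster hold one class?
i_pred = morans_i(pred, coords)                   # spatial coherence of the labels
i_shuffled = morans_i(rng.permutation(pred), coords)  # baseline with no structure
```

NMI and homogeneity need reference annotations, which is why they are reported for the annotated DLPFC data, whereas Moran's I only needs coordinates and can be used when no ground truth exists.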

5. Significance

According to the authors, GraphBG represents a significant advancement in spatial omics analysis by addressing the "trilemma" of accuracy, scalability, and integrability.

For Large-Scale Atlases: It enables the analysis of organ-scale datasets (hundreds of thousands of cells) that were previously computationally intractable for many existing tools.
For Multi-Slice Studies: It provides a robust method for generating consistent, comparable domain maps across multiple tissue sections, essential for building comprehensive spatial atlases.
For Multi-Omics: It offers a flexible architecture to leverage complementary information from proteins, chromatin, and RNA, leading to higher-resolution biological insights.

The paper concludes that GraphBG is a general, extensible tool that sets a new standard for spatial domain detection, facilitating next-generation discoveries in developmental biology, tissue atlasing, and clinical diagnostics.