CCIDeconv: Hierarchical model for deconvolution of subcellular cell-cell interactions in single-cell data

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine your body is a bustling, high-tech city. In this city, cells are the citizens, and they constantly send messages to one another to keep everything running smoothly—telling the heart to beat, the skin to heal, or the immune system to fight off invaders. These messages are called Cell-Cell Interactions (CCI).

For a long time, scientists could see that these messages were being sent, but they couldn't see where inside the city hall (the cell) the message was actually being processed. Was it discussed in the main office (the nucleus, where the blueprints are kept)? Or was it handled in the workshop (the cytoplasm, where the work gets done)?

Until now, we've been looking at the city from a drone, seeing the whole building but not the specific rooms.

The Problem: The "Whole Building" Blur

Scientists have powerful tools (like scRNA-seq) that can read the messages inside individual cells. But these tools usually give you a "blurry photo" of the whole cell. They tell you, "Hey, Cell A is talking to Cell B!" but they don't tell you if that conversation happened in the CEO's office or the breakroom.

However, a new technology called subcellular spatial transcriptomics (sST) is like a high-resolution camera that can zoom in and see exactly which room a message is in. The problem? This high-tech camera is expensive, rare, and hard to use on every single tissue sample. Most scientists still only have the "blurry drone photos" (standard cell data).

The Solution: CCIDeconv (The "Room Decoder")

The authors of this paper created a smart computer program called CCIDeconv. Think of it as a super-smart translator or a Sherlock Holmes for cell biology.

Here is how it works, using a simple analogy:

The Training Phase (Learning the Rules):
Imagine you have 9 different "high-resolution" city maps (the 9 sST datasets) where you can clearly see which messages are in the office and which are in the workshop.
CCIDeconv studies these 9 maps intensely. It learns patterns like: "When the Fibroblast talks to the Mast Cell, they usually whisper in the workshop (cytoplasm)." or "When the Macrophage talks to the Cancer Cell, they often shout in the CEO's office (nucleus)."
The Prediction Phase (Solving the Mystery):
Now, you give CCIDeconv a "blurry drone photo" (a standard cell dataset) from a new patient. It doesn't have the high-res camera. But because it learned the rules from the 9 detailed maps, it can guess where the conversation is happening.
It takes the blurry message and says, "Based on what I've learned, this conversation is probably loud in the nucleus and quiet in the cytoplasm," giving a separate estimated strength for each room.

How It Works Under the Hood

The program uses a hierarchical model, which is like a two-step detective process:

Step 1 (The Gatekeeper): It first asks, "Is this message even worth decoding?" Some messages are too faint or messy to figure out. A "voting system" (two different AI models casting a combined vote) makes the call, and if the message is too messy, the Gatekeeper says, "Nope, we can't tell."
Step 2 (The Splitter): If the message is clear, two further models estimate how strong the conversation is in each room, one for the Nucleus and one for the Cytoplasm.

The Big Discovery

The team found that location matters.

Some conversations only happen in the nucleus.
Some only happen in the cytoplasm.
Sometimes, the same pair of cells talks in both places, but the content of the conversation is different depending on the room.

For example, when the tool was applied to lung cancer data, it predicted that certain immune cells and cancer cells were having a meeting in the nucleus that they weren't having in the cytoplasm. This matters because it suggests that some diseases may involve conversations happening in a specific room, not just between specific cells.

Why This Matters

The coolest part of this paper is that you don't need the expensive high-res camera to use this tool.

If you have very little training data, the program needs to know the "spatial coordinates" (the map) to work well.
But, if you train it on many different datasets (like they did with 9 different tissues), it learns the rules so well that it can predict the "room location" even from standard, blurry cell data.

The Takeaway

CCIDeconv is like giving a pair of X-ray glasses to scientists who only have a regular flashlight. It allows them to take standard cell data and estimate whether a biological conversation is happening in the nucleus or in the cytoplasm. This could help researchers understand diseases better, potentially pointing toward new drugs that block specific "bad conversations" happening in a particular room of the cell.

1. Problem Statement

Cell-cell interaction (CCI), also called cell-cell communication, is fundamental to biological processes like development and disease progression. Traditionally, CCI inference relies on single-cell RNA sequencing (scRNA-seq) or spatial transcriptomics (ST) data to identify ligand-receptor (LR) interactions between cell types. However, current methods treat each cell as a homogeneous unit, ignoring the subcellular localization of these interactions.

The Gap: CCI events are often localized to specific subcellular regions (e.g., cytoplasm vs. nucleus). For instance, some signaling occurs at the plasma membrane, while other signaling (such as GPCR signaling or nuclear translocation of receptors) occurs intracellularly.
The Challenge: While emerging subcellular spatial transcriptomics (sST) technologies (e.g., 10X Xenium) can quantify gene expression at the subcellular level, there is a lack of computational methods to deconvolute CCI scores from whole-cell data into specific subcellular compartments (nucleus and cytoplasm). Furthermore, most scRNA-seq datasets lack spatial coordinates, making it difficult to infer subcellular interaction patterns without specialized training.

2. Methodology: CCIDeconv

The authors propose CCIDeconv, a hierarchical supervised machine learning framework designed to predict subcellular CCI (sCCI) scores from whole-cell data.

A. Data Preparation & Score Calculation

Datasets: Nine publicly available sST datasets from the 10X Xenium platform were used, covering various human tissue types.
Subcellular Aggregation: Using the MoleculeExperiment package, data was aggregated into three compartments: Whole Cell, Nucleus, and Cytoplasm (calculated as Cell minus Nucleus).
Communication Score Modification: The authors modified the CellChat communication score formula.
- They removed the agonist/antagonist expression components (setting them to 1) to focus on LR expression.
- Spatial Procedure (SP): Includes spatial distance ($S_{i,j}$) in the score calculation: $C = \frac{L_i R_j}{K_h + L_i R_j S_{i,j}}$, where $L_i$ is ligand expression in sender $i$, $R_j$ is receptor expression in receiver $j$, and $K_h$ is the Hill-function parameter.
- Single-cell Procedure (ScP): Sets spatial distance to 1, effectively removing spatial dependency for non-spatial data application.
- Separate scores were calculated for Cytoplasm-Cytoplasm ($C_{cyt}$) and Nucleus-Nucleus ($C_{nuc}$) interactions.
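As a minimal sketch of the score calculation above, the Python below covers only the single-cell procedure, where $S_{i,j} = 1$ and the score reduces to a Hill-type function $\frac{L_i R_j}{K_h + L_i R_j}$. The function names, the toy expression values, and the choice $K_h = 0.5$ are assumptions made for illustration; they are not taken from the CCIDeconv code.

```python
K_H = 0.5  # ASSUMPTION: Hill parameter value; this summary does not state it


def cytoplasm_expression(whole_cell: float, nucleus: float) -> float:
    """Cytoplasm is derived as whole cell minus nucleus (clipped at zero)."""
    return max(whole_cell - nucleus, 0.0)


def scp_score(ligand: float, receptor: float, k_h: float = K_H) -> float:
    """Single-cell procedure: spatial term fixed to 1, so C = LR / (K_h + LR)."""
    lr = ligand * receptor
    return lr / (k_h + lr)


# Toy values (hypothetical): ligand expression in the sender cell type,
# receptor expression in the receiver cell type, per compartment.
ligand_whole, ligand_nuc = 2.0, 0.5
receptor_whole, receptor_nuc = 1.0, 0.5

ligand_cyt = cytoplasm_expression(ligand_whole, ligand_nuc)        # 1.5
receptor_cyt = cytoplasm_expression(receptor_whole, receptor_nuc)  # 0.5

c_whole = scp_score(ligand_whole, receptor_whole)  # whole-cell score
c_nuc = scp_score(ligand_nuc, receptor_nuc)        # nucleus-nucleus score
c_cyt = scp_score(ligand_cyt, receptor_cyt)        # cytoplasm-cytoplasm score
```

Because the score saturates as $L_i R_j$ grows, the nucleus and cytoplasm scores do not simply add up to the whole-cell score, which is one reason a learned model is used for the deconvolution rather than a fixed split.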

B. Model Architecture

CCIDeconv employs a hierarchical classification and regression pipeline:

Classification Step: A classifier (Voting Classifier of Random Forest and XGBoost) categorizes each detected whole-cell CCI event into:
- Class $o$ (Terminal): Low expression signal; cannot be deconvoluted.
- Class $x$ (Separable): High signal; suitable for deconvolution.
Regression Step: For events in Class $x$, two separate regression models (XGBoost Regressor) predict the specific communication scores for the Nucleus and Cytoplasm.
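The two-stage control flow described above can be sketched in plain Python. This is a structural illustration only: the classifiers and regressors are passed in as ordinary callables, the toy models are hypothetical stand-ins for the Random Forest and XGBoost models, and the use of soft voting (averaging probabilities) with a 0.5 threshold is an assumption, since this summary does not say how the votes are combined.

```python
from statistics import mean
from typing import Callable, Dict, Optional, Sequence

Features = Sequence[float]
ProbModel = Callable[[Features], float]  # returns P(class x | features)
Regressor = Callable[[Features], float]  # returns a predicted score


def soft_vote(models: Sequence[ProbModel], x: Features) -> float:
    """Average the 'separable' probabilities of all classifiers."""
    return mean(m(x) for m in models)


def deconvolve(
    x: Features,
    classifiers: Sequence[ProbModel],
    nucleus_reg: Regressor,
    cytoplasm_reg: Regressor,
    threshold: float = 0.5,  # ASSUMPTION: decision cut-off
) -> Optional[Dict[str, float]]:
    """Stage 1 routes the event to class o or x; stage 2 runs only for x."""
    if soft_vote(classifiers, x) < threshold:
        return None  # class o (terminal): not deconvoluted
    return {"nucleus": nucleus_reg(x), "cytoplasm": cytoplasm_reg(x)}


# Hypothetical toy models; x[0] plays the role of the whole-cell score.
clf_a: ProbModel = lambda x: 0.9 if x[0] > 0.2 else 0.1
clf_b: ProbModel = lambda x: min(1.0, 2.0 * x[0])
nuc_reg: Regressor = lambda x: 0.3 * x[0]
cyt_reg: Regressor = lambda x: 0.7 * x[0]

strong = deconvolve([0.6], [clf_a, clf_b], nuc_reg, cyt_reg)
weak = deconvolve([0.1], [clf_a, clf_b], nuc_reg, cyt_reg)
```

Here `strong` receives a score for each compartment, while `weak` is rejected at the classification step and returns `None`, mirroring the terminal class.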

C. Features & Training

Input Features: Whole-cell communication scores, Hill functions of LR expression, sender/receiver cell types, HGNC symbols, and subcellular/molecular metadata from CellChatDB and the Human Protein Atlas (HPA).
Validation Strategy: Leave-One-Group-Out Cross-Validation (LOGO-CV) across the nine datasets. This ensures the model is tested on unseen tissue types to assess generalizability.
Hyperparameter Tuning: Bayesian optimization was used to tune model parameters.
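To illustrate the validation strategy, here is a small sketch of leave-one-group-out splitting, where each group corresponds to one dataset. The function name and the dataset labels "A", "B", "C" are hypothetical; the authors' actual implementation is not described in this summary.

```python
from typing import Hashable, Iterator, List, Sequence, Tuple


def leave_one_group_out(
    groups: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_group, train_indices, test_indices) once per group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# One label per CCI event, naming the dataset it came from (toy example).
groups = ["A", "A", "B", "C", "C", "C"]
splits = list(leave_one_group_out(groups))

for held_out, train_idx, test_idx in splits:
    print(f"hold out {held_out}: train on {train_idx}, test on {test_idx}")
```

With nine datasets this yields nine folds, and every event is used for testing exactly once, always by a model that never saw its dataset during training.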

3. Key Contributions

Novel Framework: Introduction of CCIDeconv, the first method to deconvolute whole-cell CCI scores into nucleus and cytoplasm-specific scores using a hierarchical ML approach.
Subcellular Resolution: Demonstrated that CCI patterns differ significantly between subcellular regions, identifying unique nucleus-nucleus and cytoplasm-cytoplasm interactions that are masked in whole-cell analyses.
Spatial vs. Non-Spatial Adaptability: Showed that while spatial features improve performance with small training sets, the model achieves robust performance on non-spatial scRNA-seq data when trained on a sufficient number of diverse sST datasets.
Open Source: The code and models are publicly available, enabling researchers to apply subcellular deconvolution to existing scRNA-seq datasets.

4. Key Results

Subcellular Heterogeneity: Analysis of nine sST datasets revealed distinct LR pairs and communication scores between the nucleus and cytoplasm. PCA indicated that the Hill function of LR expression is the primary driver of these region-specific patterns.
Model Performance (LOGO-CV):
- Classification: Achieved a median AUC of 0.79 and macro recall of 0.69.
- Regression: Demonstrated strong predictive power with an $R^2$ of 0.87 for cytoplasm and 0.80 for the nucleus.
- Robustness: The model remained stable across different tissue types. 67.3% of LR pairs were correctly classified across various training combinations.
Impact of Training Data Size:
- With <4 training datasets, models using spatial features (SP) outperformed those without (ScP).
- With >4 training datasets, models trained without spatial features achieved performance comparable to those with spatial features. This implies that CCIDeconv can predict sCCI from standard scRNA-seq data, provided it is trained on a sufficiently large and diverse collection of sST datasets.
Application to Lung Cancer: When applied to a non-spatial lung cancer scRNA-seq dataset, the model successfully recovered known biological patterns (e.g., fibroblast-mast cell contact) and identified novel nuclear communication between mononuclear phagocytes and malignant cells (specifically the FN1-CD44 pair), consistent with known endocytic nuclear translocation mechanisms.
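For readers unfamiliar with the metrics reported above, the sketch below shows how macro recall and $R^2$ are defined, using standard textbook formulas and toy inputs. The function names and example values are illustrative and do not reproduce any number from the paper.

```python
from typing import Hashable, Sequence


def macro_recall(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class recall, so rare classes count equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy classification example with the two classes o and x.
recall = macro_recall(["x", "x", "x", "o"], ["x", "x", "o", "o"])

# Toy regression example for one compartment's predicted scores.
r2 = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

Macro recall averages over classes rather than over events, which makes it a sensible companion to AUC when the terminal and separable classes are imbalanced.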

5. Significance and Future Directions

Biological Insight: CCIDeconv allows researchers to dissect where signaling occurs within a cell, providing deeper insights into disease mechanisms (e.g., distinguishing surface vs. intracellular signaling in cancer or neurodegeneration).
Resource Efficiency: It enables the re-analysis of the vast archive of existing non-spatial scRNA-seq data to infer subcellular interactions without requiring new, expensive sST experiments.
Limitations & Future Work:
- Currently limited to Nucleus and Cytoplasm due to annotation resolution in current sST datasets. Future work aims to extend this to mitochondria or Golgi bodies.
- Relies on transcriptomics as a proxy for protein interactions; the authors note that proteomics data would be ideal but is currently sparse.
- The 10X Xenium panel (5,000 genes) may miss some low-expression signaling genes, potentially limiting the detection of cross-region interactions.

In conclusion, CCIDeconv bridges the gap between high-resolution subcellular spatial data and widely available single-cell transcriptomics, offering a powerful tool to refine our understanding of cellular communication networks.