Partial domain adaptation enables cross domain cell type annotation between scRNA-seq and snRNA-seq

Here is an explanation of the paper "ScNucAdapt: Partial domain adaptation enables cross domain cell type annotation between scRNA-seq and snRNA-seq" using simple language and creative analogies.

The Big Picture: Two Different Ways to Take a "Cell Census"

Imagine you are a detective trying to identify the population of a bustling city. You want to know exactly who lives there: the doctors, the teachers, the artists, and the construction workers.

In the world of biology, scientists use two main tools to take this "census" of cells:

scRNA-seq (Single-cell): This is like interviewing people while they are walking down the street. You get the whole person, but you can only interview people who are willing to walk out of their houses. If a house is locked or the person is too sick to walk out, you miss them.
snRNA-seq (Single-nucleus): This is like looking through the windows of the houses. You can't see the whole person, but you can see the "brain" (the nucleus) inside. This is great for frozen samples or tissues that are too delicate to be taken apart (like a fragile glass sculpture).

The Problem:
These two methods produce different "languages." The street interview (scRNA-seq) might list 10 types of workers, while the window peek (snRNA-seq) might only show 8 types, or describe them slightly differently.

Previously, scientists often had to treat these two datasets as separate worlds. They couldn't easily say, "Oh, this 'Window Person' is the same as that 'Street Person'." This made it hard to combine data from fresh samples and old, frozen samples to get a full picture of health and disease.

The Solution: ScNucAdapt (The Universal Translator)

The authors created a new computer program called ScNucAdapt. Think of it as a super-smart translator that can take the "Street" data and the "Window" data and merge them into one consistent map.

Here is how it works, broken down into three simple steps:

1. The Shared Translator (The Encoder)

Imagine you have two groups of people speaking different dialects. ScNucAdapt first teaches both groups to speak a "Universal Language" (a shared latent space). It strips away the specific quirks of the street interview vs. the window peek and focuses only on the core identity of the cell.

Analogy: It's like translating both English and French into a universal "Emoji" language so everyone understands the core message, regardless of the original dialect.

2. The Dynamic Grouping (Clustering)

Usually, when you try to match two lists, you need to know exactly how many items are on the list. But in biology, we often don't know how many cell types are in the new sample.
ScNucAdapt is smart enough to guess and adjust. It starts by grouping the new cells into piles. Then, it uses a "Split and Merge" strategy:

If a pile looks too messy, it splits it into two smaller piles.
If two piles look exactly the same, it merges them into one.

Analogy: Imagine you are organizing a messy closet. You don't know how many shirts you have. You start by making piles. If a pile has a red shirt and a blue shirt, you split them. If you find two piles of identical blue shirts, you merge them. You keep doing this until the piles are perfect.

3. The "Partial" Match (Partial Domain Adaptation)

This is the most important trick. Sometimes, the "Street" list has 10 job types, but the "Window" list only has 8. The other 2 job types simply don't exist in the window view.
Old methods tried to force a match, which caused confusion (like trying to match a "Plumber" to a "Gardener" just because they were the closest option).
ScNucAdapt uses Partial Domain Adaptation. It says: "I will only match the 8 types that exist in both lists. I will ignore the 2 types that are unique to the Street list so they don't mess up the matching."

Analogy: Imagine you are matching socks from two different drawers. One drawer has 10 pairs, the other has 8. ScNucAdapt finds the 8 matching pairs and leaves the 2 extra pairs in the first drawer alone, rather than forcing them to match with the wrong socks.

Why Does This Matter?

It Saves Frozen Samples: Scientists have warehouses full of frozen tissue samples (snRNA-seq) that were previously hard to analyze. Now, they can combine them with fresh data to get a bigger, better picture.
It Finds Rare Cells: Some cells are so fragile they break during the "street interview" (scRNA-seq). But the "window peek" (snRNA-seq) catches them. ScNucAdapt helps us identify these rare cells by comparing them to known data.
It's Accurate: The paper tested this on bladder, kidney, brain, and tumor tissues. In almost every test, ScNucAdapt was more accurate than existing methods, correctly identifying cell types even when the data was messy or incomplete.

The Bottom Line

ScNucAdapt is like a master bridge-builder. It connects two different islands of biological data (fresh cells and frozen cells) that were previously isolated. By using a smart translator and a flexible grouping system, it allows scientists to finally see the whole city of our bodies, leading to better understanding of diseases and new discoveries in medicine.

Here is a detailed technical summary of the paper "Partial domain adaptation enables cross domain cell type annotation between scRNA-seq and snRNA-seq".

1. Problem Statement

Single-cell RNA sequencing (scRNA-seq) and single-nucleus RNA sequencing (snRNA-seq) are complementary technologies. While scRNA-seq profiles whole cells, snRNA-seq is essential for analyzing frozen tissues or fragile cell types that are difficult to dissociate. However, integrating these two modalities for cross-domain cell type annotation remains a significant challenge due to:

Distributional Differences: Systematic biases (batch effects) exist between scRNA-seq and snRNA-seq data due to different biological capture mechanisms (whole cell vs. nucleus).
Label Space Mismatch (Partial Domain Adaptation): In many real-world scenarios, the target dataset contains only a subset of the cell types found in the source dataset, regardless of which modality (scRNA-seq or snRNA-seq) serves as the source. Traditional domain adaptation methods assume identical label spaces, leading to "negative transfer" where irrelevant source classes degrade target performance.
Unknown Cell Composition: The number of cell types in the target dataset is often unknown prior to annotation.

Existing methods typically treat these datasets independently or fail to handle the partial overlap of cell types effectively.

2. Methodology: ScNucAdapt

The authors propose ScNucAdapt, a deep learning framework based on Partial Domain Adaptation (PDA). The framework is designed to handle both paired and unpaired datasets with unknown target label spaces.

Core Architecture

Shared Encoder:
- A shared neural network (two fully connected layers) extracts features from both source (labeled) and target (unlabeled) datasets, projecting them into a common latent space.
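
The idea of a shared encoder can be illustrated with a minimal sketch. It assumes a ReLU activation after the first layer, toy layer sizes, and random untrained weights, none of which come from the paper; it is written in plain Python only to stay dependency-free. The point it shows is that one set of weights maps cells from both domains into the same latent space.

```python
import random

def make_layer(n_in, n_out, rng):
    """One fully connected layer: small random weights, zero biases (illustrative only)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def linear(x, layer):
    """Apply y = W x + b to a single expression vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def encode(x, layers):
    """Two fully connected layers; the ReLU in between is an assumption."""
    hidden = relu(linear(x, layers[0]))
    return linear(hidden, layers[1])

rng = random.Random(0)
N_GENES, N_HIDDEN, N_LATENT = 6, 4, 2  # hypothetical toy sizes
encoder = [make_layer(N_GENES, N_HIDDEN, rng), make_layer(N_HIDDEN, N_LATENT, rng)]

source_cell = [1.0, 0.0, 2.0, 0.0, 0.5, 3.0]  # made-up scRNA-seq profile
target_cell = [0.8, 0.1, 1.5, 0.0, 0.2, 2.5]  # made-up snRNA-seq profile

# The same encoder object is used for both domains.
z_source = encode(source_cell, encoder)
z_target = encode(target_cell, encoder)
print(z_source, z_target)
```

Because both calls use the same `encoder`, any later alignment loss changes one set of weights that affects both domains at once.
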
Dynamic Clustering in Target Data:
- Since the number of cell types in the target is unknown, ScNucAdapt employs a dynamic clustering mechanism inspired by DeepDPM and PRAGA.
- It uses a Gaussian Mixture Model (GMM) with a split-and-merge framework (based on Metropolis-Hastings criteria) to automatically adjust the number of clusters.
- Clusters are split if the likelihood ratio exceeds a threshold and merged if the resulting cluster improves the marginal likelihood. This allows the model to discover the true number of cell types in the target domain without prior knowledge.
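
To make the split and merge decisions concrete, here is a deliberately simplified stand-in. It is not the Metropolis-Hastings criterion of DeepDPM or the paper's implementation: it works on 1-D points, proposes a split at the cluster mean, and compares plain maximum-likelihood Gaussian fits against a fixed threshold. The function names and the threshold value of 5.0 are hypothetical.

```python
import math

def gaussian_loglik(xs, var_floor=1e-6):
    """Log-likelihood of 1-D points under a Gaussian fitted to those same points."""
    n = len(xs)
    mean = sum(xs) / n
    var = max(sum((x - mean) ** 2 for x in xs) / n, var_floor)  # floor avoids log(0)
    return sum(-0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var) for x in xs)

def two_vs_one_log_ratio(left, right):
    """Log-likelihood gain from describing the points as two clusters instead of one."""
    n = len(left) + len(right)
    ll_two = sum(
        gaussian_loglik(part) + len(part) * math.log(len(part) / n)  # mixing-weight term
        for part in (left, right)
    )
    return ll_two - gaussian_loglik(left + right)

def should_split(cluster, threshold=5.0):
    """Propose a split at the cluster mean; accept if the gain exceeds the threshold."""
    mean = sum(cluster) / len(cluster)
    left = [x for x in cluster if x <= mean]
    right = [x for x in cluster if x > mean]
    if not left or not right:
        return False
    return two_vs_one_log_ratio(left, right) > threshold

def should_merge(cluster_a, cluster_b, threshold=5.0):
    """Merge when keeping the two clusters apart gains too little likelihood."""
    return two_vs_one_log_ratio(cluster_a, cluster_b) <= threshold

two_groups = [0.0, 0.2, -0.2, 0.1, -0.1, 5.0, 5.2, 4.8, 5.1, 4.9]  # made-up, clearly bimodal
one_group = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]     # made-up, evenly spread

print(should_split(two_groups), should_split(one_group))
print(should_merge(one_group[:5], one_group[5:]), should_merge(two_groups[:5], two_groups[5:]))
```

A hard split always raises the fitted likelihood somewhat, so the threshold is what keeps a single coherent cluster from being divided. The Bayesian marginal likelihood used in split-and-merge samplers plays a similar role through its prior.
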
Source Class-Target Cluster Matching:
- To align the source and target domains while mitigating negative transfer, the method uses Cauchy-Schwarz (CS) Divergence.
- It calculates the CS divergence between source class distributions and target cluster distributions.
- Matching Strategy: For each target cluster, the algorithm selects the source class with the lowest CS divergence. Source classes that no target cluster selects are left out of the alignment, so only cell types judged to be shared are aligned and non-overlapping source classes are ignored.
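
The matching step can be sketched as follows. The sketch assumes a standard Gaussian-kernel sample estimator of CS divergence with a hand-picked bandwidth `sigma`; the paper may estimate the divergence differently (for example from the fitted Gaussian components). The latent coordinates and cell type names are made up for illustration.

```python
import math

def mean_kernel(xs, ys, sigma):
    """Average Gaussian kernel value over all pairs (x, y)."""
    total = 0.0
    for x in xs:
        for y in ys:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / (len(xs) * len(ys))

def cs_divergence(xs, ys, sigma=1.0):
    """Kernel estimate of -log( (int p q)^2 / (int p^2 * int q^2) )."""
    cross = max(mean_kernel(xs, ys, sigma), 1e-300)  # floor guards against log(0)
    within_x = mean_kernel(xs, xs, sigma)
    within_y = mean_kernel(ys, ys, sigma)
    return math.log(within_x) + math.log(within_y) - 2.0 * math.log(cross)

def match_clusters(source_classes, target_clusters, sigma=1.0):
    """For each target cluster, pick the source class with the lowest CS divergence."""
    return {
        cluster_id: min(
            source_classes,
            key=lambda name: cs_divergence(source_classes[name], points, sigma),
        )
        for cluster_id, points in target_clusters.items()
    }

# Made-up 2-D latent embeddings.
source_classes = {
    "T cell": [(0.0, 0.0), (0.2, 0.1), (-0.1, 0.1)],
    "B cell": [(5.0, 5.0), (5.1, 4.9), (4.9, 5.2)],
    "NK cell": [(-5.0, 5.0), (-5.1, 5.2), (-4.8, 4.9)],  # absent from the target
}
target_clusters = {
    0: [(0.1, 0.0), (0.0, 0.2)],
    1: [(5.0, 5.1), (5.2, 5.0)],
}

matches = match_clusters(source_classes, target_clusters)
unmatched = set(source_classes) - set(matches.values())
print(matches, unmatched)
```

In this toy setup the source class that no target cluster selects ends up in `unmatched`, which is the mechanism by which a source-only cell type stays out of the alignment loss.
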
Training Strategy:
- Two-Stage Training:
  - Warm-up: The encoder is trained for $T$ epochs using only classification loss on the source data to learn meaningful initial features.
  - Refinement: The model enters a loop where it performs GMM clustering, split/merge operations, and source-target matching on the target data. The encoder is updated via backpropagation using a combined loss function:
    $L = L_{cls} + \lambda \cdot L_{cs}$
    where $L_{cls}$ is the weighted cross-entropy loss for the source, and $L_{cs}$ is the CS divergence loss between matched source-target pairs.
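
For reference, the two terms can be written out as follows. The CS divergence definition is the standard one; the per-class weights $w_c$, the predicted probability $\hat{p}$, and the plain sum over the set $M$ of matched (source class, target cluster) pairs are assumptions about the exact form, which the summary above does not specify.

$$
D_{CS}(p, q) = -\log \frac{\left( \int p(z)\, q(z)\, dz \right)^2}{\int p(z)^2\, dz \, \int q(z)^2\, dz}
$$

$$
L_{cls} = -\frac{1}{N_s} \sum_{i=1}^{N_s} w_{y_i} \log \hat{p}(y_i \mid x_i), \qquad
L_{cs} = \sum_{(c, k) \in M} D_{CS}(p_c, q_k)
$$

Here $p_c$ is the latent distribution of source class $c$, $q_k$ that of target cluster $k$, and $N_s$ the number of source cells. By the Cauchy-Schwarz inequality, $D_{CS}(p, q) \geq 0$ with equality only when $p = q$, so minimizing $L_{cs}$ pulls each matched pair of distributions together.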

3. Key Contributions

First Cross-Modal PDA Framework: The authors present ScNucAdapt as the first method specifically designed for cross-annotation between scRNA-seq and snRNA-seq, addressing both distributional shifts and partial label space mismatches.
Dynamic Cluster Discovery: Unlike methods requiring pre-defined cluster numbers, ScNucAdapt dynamically determines the number of cell types in the target dataset through a split-and-merge mechanism.
Negative Transfer Mitigation: By utilizing CS Divergence for selective matching, the method effectively filters out non-overlapping cell types, preventing the "negative transfer" common in standard domain adaptation.
Robustness: The framework is validated on both paired and unpaired datasets across diverse tissue types (bladder, kidney, tumors, mouse cortex).

4. Experimental Results

The authors evaluated ScNucAdapt on eight cross-domain tasks involving bladder, kidney, tumor (metastatic breast cancer and CLL), and mouse cortical tissues.

Baselines: The method was compared against SingleCellNet, ScMap, and ScAdapt (a domain adaptation baseline).
Accuracy: ScNucAdapt matched or outperformed the baselines in almost every task. Reported figures include:
- Bladder Immune (Partial): 91.05% accuracy (vs. 90.24% for ScAdapt).
- Kidney (Partial): 87.23% accuracy (vs. 84.01% for ScAdapt).
- Tumor (CLL): 98.39% accuracy.
- Mouse Cortex: Achieved up to 100% accuracy in snRNA-seq to scRNA-seq transfer.
Macro-F1 Score: ScNucAdapt achieved higher Macro-F1 scores than the baselines, indicating better performance on minority cell types.
Visualization: UMAP plots demonstrated that ScNucAdapt successfully merged scRNA-seq and snRNA-seq batches while maintaining clear separation by cell type.
Ablation Studies: Removing either the CS divergence or the dynamic clustering module resulted in significant performance drops, confirming the necessity of both components.
Sensitivity Analysis: The model was found to be robust to variations in the initial cluster number ($C$) and the trade-off hyperparameter ($\lambda$).
Scalability: Memory consumption scaled linearly with dataset size (up to ~16k cells), though runtime is currently bottlenecked by the GMM split/merge operations performed every epoch.
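
The reason Macro-F1 is reported alongside accuracy can be seen in a small example. The labels below are made up and unrelated to the paper's datasets; the sketch averages per-class F1 over the classes present in the true labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

# Made-up labels: 8 cells of a common type, 2 of a rare type.
y_true = ["common"] * 8 + ["rare"] * 2
# A classifier that never predicts the rare type.
y_pred = ["common"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

A classifier that ignores the rare type still reaches high accuracy, while its Macro-F1 is cut roughly in half because every class counts equally in the average.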

5. Significance and Future Work

Significance:
ScNucAdapt provides a practical and robust framework for integrating historical frozen samples (snRNA-seq) with fresh tissue data (scRNA-seq). This capability is crucial for:

Unifying cellular identities across different experimental protocols.
Enabling the study of rare or fragile cell types that are only accessible via snRNA-seq.
Accelerating discoveries in disease progression by allowing the reuse of archived datasets.

Limitations & Future Directions:

Label Noise: The method assumes clean source labels; noisy labels could degrade performance.
Novel Cell Type Discovery: Current PDA assumes target types are a subset of source types. Future work should integrate Open-Set Domain Adaptation to detect novel cell types in the target domain.
Heterogeneous Features: The current method assumes shared gene sets. Future iterations could address Heterogeneous Domain Adaptation where gene sets differ significantly between modalities.
Overfitting & Imbalance: Further improvements are needed to handle high-dimensional sparsity and severe class imbalances within domains.

In conclusion, ScNucAdapt represents a significant advancement in single-cell analysis, offering a flexible solution for the complex task of cross-modal cell type annotation.