scDEcrypter: Uncertainty-aware differential expression analysis for viral infection in scRNA-seq

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery in a crowded city. The city is your body, the people are your cells, and a virus is an intruder trying to sneak in and cause chaos.

Your goal is to find out exactly what the virus is doing to the city's citizens. You have a massive list of notes (data) from every single person in the city, but there's a huge problem: most of the notes don't say who is actually infected.

The Problem: The "Invisible" Virus

In the world of single-cell RNA sequencing (scRNA-seq), scientists try to read the "notes" (genetic instructions) of individual cells to see how they react to a virus.

However, viruses are sneaky:

They hide: Sometimes a cell is infected, but the virus doesn't leave enough "footprints" (viral genetic material) for the scientists to see.
The Bystander Effect: Some uninfected cells are just standing next to the infected ones, reacting to the noise and panic. They look like they are part of the problem, but they aren't.
The Labeling Gap: Scientists usually only label a tiny fraction of cells as "definitely infected" because they found clear footprints. The rest are a mystery.

If you try to solve the mystery using only the few "definitely infected" people you know, you miss the bigger picture. If you try to guess who is infected based on who looks suspicious, you might accuse innocent bystanders.

The Solution: scDEcrypter

The authors of this paper created a new tool called scDEcrypter. Think of it as a super-smart, probabilistic detective that doesn't just look for footprints; it looks at the whole neighborhood.

Here is how it works, using a simple analogy:

1. The "Training Class" vs. The "Exam" (Data Splitting)

Imagine you are teaching a class of students (together, the model) to identify infected people (the cells).

The Training Set: You show the students a group of people where you know for sure who is infected and who is not. You teach them the patterns: "Look at the eyes, the posture, the nervousness."
The Test Set: You then give them a new group of people where you don't know who is infected.
The Rule: You make sure the students never cheat by looking at the answers while they are being tested. This prevents them from just memorizing the specific people they saw in the training class.

2. The "Fuzzy" Labels (Partial Observability)

Old methods were like a strict teacher who said, "If you aren't 100% sure, you can't count this person."
scDEcrypter is more like a wise counselor. It says, "I'm not 100% sure this person is infected, but they have a 70% chance of being infected."
Instead of forcing a "Yes/No" label, it assigns a probability score (a weight) to every single cell. It acknowledges uncertainty. "This cell is likely infected, that one is a bystander, and this one is definitely healthy."

3. The "Two-Way" Mix (The Mixture Model)

A virus doesn't affect every cell the same way. It might act differently in a lung cell than in a skin cell.
scDEcrypter looks at two things at once:

Who are you? (Cell Type: Lung, Skin, Immune cell?)
What is your status? (Infected, Bystander, Healthy?)

It creates a "mix" of possibilities. It asks: "If I am a Lung cell, what is the probability I am infected? If I am a Skin cell, what is the probability?" It uses the few people it knows are infected to teach the model how to recognize the rest of the infected people, even if they are hiding.

Why This Matters: The Results

The authors tested this tool on real data from Flu and SARS-CoV-2 infections.

Finding the Hidden: In the flu data, traditional methods labeled only about 5% of cells as infected, while scDEcrypter estimated that about 24% were. It realized that many cells were infected even though they didn't have enough viral footprints to be "labeled" by old rules.
Separating the Noise: It successfully told the difference between a cell that was actually infected and a "bystander" cell that was just panicking.
Better Clues: Because it found more infected cells, it could identify the specific genes the virus was hijacking much better than other methods. It found biological pathways (like how the virus steals the cell's protein-making machinery) that other tools missed.

The Bottom Line

scDEcrypter is a new way to analyze viral infections in single cells that admits, "We don't know everything, but we can make a very educated guess."

Instead of throwing away the "mystery" cells because they lack clear labels, this tool uses math to estimate their status based on the patterns of the cells we do know. It turns a blurry, confusing picture into a sharp, high-definition map of how a virus attacks our bodies, helping scientists understand the disease better and potentially find better treatments.

1. Problem Statement

Single-cell RNA sequencing (scRNA-seq) is a powerful tool for studying viral infections, but it faces significant challenges in accurately identifying infected cells and performing robust differential expression (DE) analysis:

Sparse Viral Reads: Viral transcripts are often low in abundance, leading to under-detection.
Under-labeling: Standard pipelines often label only a tiny fraction of cells as infected (e.g., from under 1% to 5%) based on strict viral read thresholds, discarding many truly infected cells with low viral loads.
Bystander Effects: Uninfected "bystander" cells respond to signals from infected cells, exhibiting transcriptional profiles similar to infected cells, which confounds DE analysis.
Limitations of Existing Methods: Current DE tools (e.g., Seurat, MAST, DESeq2) typically require fully defined group labels. Methods designed for uncertainty (e.g., scANVI, GEDI) often struggle with complex experimental designs involving multiple variables (e.g., infection status + cell type) or lack interpretability.
Double-Dipping: Many approaches use the same data for both inferring cell states and testing for DE, leading to overfitting and biased inference.

2. Methodology: scDEcrypter

The authors propose scDEcrypter, a penalized two-way mixture model framework designed to handle partial observability of cell states.

Core Workflow

Data Splitting: The dataset is split into a Generation (Training) Set and a Test Set. This separation prevents "double-dipping," ensuring that parameter estimation and inferential testing are independent.
Pre-processing:
- Partial Labeling: Requires partial labels for infection status (e.g., confidently infected/uninfected based on viral reads) and an additional partitioning variable (usually cell type), which can be fully or partially known.
- Normalization: Independent normalization and variance-stabilizing transformation (using transformGamPoi) are applied to both sets.
- Feature Selection: The Generation set uses a small subset of Highly Variable Genes (HVGs) for model training to reduce overfitting. The Test set uses a larger HVG set for downstream DE analysis.
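The splitting and feature-selection steps above can be sketched in a few lines. This is a minimal illustration, not the package's code: the 50/50 split, the helper names `split_cells` and `top_variable_genes`, the use of raw variance as a stand-in for HVG ranking, and the gene counts are all assumptions chosen for the example, and normalization with transformGamPoi is not reproduced here.

```python
import random
from statistics import pvariance

def split_cells(cell_ids, generation_fraction=0.5, seed=0):
    """Randomly partition cells into disjoint Generation and Test sets."""
    rng = random.Random(seed)
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    n_gen = int(round(generation_fraction * len(shuffled)))
    return shuffled[:n_gen], shuffled[n_gen:]

def top_variable_genes(expr, cells, n_genes):
    """Rank genes by variance across the given cells and keep the top n_genes.

    expr maps gene name -> {cell id: normalized expression}.
    Plain variance is a simplified proxy for a real HVG procedure.
    """
    variances = {g: pvariance([vals[c] for c in cells]) for g, vals in expr.items()}
    return sorted(variances, key=variances.get, reverse=True)[:n_genes]

# Toy data: 20 cells, 6 genes with increasing spread (hypothetical values).
toy_rng = random.Random(1)
cells = [f"cell{i}" for i in range(20)]
expr = {f"gene{j}": {c: toy_rng.gauss(0, 1 + j) for c in cells} for j in range(6)}

generation_cells, test_cells = split_cells(cells)

# Small gene set for model training, larger set for downstream DE testing.
# Each set is ranked using only its own cells, so no information crosses the split.
train_genes = top_variable_genes(expr, generation_cells, n_genes=2)
test_genes = top_variable_genes(expr, test_cells, n_genes=4)
```

Ranking genes separately within each set keeps the two halves independent, which is the point of the split.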
Modeling (Penalized Two-Way Mixture Model):
- Latent Variables: Infection status ($V$) and cell type ($C$) are modeled as latent variables.
- Distribution: Gene expression $Y_{ij}$ for cell $i$ and gene $j$, given cell type $c$ and viral state $v$, is assumed to follow a Normal distribution: $(Y_{ij} \mid C_i=c, V_i=v) \sim N(\mu_{jcv}, \sigma^2_{jcv})$.
- Penalized Likelihood: The model employs a penalized maximum likelihood estimator using an Expectation-Maximization (EM) algorithm. A specific penalty term is applied to the mean parameters ($\mu$) to encourage "equi-sparsity." This penalty shrinks mean estimates toward a vector of ones if a gene is not differentially expressed across viral states within a cell type, effectively performing variable selection.
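As a hedged illustration of how probabilistic cell-state weights arise in a mixture of this form, the standard EM E-step is shown below. It assumes mixing proportions $\pi_{cv}$ and conditional independence of genes given the state, neither of which is stated in this summary, and it omits the penalty term, whose exact form is not given here.

$$
w_{icv} = \frac{\pi_{cv} \prod_{j} \phi(Y_{ij}; \mu_{jcv}, \sigma^2_{jcv})}{\sum_{c', v'} \pi_{c'v'} \prod_{j} \phi(Y_{ij}; \mu_{jc'v'}, \sigma^2_{jc'v'})}
$$

Here $\phi(\cdot; \mu, \sigma^2)$ is the Normal density and $w_{icv}$ is the posterior probability that cell $i$ has cell type $c$ and viral state $v$. For a cell with a known label, one natural convention is to restrict the nonzero weights to the state combinations consistent with that label.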
Inference on Test Set:
- Parameters estimated from the Generation set are fixed.
- Weight Estimation: Cell-state weights (probabilistic membership to infection/cell type combinations) are inferred for the Test set cells.
- Differential Expression: A likelihood ratio test (LRT) is performed using the inferred weights to test for DE between infection states (e.g., Infected vs. Uninfected, or Infected vs. Bystander) within each cell type. The weights account for the uncertainty of each cell's state.
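A weighted likelihood ratio test of this kind can be sketched for a single gene within one cell type. The version below is a simplified stand-in, not the authors' test: it compares two states only, uses a variance shared across states (the model above allows state-specific variances), and refers the statistic to a chi-square distribution with 1 degree of freedom, which is only approximate when weights are soft. The function name `weighted_lrt` and the toy values are hypothetical.

```python
import math

def weighted_lrt(y, w_infected):
    """Weighted LRT for one gene: common mean (null) vs. state-specific means.

    y          : expression values, one per cell
    w_infected : inferred probability that each cell is infected;
                 1 - w is used as the weight for the uninfected state.
    Assumes the weights are not all 0 or all 1 and that y is not constant.
    """
    n = len(y)
    w_uninfected = [1.0 - w for w in w_infected]

    # Null model: one mean and one variance for all cells.
    mu_null = sum(y) / n
    var_null = sum((yi - mu_null) ** 2 for yi in y) / n

    # Alternative: weighted mean per state, variance shared across states.
    mu_inf = sum(w * yi for w, yi in zip(w_infected, y)) / sum(w_infected)
    mu_un = sum(w * yi for w, yi in zip(w_uninfected, y)) / sum(w_uninfected)
    var_alt = sum(
        w * (yi - mu_inf) ** 2 + (1.0 - w) * (yi - mu_un) ** 2
        for w, yi in zip(w_infected, y)
    ) / n

    # With Normal likelihoods evaluated at their (weighted) MLEs,
    # 2 * (loglik_alt - loglik_null) reduces to n * log(var_null / var_alt).
    stat = n * math.log(var_null / var_alt)

    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Toy gene: higher expression in cells that are probably infected.
y = [5.1, 4.8, 5.3, 2.0, 2.2, 1.9, 3.5]
w = [0.95, 0.90, 0.85, 0.05, 0.10, 0.10, 0.50]
stat_de, p_de = weighted_lrt(y, w)

# Uninformative weights: both state means equal the overall mean, so the
# statistic is (numerically) zero and the p-value is close to 1.
stat_flat, p_flat = weighted_lrt(y, [0.5] * len(y))
```

The ambiguous cell (weight 0.5) contributes equally to both state means instead of being discarded or forced into one group, which is how the weights carry each cell's uncertainty into the test.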

3. Key Contributions

Uncertainty-Aware Framework: Unlike standard methods that force binary labels, scDEcrypter uses probabilistic weights to account for the uncertainty of infection status, particularly for cells with low viral reads or bystander effects.
Handling Partial Labels: The method leverages a small set of confidently labeled cells to "anchor" the latent states, allowing the model to infer states for the vast majority of unlabeled cells.
Multi-Variable Integration: It simultaneously models infection status and cell type (or other partitioning variables), addressing complex experimental designs where existing tools fail.
Statistical Rigor: By implementing data splitting and penalized likelihood, the method mitigates overfitting and avoids the bias inherent in double-dipping approaches common in single-cell analysis.
Biological Interpretability: The model outputs biologically coherent pathways and identifies infection-specific genes that are often missed by threshold-based methods.

4. Results

The authors validated scDEcrypter through simulations and real-world applications on Influenza and SARS-CoV-2 datasets.

Simulation Studies

Accuracy: scDEcrypter achieved a balanced accuracy of 88.1% for infection state prediction across various scenarios. In moderate-to-large fold-change scenarios, accuracy reached 94.5%.
DE Performance: It achieved an average balanced accuracy of 90.7% in identifying infection-associated genes.
Comparison: scDEcrypter consistently outperformed Seurat, MAST, and DESeq2 in balanced accuracy and F1 scores, particularly in scenarios with low pre-labeling proportions (e.g., 1% labeled cells) and subtle fold-changes. It demonstrated higher robustness and lower variability than competing methods.
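Balanced accuracy and F1 are used here because infected cells are rare, so plain accuracy can look good for a method that finds nothing. The sketch below uses the standard definitions (balanced accuracy is the mean of sensitivity and specificity; F1 is 2TP / (2TP + FP + FN)) on a made-up example of 100 cells with 5 infected. The data and helper names are illustrative and are not taken from the paper's simulations.

```python
def confusion_counts(truth, predicted):
    """Count TP, FP, TN, FN for boolean labels (True = infected)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return tp, fp, tn, fn

def balanced_accuracy(truth, predicted):
    """Mean of sensitivity (recall on infected) and specificity."""
    tp, fp, tn, fn = confusion_counts(truth, predicted)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2

def f1_score(truth, predicted):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    tp, fp, tn, fn = confusion_counts(truth, predicted)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# 100 cells, 5 truly infected; a predictor that calls every cell uninfected.
truth = [True] * 5 + [False] * 95
all_negative = [False] * 100

plain_accuracy = sum(t == p for t, p in zip(truth, all_negative)) / len(truth)
print(plain_accuracy)                          # 0.95
print(balanced_accuracy(truth, all_negative))  # 0.5
print(f1_score(truth, all_negative))           # 0.0
```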

Application 1: Influenza Time-Course (Russell et al.)

Infection Recovery: While standard viral read counts identified 5% of cells as infected, scDEcrypter inferred that 24% of cells were highly likely to be infected, aligning better with the study's Multiplicity of Infection (MOI) of 0.3.
DE Genes: scDEcrypter identified 3,073 shared DE genes across time points. In contrast, Seurat identified only 5 shared human genes, and scANVI identified only 13.
Biological Validation: It recovered 34/36 known DE genes from the original study and 55/61 curated viral replication genes (vs. 13/61 by Seurat).
Pathways: Enrichment analysis correctly identified translation/ribosome pathways (hijacking host synthesis) and specific influenza infection pathways. It also successfully tracked temporal trends, identifying genes like AIFM1 (increasing) and OAS3/IFNGR1 (decreasing).
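The comparison with the MOI in the Infection Recovery item can be made concrete with a back-of-the-envelope calculation. Under the usual simplifying assumption that the number of infectious particles entering each cell is Poisson-distributed with mean equal to the MOI (an assumption added here, not a statement from the paper), the expected fraction of cells receiving at least one particle is:

$$
P(\text{infected}) = 1 - e^{-\text{MOI}} = 1 - e^{-0.3} \approx 0.26
$$

An expected rate of roughly 26% is much closer to the 24% inferred by scDEcrypter than to the 5% obtained from viral read thresholds.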

Application 2: SARS-CoV-2 (Ravindra et al.)

State Distinction: The model distinguished between three states: Infected, Bystander (reactive but uninfected), and Uninfected.
Infection Rates: It estimated infection rates of 13%, 17%, and 24% at days 1, 2, and 3, respectively, which are consistent with SARS-CoV-2 replication kinetics, whereas standard thresholds yielded only 8.6%.
Cell-Type Specificity: It revealed that ciliated cells were the primary targets early on, while club cells shifted to a bystander state by day 2.
DE Analysis:
- Infected vs. Bystander: Enriched for viral infection pathways and heat-shock factors (e.g., HSF1), known to be hijacked by coronaviruses.
- Bystander vs. Uninfected: Enriched for stress responses, immune surveillance, and mitochondrial energy production.

5. Significance

scDEcrypter addresses a recurring limitation in the analysis of viral scRNA-seq data. By explicitly modeling uncertainty and leveraging partial labels, it works around the "sparse read" bottleneck that limits threshold-based analyses.

Power: It significantly increases statistical power to detect infection-associated genes that would otherwise be lost due to strict filtering.
Biological Insight: It enables the disentanglement of direct viral effects from bystander responses, providing a clearer picture of host-pathogen interactions.
Generalizability: While designed for viral infections, the framework is applicable to any biological context with partially latent group labels (e.g., cancer subclones, drug-resistant populations, or immune activation states), provided a small set of confidently labeled samples exists.

The method is implemented as an R package available on GitHub, offering a statistically principled solution for robust, interpretable inference in complex single-cell datasets.