This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.
The "Noisy Room" Problem in Single-Cell Biology
Imagine you are trying to listen to a specific conversation in a crowded, noisy room. In the world of biology, scientists are trying to listen to the "whispers" of individual cells (their genetic instructions) to understand how they work. This is called single-cell RNA sequencing.
However, there's a problem: before the scientists can listen, some cells in the room break open (lyse) and spill their genetic "trash" (ambient RNA) into the air. When they try to listen to a healthy cell, they accidentally pick up some of that spilled trash too. It's like trying to hear a friend speak while someone else is shouting nearby; the friend's voice gets muddled with the background noise.
To fix this, scientists use computer programs (tools) to try to "clean up" the audio, removing the background noise so they can hear the cells clearly. But here's the catch: some of these cleaning tools are so aggressive they start inventing new voices that never existed.
The Great Tool Showdown
The authors of this paper decided to put six of the most popular "noise-cleaning" tools to the test. They treated them like contestants in a reality TV competition, giving them different challenges to see who performed best.
Here is how they tested them, using simple analogies:
1. The "Species Mix" Test (The Ground Truth)
The Setup: Imagine a room with two distinct groups of people: Humans and Mice. They are all talking at once.
The Goal: The computer tools need to clean the audio so that when they hear a "Human" voice, it's 100% human, and when they hear a "Mouse" voice, it's 100% mouse.
The Result:
- CellBender and SoupX were like skilled sound engineers. They removed the background noise effectively without changing the original voices. They kept the data honest.
- DecontX was a bit too cautious. It left some noise behind, but it didn't mess up the voices.
- scAR and CellClear were the troublemakers. They removed the noise, but in doing so, they started hallucinating. They took the silence and filled it with fake voices. They made it sound like there were new types of people in the room (like "Martians") that weren't there at all.
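To make the "Species Mix" test concrete: in this kind of experiment (often called a "barnyard" experiment), every gene in a cell's data can be traced to either the human or the mouse genome, so a clean cell should look nearly 100% one species. Here is a minimal, hypothetical sketch of that purity check; the gene names and numbers are illustrative, not from the paper.

```python
# Hypothetical sketch (not the paper's actual pipeline): score one cell
# from a human/mouse mixing experiment. Counts are split by the species
# their genes map to; a well-cleaned cell should be close to 100% pure.

def species_purity(counts, gene_species):
    """counts: dict gene -> UMI count for one cell.
    gene_species: dict gene -> 'human' or 'mouse'.
    Returns (assigned species, fraction of counts from that species)."""
    human = sum(c for g, c in counts.items() if gene_species.get(g) == "human")
    mouse = sum(c for g, c in counts.items() if gene_species.get(g) == "mouse")
    total = human + mouse
    if total == 0:
        return None, 0.0
    label = "human" if human >= mouse else "mouse"
    return label, max(human, mouse) / total

# Example: a mostly human cell with a little ambient mouse RNA mixed in.
cell = {"GAPDH": 95, "Actb": 5}                    # illustrative gene names
species = {"GAPDH": "human", "Actb": "mouse"}
label, purity = species_purity(cell, species)
# label == "human", purity == 0.95 -> 5% of this cell's reads are noise
```

A good decontamination tool should push that purity toward 1.0 without changing which species (or cell type) the cell is assigned to.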
2. The "Complex City" Test (Real Tissues)
The Setup: Instead of a simple mix of two groups, they tested the tools on complex tissues like blood (PBMC), brain tissue (Prefrontal Cortex), and white blood cells (WBC). This is like trying to clean audio in a busy city with thousands of different conversations happening at once.
The Result:
- The good tools (CellBender, SoupX, DecontX) helped scientists find rare groups of cells that were previously hidden by the noise. For example, they helped identify Platelets (tiny blood cells) in the white blood cell dataset, which the "unclean" data had completely missed.
- The bad tools (scAR, CellClear) created fake cell types. In the brain dataset, scAR invented 8 new types of brain cells that didn't exist. In the blood dataset, it invented "Granulocytes" and "Platelets" out of thin air. It's as if a sound engineer looked at a silent room and said, "Ah, I hear a jazz band!" when there was nothing there.
3. The "Speed and Scale" Test
The Setup: They tested how fast the tools could clean massive datasets (up to 172,000 cells).
The Result:
- SoupX was the sprinter. It was incredibly fast and efficient.
- CellBender was a marathon runner. It was accurate but took a long time, especially on huge datasets.
- CellClear was the one who got stuck in traffic. It worked fine on small groups but effectively failed to scale: it took nearly 24 days to process one large dataset!
The Big Takeaway: Don't Trust the "Magic" Cleaners
The most important lesson from this paper is about Integrity vs. Sensitivity.
- Sensitivity is how well a tool removes the noise.
- Integrity is whether the tool keeps the original data honest.
The paper found that scAR and CellClear were very "sensitive" (they removed a lot of noise), but they had zero integrity. They didn't just clean the data; they restructured it. They replaced the original numbers with mathematically generated "fake" numbers. This is dangerous because scientists might publish a paper saying, "We discovered a new type of brain cell!" when in reality, the computer just made it up.
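The trade-off above can be sketched with two toy scores: "sensitivity" as the fraction of known ambient noise a tool removed, and "integrity" as how closely the cleaned counts still track the true signal. This is a hypothetical illustration with made-up numbers, not the paper's actual metrics.

```python
# Hypothetical sketch: why high sensitivity alone is not enough.
# All numbers are illustrative, not from the paper.

def sensitivity(original, corrected, ambient):
    """Fraction of the known ambient contamination that was removed.
    original, corrected, ambient: per-gene counts for one cell."""
    removed = sum(o - c for o, c in zip(original, corrected))
    total_ambient = sum(ambient)
    return min(removed / total_ambient, 1.0) if total_ambient else 0.0

def integrity(corrected, true_signal):
    """1 minus the total relative error against the true, noise-free counts.
    A tool that rewrites the data scores low even if it removed all noise."""
    err = sum(abs(c - t) for c, t in zip(corrected, true_signal))
    return 1.0 - err / max(sum(true_signal), 1)

true_signal = [90, 0, 10]          # what the cell really expressed
ambient     = [5, 5, 0]            # spilled "background" counts
original    = [95, 5, 10]          # what was actually measured

honest       = [90, 0, 10]         # noise removed, signal untouched
hallucinated = [60, 0, 40]         # noise removed, but counts rewritten

# honest:       sensitivity 1.0, integrity 1.0
# hallucinated: sensitivity 1.0, integrity only 0.4
```

Both tools look identical if you only measure noise removal; only the integrity score exposes that the second one invented counts that were never there.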
The Final Verdict: Who Should You Use?
The authors give a simple guide for scientists based on their situation:
If you have Droplet-based data (the most common type) and a powerful computer (GPU):
- Winner: CellBender. It's the most accurate and reliable.
- Runner-up: SoupX. It's fast, simple, and doesn't mess up the data, though it's slightly less aggressive at removing noise.
If you have "Well-Plate" data (a different technology) or only have "cleaned" data from a public database:
- Winner: DecontX. It's the only tool that doesn't need the raw, messy original data, and it's the one that works on these platforms.
Who to Avoid:
- scAR and CellClear. The paper strongly advises against using them for general analysis because they create "ghost" cells and fake data. They are like a photo editor who doesn't just remove a blemish but replaces your entire face with a generated image.
Summary
In the world of single-cell biology, cleaning your data is essential, but you must be careful not to over-correct. This paper warns us that some popular tools are so eager to remove noise that they start inventing new biological realities. The best approach is to choose a tool that respects the original data (like CellBender or SoupX) rather than one that tries to "fix" things by rewriting history.