scprocess: a pipeline for processing, integrating and visualising atlas-scale single cell data

The paper introduces scprocess, a Snakemake-based pipeline that automates and standardizes the processing, integration, and visualization of atlas-scale single-cell RNA sequencing data, particularly from 10x Genomics platforms, so that the resulting analyses are reproducible.

Original authors: Koderman, M., Pilarski, J., Bianco, E., Gonzalez, D., Robinson, M. D., Macnair, W.

Published 2026-03-13

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a librarian who has just received a massive shipment of books. But these aren't normal books; they are millions of tiny, fragile diaries written by individual cells in a human body. Your job is to read them, organize them, and figure out what each cell is doing.

In the past, scientists could only read a few diaries at a time. But now, thanks to new technology, they can read millions of them at once. This is called "atlas-scale" research. The problem? The sheer volume of data is overwhelming. It's like trying to sort a mountain of books by hand while the mountain keeps growing. It's messy, slow, and if you make a mistake in how you sort one pile, the whole library gets messed up.

Enter scprocess. Think of scprocess as a super-smart, automated robot librarian designed specifically to handle this mountain of cellular diaries.

Here is how this robot works, broken down into simple steps:

1. The Delivery (Reading the Raw Data)

When the raw data arrives, it's like a chaotic box of shredded paper. The robot first needs to put the pieces back together.

  • The Old Way: Traditionally, this was done by a slow, heavy-duty machine that used a lot of memory and took a long time to finish.
  • The scprocess Way: This robot uses a faster engine called alevin-fry. It's like switching from a slow-moving truck to a high-speed drone. It sorts the shredded papers (sequencing reads) and counts them quickly, using much less memory.

2. The Cleanup Crew (Filtering the Noise)

Not every piece of paper in the box is a real diary. Some are just empty wrappers (empty droplets), and some are just smudges of ink from a torn page (ambient RNA).

  • The Problem: If you don't clean this up, you might think a smudge is a real cell, or you might miss a real cell hiding in the noise.
  • The Solution: The robot has two cleaning modes. One is a heavy-duty vacuum (CellBender) that takes its time to be very precise. The other is a quick, efficient broom (DecontX) for when you need speed. It also has a special trick to spot "doublets": cases where two cells got stuck together in one wrapper, which would confuse the analysis.

3. The Quality Check (The Bouncer)

Now that we have the papers, the robot acts like a bouncer at a club. It checks the ID of every "cell."

  • It looks for signs of damage: Did the cell lose too much ink? Is it full of "mitochondrial" trash (like a cell that is dying)?
  • The Smart Twist: The robot knows that sometimes a "damaged" cell is actually just a stressed-out cell, not a bad one. It uses a special metric (looking at how much "spliced" vs. "unspliced" ink there is) to tell the difference between a real nucleus and a cell that accidentally leaked its cytoplasm. It lets the user set the rules so they don't accidentally throw out the VIPs (important cells).
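
For readers who want to see what the bouncer's checklist might look like, here is a minimal sketch in Python with NumPy. It is not the scprocess code. The function name `qc_filter`, the three rules, and all threshold values are hypothetical, and it assumes single-nucleus data in which an unusually high spliced fraction is treated as a sign of leaked cytoplasm.

```python
import numpy as np

def qc_filter(counts, is_mito, spliced, unspliced,
              min_umis=500, max_mito_frac=0.1, max_spliced_frac=0.75):
    """Return a boolean mask of cells that pass three user-set QC rules.

    counts:    (cells x genes) array of UMI counts
    is_mito:   boolean array marking mitochondrial genes
    spliced, unspliced: per-cell totals of spliced and unspliced UMIs
    All thresholds are illustrative placeholders, not scprocess defaults.
    """
    counts = np.asarray(counts, dtype=float)
    is_mito = np.asarray(is_mito, dtype=bool)
    spliced = np.asarray(spliced, dtype=float)
    unspliced = np.asarray(unspliced, dtype=float)

    # Rule 1: the cell must contain enough "ink" (total UMIs).
    total = counts.sum(axis=1)

    # Rule 2: the share of mitochondrial UMIs must not be too high.
    mito = counts[:, is_mito].sum(axis=1)
    mito_frac = np.divide(mito, total, out=np.zeros_like(total), where=total > 0)

    # Rule 3: the share of spliced UMIs must not be too high
    # (assumption: nuclei are expected to hold mostly unspliced RNA).
    sp_total = spliced + unspliced
    spliced_frac = np.divide(spliced, sp_total,
                             out=np.zeros_like(sp_total), where=sp_total > 0)

    return ((total >= min_umis)
            & (mito_frac <= max_mito_frac)
            & (spliced_frac <= max_spliced_frac))
```

Because every threshold is an argument, the user stays in charge of the door policy, which is the point the text makes about not throwing out the VIPs.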

4. The Grouping Party (Integration and Clustering)

Now we have millions of clean diaries. We need to group them. Which cells tell similar stories? Which ones belong to the same cell type?

  • The Challenge: If you try to compare millions of pages at once, your computer will crash.
  • The Solution: The robot doesn't read every single word. Instead, it picks out the most interesting sentences (Highly Variable Genes) that tell the story of what makes each cell unique. It ignores the boring, repetitive words that are the same in every cell.
  • The Party: It then throws a massive party where similar cells hang out together. It uses a "GPU" (a super-fast graphics card) to make this happen in minutes instead of days, creating a map where cells that look alike stand next to each other.
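
The "most interesting sentences" idea can be sketched in a few lines of Python with NumPy. This is a deliberately simple stand-in that ranks genes by their variance after normalisation. It is not the method scprocess uses, and the function name `top_variable_genes` and the default of 2000 genes are assumptions made for illustration.

```python
import numpy as np

def top_variable_genes(counts, n_top=2000):
    """Return the column indices of the n_top most variable genes.

    counts: (cells x genes) array of UMI counts.
    A simple variance ranking, used here only to illustrate the idea.
    """
    counts = np.asarray(counts, dtype=float)

    # Scale every cell to the same total so that big cells do not dominate,
    # then log-transform to tame very highly expressed genes.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    norm = np.log1p(counts / totals * 1e4)

    # Genes that are the same in every cell have variance near zero;
    # genes that differ between cells rise to the top of the ranking.
    variances = norm.var(axis=0)
    n_top = min(n_top, counts.shape[1])
    order = np.argsort(variances)[::-1]
    return np.sort(order[:n_top])
```

Keeping only the selected columns shrinks the table the later steps have to work on, which is why the "read only the interesting sentences" trick helps the computer cope.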

5. The Translator (Labeling the Cells)

Once the cells are grouped, the robot needs to give them names. "This group is a neuron," "That group is a muscle cell."

  • The Old Way: Scientists used to guess based on a few famous keywords.
  • The scprocess Way: It uses a statistical detective (pseudobulk analysis). Instead of asking every single cell what it is, it asks the "representative" of the group. This prevents the robot from getting confused by one noisy cell and gives a much more reliable answer. It can even use a pre-trained AI (like CellTypist) that has already read millions of other diaries to help guess the names.
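
Asking the "representative" of each group amounts to adding up the counts of all cells in that group. Here is a minimal sketch of that aggregation step in Python with NumPy. The function name `pseudobulk` is hypothetical, and the statistical comparison that would follow in a real pseudobulk analysis is left out.

```python
import numpy as np

def pseudobulk(counts, labels):
    """Sum UMI counts over all cells that share a group label.

    counts: (cells x genes) array of UMI counts
    labels: one group label per cell (for example, a cluster name)
    Returns the sorted unique labels and a (groups x genes) array of sums.
    """
    counts = np.asarray(counts)
    labels = np.asarray(labels)

    # Map every cell to the row of its group in the output table.
    groups, inverse = np.unique(labels, return_inverse=True)

    # Add each cell's counts to its group's row; np.add.at handles
    # the repeated row indices correctly.
    summed = np.zeros((len(groups), counts.shape[1]), dtype=counts.dtype)
    np.add.at(summed, inverse, counts)
    return groups, summed
```

One noisy cell barely moves a sum over hundreds of cells, which is the intuition behind the text's claim that the group-level answer is more reliable.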

6. The Zoom-In Feature (Subclustering)

Sometimes, the robot finds a group that looks like "Brain Cells," but it knows there's more detail hidden inside.

  • The Feature: You can tell the robot, "Zoom in on just the Brain Cells." It then re-runs the grouping steps on just that small group to find the tiny differences, such as a specific type of neuron that only wakes up when you are stressed.

Why is this important?

Before scprocess, organizing this data was like trying to build a library by hand, one book at a time, with no instruction manual. Every scientist did it slightly differently, making it impossible to compare their results.

scprocess is the instruction manual and the robot arm combined. It provides:

  1. Speed: It handles millions of cells without breaking a sweat.
  2. Reproducibility: If you run the same data through it twice, you get the exact same result.
  3. Transparency: It keeps a detailed log (HTML reports) of every decision it made, so you can see exactly how the library was organized.

In short, scprocess turns a chaotic mountain of cellular data into a beautifully organized, searchable library, allowing scientists to finally understand the complex stories written inside our bodies.
