dreampy: Pseudobulk mixed-model differential expression for single-cell RNA-seq in Python

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery: Which genes are acting differently in sick people compared to healthy people?

To do this, you have a massive library of clues (single-cell RNA data) from hundreds of people. But here's the catch: the clues aren't all independent. You have multiple clues from the same person, and those people might have been tested in different labs or at different times.

For a long time, scientists had a tool to solve this mystery, but it only spoke R, a specific programming language. If you were a Python programmer (the other major language in this field), you had to stop, translate your data into R, run the detective work, and then translate the results back. It was like trying to solve a puzzle while wearing a translator's headset that kept lagging.

Enter dreampy.

The Problem: The "Fake Clue" Trap

In the old days, scientists treated every single cell as a unique, independent clue. But that's a trap! If you take 1,000 cells from one person, they aren't 1,000 different people; they are just 1,000 copies of the same person's story. If you count them all as separate, you trick yourself into thinking you have more evidence than you do. This is called pseudoreplication, and it leads to false alarms.

The solution is Pseudobulk. Instead of looking at 1,000 individual cells, you smash them together into one "super-clue" for each person. Now you have one data point per person, which is the true unit of evidence.

The Solution: A New Detective Tool

The dreamlet framework (created in R) is the gold standard for analyzing these "super-clues." It uses a sophisticated method (linear mixed models combined with precision weights and statistical shrinkage) to handle the messy reality of biology:

Batch effects: When samples are processed on a different day, in a different lab, or on a different machine, the technical difference can look like a biological change.
Repeated measures: When the same person is tested multiple times.
Hierarchical structure: People are nested in groups, which are nested in batches.

However, dreamlet only lives in the R world.

dreampy is the new Python version of this exact same detective tool. It doesn't invent a new way to solve the mystery; it just speaks the language that Python users already know.

How It Works (The Analogy)

Think of the analysis pipeline as a factory assembly line for turning raw data into answers.

The Old Way (R dreamlet): The factory is a black box. You dump your raw data in one end, and the answer comes out the other. Inside, there are seven different machines (packages) working together, but you can't see or touch them individually. If something goes wrong, you have to guess where.
The New Way (dreampy): The factory is now a glass-walled workshop. Every single step is a separate, transparent station:
- Station 1: Smash cells together (Pseudobulk).
- Station 2: Clean the data (Filtering).
- Station 3: Normalize for library size (TMM).
- Station 4: Do the heavy math (Linear Mixed Models).
- Station 5: Shrink the noise (Empirical Bayes).
- Station 6: Print the report.

Because every station is a separate function in Python, a scientist can stop at any point, inspect the data, tweak the settings, or swap out a machine if they need to. It's like having a car where you can pop the hood and see exactly how the engine is running, rather than just pressing a "Go" button.

The Big Win: Saving Lost Data

The paper demonstrates the power of this tool with a real-world example involving Lupus (an autoimmune disease).

In a previous study, scientists had to throw away data from 50 healthy people because of a "glitch" in their math. All of those healthy people had been processed in a batch of their own, so the model could not tell a batch difference apart from a disease difference, and the scientists had to delete them to avoid confusion. This made their study much weaker.

When the authors used dreampy with a mixed-model approach:

They didn't have to delete the 50 healthy people.
The math could handle the "glitch" automatically by treating the groups as random variations rather than fixed rules.
Result: They recovered the missing data, roughly doubled the number of differentially expressed genes they could detect, and found a clear, consistent "Interferon" response (a known Lupus signature) that had been missed or only weakly detected before.

Why This Matters

No More Language Switching: Python users can now use the most advanced statistical tools without leaving their ecosystem.
Transparency: You can see exactly how the math is being done at every step.
Better Science: By using the right math (mixed models), we can include more data, get more accurate results, and avoid throwing away valuable clues.

In short, dreampy is the bridge that brings the most powerful statistical detective work from the R world into the Python world, making it easier, faster, and more transparent for everyone to find the truth in complex biological data.

1. Problem Statement

Single-cell RNA-seq (scRNA-seq) studies now routinely profile hundreds of thousands of cells across hundreds of donors. A central challenge in analyzing this data is differential expression (DE) testing that correctly accounts for the hierarchical structure of the data (cells nested within donors, donors across batches/tissues).

The Pseudoreplication Issue: Early methods treated individual cells as independent observations, leading to inflated false positive rates because cells from the same donor are not independent.
The Preferred Solution: The field has converged on pseudobulk aggregation (summing counts per donor-cell type) followed by bulk RNA-seq statistical frameworks.
The Current Gap: The state-of-the-art framework for this, dreamlet (based on limma-voom and linear mixed models), exists entirely within the R/Bioconductor ecosystem. It relies on a complex chain of seven external R packages (edgeR, limma, lme4, lmerTest, pbkrtest, fANCOVA, variancePartition).
The Consequence: Researchers working primarily in Python (the standard environment for scRNA-seq preprocessing via scanpy and scverse) must export data to R, run the analysis, and import results back. This "language-switching" workflow hinders reproducibility, interactive exploration, and integration with downstream Python-based analyses. Existing Python tools (e.g., PyDESeq2, edgePython, InMoose) lack the specific combination of voom precision weighting, linear mixed models, and empirical Bayes moderation required for robust pseudobulk DE.
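The pseudobulk aggregation described above (summing counts per donor-cell type) can be illustrated with a minimal NumPy sketch. The function name `sum_counts_per_group` and the toy arrays are hypothetical and are not part of dreampy's API; the sketch only shows the summing idea, not the package's AnnData handling.

```python
import numpy as np

def sum_counts_per_group(counts, donors, cell_types):
    """Sum raw counts over all cells that share a (donor, cell type) pair.

    counts: (n_cells, n_genes) array of raw counts.
    donors, cell_types: length-n_cells sequences of labels.
    Returns (groups, pseudobulk): a sorted list of (donor, cell_type)
    tuples and an (n_groups, n_genes) array of summed counts.
    """
    counts = np.asarray(counts)
    keys = list(zip(donors, cell_types))
    groups = sorted(set(keys))
    row_of = {group: i for i, group in enumerate(groups)}
    rows = np.array([row_of[key] for key in keys])

    pseudobulk = np.zeros((len(groups), counts.shape[1]), dtype=counts.dtype)
    # Unbuffered add: every cell's counts are accumulated into its group's row.
    np.add.at(pseudobulk, rows, counts)
    return groups, pseudobulk

# Toy data: 5 cells, 2 genes, 2 donors, 2 cell types (hypothetical values).
counts = np.array([[1, 0], [2, 3], [0, 5], [4, 1], [1, 1]])
donors = ["d1", "d1", "d1", "d2", "d2"]
cell_types = ["T", "T", "B", "T", "T"]

groups, pseudobulk = sum_counts_per_group(counts, donors, cell_types)
for group, row in zip(groups, pseudobulk):
    print(group, row)
```

After aggregation each donor contributes one row per cell type, so the downstream model sees donors, not cells, as the unit of replication.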

2. Methodology

dreampy is a native Python implementation that reproduces the full dreamlet pipeline. It integrates directly with AnnData (the standard data structure in the scverse ecosystem) and uses Python's scientific stack (NumPy, SciPy, pandas, Py-BOBYQA).

Pipeline Architecture

Unlike R dreamlet, which bundles most steps behind two entry points (processAssays and dreamlet), dreampy decomposes the workflow into nine composable, individually callable functions:

aggregate_pseudobulk(): Sums raw counts across cells for each donor–cell type combination.
filter_samples(): Removes samples with low cell counts or cell types with insufficient samples.
compute_tmm_factors(): Estimates Trimmed Mean of M-values (TMM) normalization factors.
filter_by_expr(): Filters lowly expressed genes based on library size and experimental design (reimplementing edgeR::filterByExpr).
log2cpm(): Transforms counts to log2 counts per million (with a prior count of 0.5).
estimate_weights(): Performs voom mean–variance modeling. It fits a nonparametric smooth (loess/lowess) to the square-root residual standard deviation vs. mean log-count to derive precision weights.
fit_models(): Fits precision-weighted linear models.
- Fixed effects only: Weighted Least Squares (WLS).
- With random effects: Weighted Linear Mixed Models (LMMs) fitted by REML (Restricted Maximum Likelihood) using the BOBYQA derivative-free optimizer.
- Degrees of Freedom: Uses the Satterthwaite approximation (default) or the Kenward-Roger correction for denominator degrees of freedom.
ebayes(): Applies Empirical Bayes moderation to shrink gene-wise residual variances toward a common prior, stabilizing inference for genes with low degrees of freedom.
get_results(): Extracts final results (coefficients, moderated t-statistics, p-values, adjusted p-values, log-odds).
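Three of the stages listed above reduce to short formulas, sketched here in NumPy under stated assumptions. The log2-CPM transform uses the standard voom form with a prior count of 0.5, the fixed-effects fit is plain weighted least squares, and the variance shrinkage uses the usual limma posterior-variance formula with the prior parameters supplied by hand (in practice they are estimated from all genes). The function names are hypothetical and do not mirror dreampy's signatures.

```python
import numpy as np

def log2_cpm(counts, lib_sizes, prior_count=0.5):
    """voom-style transform: log2((count + 0.5) / (lib_size + 1) * 1e6).

    counts: (n_samples, n_genes); lib_sizes: (n_samples,).
    Assumption: lib_sizes already include any normalization factors.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = np.asarray(lib_sizes, dtype=float)
    return np.log2((counts + prior_count) / (lib_sizes[:, None] + 1.0) * 1e6)

def weighted_least_squares(X, y, weights):
    """Fit y ~ X with per-observation precision weights for one gene."""
    sqrt_w = np.sqrt(weights)
    # Scaling rows by sqrt(w) turns WLS into an ordinary least-squares problem.
    beta, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    residuals = y - X @ beta
    df = X.shape[0] - X.shape[1]
    s2 = np.sum(weights * residuals**2) / df  # weighted residual variance
    return beta, s2, df

def moderated_variance(s2, df, s2_prior, df_prior):
    """Shrink a gene-wise variance toward a common prior variance."""
    return (df_prior * s2_prior + df * s2) / (df_prior + df)

# Toy data: 4 samples, 2 genes, two groups (hypothetical values).
counts = np.array([[10, 0], [20, 5], [15, 1], [30, 2]])
lib_sizes = counts.sum(axis=1)
y = log2_cpm(counts, lib_sizes)[:, 0]
X = np.column_stack([np.ones(4), [0, 0, 1, 1]])  # intercept + group indicator
weights = np.array([1.0, 2.0, 1.5, 0.5])          # stand-in for voom weights

beta, s2, df = weighted_least_squares(X, y, weights)
# Assumption: prior variance and prior degrees of freedom are given.
s2_post = moderated_variance(s2, df, s2_prior=0.05, df_prior=4.0)
print(beta, s2, s2_post)
```

The shrinkage step matters most when `df` is small: the posterior variance is then dominated by the prior, which stabilizes the moderated t-statistic.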

Key Design Decisions

Cold Start Initialization: Unlike R dreamlet, which "warm-starts" the optimizer for each gene using the previous gene's parameters (introducing gene-order dependency), dreampy uses a cold-start approach with independent initial values for each gene. This ensures deterministic results regardless of parallelization strategy.
REML Consistency: dreampy uses REML for both weight estimation and model fitting, whereas R dreamlet uses Maximum Likelihood (ML) for weights and REML for fitting. This provides a more uniform default for variance component estimation.
Collinearity Handling: dreampy explicitly detects and drops perfectly collinear random-effect terms (e.g., a donor appearing in only one batch) before fitting, preventing convergence failures common in R's optimizer when handling degenerate parameterizations.
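One simple way to spot perfectly collinear grouping factors is to check whether their levels map one-to-one, in which case their indicator columns span the same space and one term carries no extra information. The sketch below shows that check in plain Python. It is an illustration under that assumption, not dreampy's actual detection rule, which this summary does not describe, and the function names and toy labels are hypothetical.

```python
def is_one_to_one(labels_a, labels_b):
    """True if every level of a pairs with exactly one level of b, and vice versa."""
    pairs = set(zip(labels_a, labels_b))
    return len(pairs) == len(set(labels_a)) == len(set(labels_b))

def drop_redundant_factors(factors):
    """Keep the first factor of every one-to-one set and report the rest.

    factors: dict mapping a factor name to a per-sample label sequence.
    Returns (kept, dropped): a dict of retained factors and a list of names.
    """
    kept, dropped = {}, []
    for name, labels in factors.items():
        if any(is_one_to_one(labels, other) for other in kept.values()):
            dropped.append(name)
        else:
            kept[name] = labels
    return kept, dropped

# Toy case (hypothetical): every batch holds exactly one donor.
factors = {
    "donor_id": ["d1", "d2", "d3", "d4"],
    "batch": ["b1", "b2", "b3", "b4"],
    "cohort": ["c1", "c1", "c2", "c2"],
}
kept, dropped = drop_redundant_factors(factors)
print(sorted(kept), dropped)
```

Dropping the redundant term before fitting avoids asking the optimizer to split one source of variance between two indistinguishable components.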

3. Key Contributions

Native Python Implementation: Provides the first complete, native Python implementation of the limma-voom linear mixed model pipeline for pseudobulk DE.
Ecosystem Integration: Seamlessly integrates with AnnData and the scverse ecosystem, eliminating the need for R/Python data exchange.
Transparency and Modularity: By exposing every pipeline stage as a distinct function, dreampy allows users to inspect intermediate results (e.g., TMM factors, voom weights, variance components), facilitating debugging and customization for non-standard experimental designs.
Statistical Parity: Demonstrates near-identical statistical outputs to the R reference implementation.

4. Results

The authors validated dreampy against R dreamlet using two published datasets:

A. Cross-Language Validation

Wells et al. (2025) Dataset (T cell aging):
- Analyzed 13 T cell assays (41–153 pseudobulk samples).
- Correlation: Pearson correlations reached $r = 0.9999997$ for adjusted p-values and $r = 1.0000000$ for TMM factors.
- Accuracy: 332 of 351 metric tests passed at a correlation floor of $r \ge 0.999$. Failures were primarily due to optimizer boundary behavior on small-sample subtypes or floating-point tie-breaking.
Perez et al. (2022) Dataset (Lupus):
- Analyzed 10 cell types across 261 donors.
- Accuracy: 249 of 270 tests passed. Discrepancies in the plasmablast assay were due to dreampy's explicit handling of collinearity (dropping a term) vs. R's retention of the full parameterization. Despite intermediate differences, final DE calls were concordant.
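The pass/fail criterion used in these comparisons, a Pearson correlation at or above a floor of 0.999, can be sketched as follows. The function name and the toy vectors are hypothetical; the authors' actual test harness is not shown in this summary.

```python
import numpy as np

def passes_correlation_floor(reference, candidate, floor=0.999):
    """Return (r, passed) for the Pearson correlation of two result vectors."""
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    r = np.corrcoef(reference, candidate)[0, 1]
    return r, bool(r >= floor)

# Toy vectors standing in for one metric (e.g. log fold changes) from each tool.
r_values = np.array([0.12, -1.5, 2.3, 0.8, -0.4])
py_values = r_values + np.array([1e-6, -2e-6, 1e-6, 0.0, 3e-6])

r, passed = passes_correlation_floor(r_values, py_values)
print(passed)
```

A correlation floor checks agreement in ranking and scale but not exact equality, which is why the text above separately notes that final DE calls were concordant.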

B. Biological Application: Lupus Cohort Reanalysis

Scenario: The original Perez et al. analysis used a fixed-effects model that forced the exclusion of 50 healthy controls (ImmVar cohort) because their processing batch was perfectly aliased with disease status.
dreampy Approach: Used a mixed-effects model (~sle + (1|donor_id) + (1|Processing_Cohort)) to treat batch as a random effect, recovering the excluded controls.
Outcome:
- Increased Power: Recovering the 50 controls roughly doubled the number of detected differentially expressed (DE) genes across major cell types (e.g., from 2,084 to 3,905 in classical monocytes).
- Robustness: The full-cohort analysis identified a coherent interferon-stimulated gene (ISG) signature across all 8 immune cell types, confirming known lupus biology that was missed or underpowered in the original analysis.
- Effect Size Correction: The full cohort provided more accurate log2FC estimates, correcting "winner's curse" inflation seen in the underpowered subset.
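Why perfect aliasing breaks a fixed-effects model can be seen from the rank of the design matrix. The sketch below uses a deliberately tiny, hypothetical layout with two cohorts (it is not the Perez et al. design): all controls sit in one cohort, so the cohort indicator is a linear combination of the intercept and the disease column.

```python
import numpy as np

# Toy layout (hypothetical): three SLE samples in cohort A,
# three healthy controls in cohort B.
sle = np.array([1, 1, 1, 0, 0, 0])
cohort_b = np.array([0, 0, 0, 1, 1, 1])
intercept = np.ones_like(sle)

# Fixed-effects design: intercept + disease status + cohort indicator.
X_fixed = np.column_stack([intercept, sle, cohort_b])
rank = np.linalg.matrix_rank(X_fixed)
print(rank, X_fixed.shape[1])  # rank is below the number of columns

# Fixed-effects part of a model that treats cohort as a random intercept.
X_mixed = X_fixed[:, :2]
print(np.linalg.matrix_rank(X_mixed), X_mixed.shape[1])
```

Because `intercept = sle + cohort_b`, the three-column design cannot separate the disease effect from the cohort effect. With the cohort modeled as a random intercept, as in the formula above, the fixed-effects part reduces to the first two columns, which are full rank. The toy has only two cohorts, so it illustrates the rank problem, not how well a mixed model would separate cohort from disease in the real data.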

C. Performance

Benchmarks on an Apple M4 Max gave mixed results: dreampy was faster in preprocessing but slower in model fitting for some datasets, which the authors attribute to the lack of warm-starting. The speed difference is dataset-dependent, and warm-starting is planned for future versions.

5. Significance

Bridging the Ecosystem Gap: dreampy fills a critical void in the Python single-cell ecosystem, allowing researchers to perform state-of-the-art, statistically rigorous DE analysis without leaving Python.
Methodological Flexibility: By enabling mixed-effects modeling natively in Python, it allows researchers to properly account for complex experimental designs (repeated measures, batch effects) that fixed-effect models cannot handle without data loss.
Reproducibility and Transparency: The modular architecture promotes better scientific practice by making intermediate statistical steps inspectable, unlike the "black box" nature of the bundled R functions.
Validation of LLM-Assisted Development: The paper includes a reflection on using Large Language Models (LLMs) for the implementation. It highlights that while LLMs can accelerate the translation of established statistical code (R to Python), rigorous validation against ground truth (the R implementation) and domain expertise remain essential to ensure correctness.

In summary, dreampy democratizes access to advanced mixed-model differential expression analysis for the Python-centric single-cell community, ensuring that statistical best practices are accessible without language barriers.