found: Inferring cell-level perturbation from… — Plain-Language Explanation

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery in a massive, crowded city (the single-cell data). In this city, there are millions of people (cells), and you know that a specific event happened in the city (a perturbation, like a virus infection or a drug treatment).

However, there's a problem: the police report (the sample label) only says, "Something happened in this neighborhood." It doesn't tell you which specific people in that neighborhood were actually affected.

In reality, the event might have only shaken up a few people, while the rest of the neighborhood went about their day as usual. If you treat everyone in that neighborhood as "affected," you dilute the signal, making it hard to find the real clues. This is the problem scientists face with single-cell data: the "noise" of unaffected cells hides the "signal" of the affected ones.

The Old Way vs. The New Way

The Old Way (HiDDEN):
Previously, a method called HiDDEN was invented to solve this. It works like a sharp-eyed observer who scans the crowd and guesses, "Hey, you look a bit shaken up, you must have been affected," while saying to others, "You look fine, you're probably just bystanders." It creates a "suspicion score" for every single person.

The New Tool (found):
This paper introduces found. Think of found not as a new detective, but as a high-tech detective's toolkit that makes HiDDEN easier to use, faster, and more customizable.

Here is how found works, using simple analogies:

1. The Flexible Workshop (Customization)

Imagine HiDDEN is a recipe for a cake. The original recipe said, "Use flour, sugar, and eggs." But sometimes, you need a gluten-free cake, or maybe you want to use almond flour instead.

found is like a modular kitchen. It lets you swap out the ingredients (the mathematical methods) easily.
Do you want to use a different way to organize the data? Swap the "flour."
Do you want a different way to calculate the "suspicion score"? Swap the "sugar."
It allows scientists to tweak the recipe until it tastes perfect for their specific dataset, rather than being stuck with one rigid method.

2. The "Tuning" Knob (Hyperparameters)

When you drive a car, you have to adjust the seat, mirrors, and steering wheel to fit your body. If you don't, the car handles poorly.

In data science, these adjustments are called hyperparameters (like how many dimensions to look at, or how to group the cells).
found comes with an automatic adjustment mode. It tries out different "seat positions" on its own to find the one that gives the smoothest ride (the best results) for your specific data.

3. The Universal Translator (Python & R)

Scientists speak two main languages: Python and R. Often, a tool is built in one language, and scientists who speak the other have to struggle to use it, or worse, the tool behaves slightly differently in each version.

found is like a perfectly synchronized bilingual assistant. It was built in Python first, and then a special bridge was built so the R version speaks exactly the same language. You can switch between the two without the tool changing its mind or breaking.

4. The "Truth Filter" (The Results)

The paper tested this toolkit on real data (like immune cells reacting to a virus).

Without found: It's like looking at a blurry photo of a crowd. You see a general blur of "sickness," but you can't tell who is actually sick.
With found: The toolkit sharpens the image. It filters out the healthy people who were just standing nearby. Suddenly, the "sick" people stand out clearly.
The Result: When scientists used this filtered list to find which genes were reacting to the virus, they found many more important clues (genes) than they did before. It's like finding a needle in a haystack by first clearing away most of the hay.

Why Does This Matter?

In the past, if a drug only worked on 10% of the cells in a sample, scientists might have missed it entirely because the other 90% "drowned out" the signal.

found changes the game by:

Making it accessible: You don't need to be a math wizard to use it; the toolkit handles the heavy lifting.
Making it robust: It helps scientists avoid "bad guesses" by testing different settings and showing them which one works best.
Finding the hidden: It allows researchers to see subtle, weak, or rare biological effects that were previously invisible.

In a nutshell: found is a flexible, user-friendly, and powerful toolbox that helps scientists clean up the "static" in their data, allowing them to hear the quiet whispers of biological change that were previously lost in the noise.

1. Problem Statement

Single-cell RNA sequencing (scRNA-seq) allows for cellular-resolution analysis, but a significant challenge arises in case-control studies where perturbation labels (e.g., "treated" vs. "control") are assigned at the sample level rather than the cell level.

The Mismatch: In many biological scenarios, perturbations affect only a subset of cells within a sample (heterogeneous effects), or the signal is weak.
The Consequence: Assigning a single "case" label to all cells in a treated sample dilutes the signal. Standard differential expression analysis often fails to detect subtle or heterogeneous perturbation effects, particularly in rare cell populations.
Limitations of Existing Solutions: Current strategies rely on experimental enrichment (e.g., FACS sorting) or in silico filtering, both of which require prior knowledge of the affected cell populations or strong signals, making them unsuitable for exploratory analysis.
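The dilution effect can be made concrete with a short worked equation. This is an illustrative simplification, not a formula from the paper: assume a fraction $\pi$ of case cells is affected, a gene has mean expression $\mu$ in unaffected cells and $\mu + \delta$ in affected cells, and unaffected case cells look like control cells (the symbols $\pi$, $\mu$, and $\delta$ are introduced here only for illustration). The expected difference between the case and control sample means is then

$$\mathbb{E}[\bar{x}_{\text{case}}] - \mathbb{E}[\bar{x}_{\text{ctrl}}] = \pi(\mu + \delta) + (1 - \pi)\mu - \mu = \pi\delta.$$

With $\pi = 0.1$, the shift seen at the sample level is only one tenth of the true per-cell effect $\delta$, while the cell-to-cell variability is essentially unchanged.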

2. Methodology: The HiDDEN Framework

The paper introduces found, a comprehensive Python and R implementation of the HiDDEN (Hierarchical Disentangling of Differential Expression Noise) framework. HiDDEN reframes the problem as a latent variable inference task to recover cell-level perturbation signals from noisy sample-level labels.

The pipeline consists of three modular stages, plus a grouping strategy that controls how they are applied:

Embedding (Dimensionality Reduction):
- High-dimensional count matrices are transformed into lower-dimensional embeddings to handle collinearity and technical noise.
- Options: PCA, NMF, scVI, with optional Harmony adjustment for batch effects.
- Hyperparameter: Embedding dimensionality ($k$).
Continuous Scoring (Regression):
- A predictive model uses the embeddings and sample-level labels to assign a continuous perturbation score ($\hat{p}$) to each cell (ranging from 0 to 1).
- Options: Logistic Regression, Random Forest, SVM.
Binarization (Discretization):
- Optional step to convert continuous scores into binary labels ("affected" vs. "unaffected") using clustering.
- Options: K-means, Gaussian Mixture Models (GMM).
Grouping Strategy:
- The pipeline can be run jointly across all cells or separately for specific cell types (e.g., splitting by cell_type metadata).
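The stages above can be sketched with plain scikit-learn on synthetic data. This is a minimal illustration of the general idea (PCA, then logistic regression, then K-means), not the found API: every variable name is hypothetical, the data are simulated, and clustering only the scores of case cells is an assumption made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a normalised expression matrix (cells x genes).
n_ctrl, n_case, n_genes = 500, 500, 50
X = rng.normal(size=(n_ctrl + n_case, n_genes))
sample_label = np.r_[np.zeros(n_ctrl, dtype=int), np.ones(n_case, dtype=int)]

# Ground truth for the simulation: only 30% of case cells are affected.
truly_affected = np.zeros(n_ctrl + n_case, dtype=bool)
truly_affected[n_ctrl:n_ctrl + 150] = True
X[truly_affected, :5] += 3.0  # shift the first 5 genes in affected cells

# Stage 1: embedding with k components.
k = 10
Z = PCA(n_components=k, random_state=0).fit_transform(X)

# Stage 2: continuous score p_hat, trained on the noisy sample-level labels.
clf = LogisticRegression(max_iter=1000).fit(Z, sample_label)
p_hat = clf.predict_proba(Z)[:, 1]

# Stage 3: binarize by clustering the case cells' scores into two groups
# (assumption of this sketch: control cells keep label 0).
case_idx = np.flatnonzero(sample_label == 1)
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(p_hat[case_idx].reshape(-1, 1))
high_cluster = int(np.argmax(km.cluster_centers_.ravel()))
refined_label = np.zeros_like(sample_label)
refined_label[case_idx[km.labels_ == high_cluster]] = 1

agreement = (refined_label.astype(bool) == truly_affected).mean()
print(f"refined labels agree with simulated ground truth for {agreement:.1%} of cells")
```

To mimic the per-group strategy, the same three stages could be run separately on the subset of cells belonging to each cell type and the resulting scores concatenated.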

The found Implementation:

Flexibility: Unlike a fixed pipeline, found exposes HiDDEN as a composable framework, allowing users to swap methods at each stage.
Interoperability: It supports AnnData (Python) and SingleCellExperiment/Seurat (R) objects. The R package acts as a wrapper around the Python backend to ensure functional consistency.
Automation: Includes entry points for automatic hyperparameter selection (HiDDENt) and automatic grouping (HiDDENg).
Visualization: A dedicated found.pl module for evaluating outputs (e.g., violin plots of scores, bar plots of refined labels).

3. Key Contributions

Software Release: The release of found, a robust, open-source library (Python/R) that makes the HiDDEN framework accessible, reproducible, and customizable.
Systematic Benchmarking: The authors performed extensive benchmarking across 10 diverse datasets (covering human/mouse, brain/blood, various diseases, and stimulation conditions) to evaluate the impact of modeling choices.
Methodological Insights:
- Regression: Logistic Regression is identified as the superior method for generating $\hat{p}$. Random Forests tend to overfit (producing binary 0/1 distributions), and SVMs produce dense distributions around zero, neither of which captures the "disease continuum" effectively.
- Embedding: Shifted-logarithm transformed data with PCA is recommended as the standard due to favorable scaling and robustness.
- Hyperparameters: Performance is highly sensitive to the embedding dimensionality ($k$) and the grouping strategy. There is no universal "best" $k$; it must be tuned per dataset.
- Binarization: K-means is recommended over GMM due to comparable performance but superior runtime and memory efficiency.
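One way to see why a very flexible classifier can make a poor score generator is to check how closely its in-sample scores simply reproduce the noisy sample labels. The sketch below uses simulated data and default scikit-learn settings; it is an illustration of the overfitting point above, not a reproduction of the paper's benchmark, and all names and numbers are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic embedding: 1000 control cells and 1000 case cells in 10 dimensions.
n, k = 1000, 10
Z = rng.normal(size=(2 * n, k))
y = np.r_[np.zeros(n, dtype=int), np.ones(n, dtype=int)]
Z[n:n + 300, 0] += 3.0  # only 30% of case cells carry a real shift

lr = LogisticRegression(max_iter=1000).fit(Z, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Z, y)

# Scores are computed on the same cells the models were trained on,
# as in a label-refinement setting.
p_lr = lr.predict_proba(Z)[:, 1]
p_rf = rf.predict_proba(Z)[:, 1]

# Fraction of cells whose thresholded score just returns the sample label.
acc_lr = ((p_lr > 0.5).astype(int) == y).mean()
acc_rf = ((p_rf > 0.5).astype(int) == y).mean()
print(f"logistic regression reproduces the sample label for {acc_lr:.0%} of cells")
print(f"random forest reproduces the sample label for {acc_rf:.0%} of cells")
```

In this simulation the random forest memorizes nearly every sample label, including those of the 70% of case cells that are indistinguishable from controls, so its scores carry little information about which case cells are actually affected. The linear model cannot memorize and leaves those cells with intermediate scores.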

4. Results

Downstream Sensitivity: In a case study using IL-15 stimulated PBMCs, HiDDEN-derived scores significantly improved downstream Differential Gene Expression (DEG) analysis.
- Using continuous scores ($\hat{p}$) as a covariate in regression models revealed significantly more regulated genes across cell types compared to standard batch-level analysis.
- Using refined binary labels to filter out "unaffected" case cells before pseudobulking increased the number of detected DEGs by removing noise from the case group.
Benchmarking Findings:
- Data Dependency: Optimal modeling choices (especially $k$ and grouping) vary significantly between datasets. A "one-size-fits-all" approach is ineffective.
- Grouping: Splitting by cell identity generally yields better scores on larger datasets but can degrade performance on smaller datasets.
- Efficiency: The found pipeline is computationally efficient, with runtime and memory usage scaling predictably with dataset size (as shown in Supplemental Fig. 3).
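The label-filtering step described under Downstream Sensitivity can be sketched with pandas: drop case cells whose refined label is "unaffected", then sum counts per sample to form pseudobulk profiles. The table layout and the column names (sample, condition, refined_label) are hypothetical and chosen only for this illustration; the counts are simulated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical per-cell count table and metadata (12 cells, 3 genes).
n_cells, n_genes = 12, 3
genes = [f"gene_{i}" for i in range(n_genes)]
counts = pd.DataFrame(rng.poisson(5, size=(n_cells, n_genes)), columns=genes)
meta = pd.DataFrame({
    "sample": ["ctrl_1"] * 3 + ["ctrl_2"] * 3 + ["case_1"] * 3 + ["case_2"] * 3,
    "condition": ["control"] * 6 + ["case"] * 6,
    # refined_label: 1 = inferred affected, 0 = inferred unaffected
    "refined_label": [0] * 6 + [1, 0, 1, 1, 1, 0],
})

# Keep every control cell, and only the case cells inferred to be affected.
keep = (meta["condition"] == "control") | (meta["refined_label"] == 1)

# Pseudobulk: sum counts over the retained cells of each sample.
pseudobulk = counts[keep].groupby(meta.loc[keep, "sample"]).sum()
print(pseudobulk)
```

The resulting samples-by-genes table is what a pseudobulk differential expression tool would take as input, with the case samples now built only from cells inferred to be affected.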

5. Significance

The paper addresses a critical bottleneck in single-cell analysis: the inability to detect heterogeneous perturbation effects when labels are noisy or sample-level.

Practical Utility: found provides a "plug-and-play" yet highly customizable tool for researchers to refine labels without prior knowledge of affected populations.
Robustness: By demonstrating that modeling choices drastically alter outcomes, the authors advocate for a systematic evaluation approach (using the provided tuning and visualization tools) rather than default settings.
Accessibility: By supporting both Python and R ecosystems and integrating with standard single-cell data structures, found lowers the barrier to entry for method developers and domain scientists alike, enabling the discovery of subtle biological signals that were previously undetectable.

In summary, found transforms HiDDEN from a theoretical concept into a practical, accessible tool for refining single-cell perturbation analysis, emphasizing that careful selection of modeling parameters is essential for robust biological discovery.

found: Inferring cell-level perturbation from structured label noise in single-cell data