found: Inferring cell-level perturbation from structured label noise in single-cell data

This paper introduces **found**, a flexible Python and R implementation of the HiDDEN method that enables robust inference of cell-level perturbations from structured batch-level label noise in single-cell data through customizable pipelines and systematic benchmarking of modeling choices.

Original authors: Afanasiev, E., Goeva, A.

Published 2026-04-14

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery in a massive, crowded city (the single-cell data). In this city, there are millions of people (cells), and you know that a specific event happened in the city (a perturbation, like a virus infection or a drug treatment).

However, there's a problem: the police report (the sample label) only says, "Something happened in this neighborhood." It doesn't tell you which specific people in that neighborhood were actually affected.

In reality, the event might have only shaken up a few people, while the rest of the neighborhood went about their day as usual. If you treat everyone in that neighborhood as "affected," you dilute the signal, making it hard to find the real clues. This is the problem scientists face with single-cell data: the "noise" of unaffected cells hides the "signal" of the affected ones.

The Old Way vs. The New Way

The Old Way (HiDDEN):
HiDDEN is an earlier method designed to solve this problem. It looks at the crowd and tries to guess, "Hey, you look a bit shaken up, you must have been affected," while saying to others, "You look fine, you're probably just bystanders." It assigns a "suspicion score" to every single person (cell).
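To make the "suspicion score" idea concrete, here is a minimal sketch in Python with NumPy. It is not the HiDDEN or found implementation, whose modeling choices this summary does not spell out. It assumes one simple way of getting a per-cell score: train a classifier to predict the neighborhood-level label, then read off each cell's predicted probability. The simulated data and the names `fit_logistic` and `suspicion_scores` are illustrative only.

```python
import numpy as np


def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Plain gradient-descent logistic regression (a stand-in classifier)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def suspicion_scores(X, sample_labels):
    """Score each cell by its predicted probability of coming from the case sample."""
    w, b = fit_logistic(X, sample_labels)
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


# Simulated data (an assumption for illustration): 200 control cells and
# 200 case cells, of which only the first 40 case cells actually respond.
rng = np.random.default_rng(0)
n_cells, n_features = 200, 5
control = rng.normal(0, 1, (n_cells, n_features))
case = rng.normal(0, 1, (n_cells, n_features))
truly_affected = np.zeros(n_cells, dtype=bool)
truly_affected[:40] = True
case[truly_affected, 0] += 3.0  # the perturbation shifts one feature

X = np.vstack([control, case])
# The "police report": every case cell is labeled 1, affected or not.
sample_labels = np.concatenate([np.zeros(n_cells), np.ones(n_cells)])

scores = suspicion_scores(X, sample_labels)
case_scores = scores[n_cells:]
print("mean score, truly affected:", case_scores[truly_affected].mean())
print("mean score, bystanders:    ", case_scores[~truly_affected].mean())
```

In this simulation the classifier only ever sees the sample-level label, yet the cells that truly carry the shift end up with higher scores on average than the bystanders from the same sample.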

The New Tool (found):
The paper introduces found. Think of found not as a new detective, but as a detective's high-tech toolkit that makes HiDDEN easier to use, faster, and more customizable.

Here is how found works, using simple analogies:

1. The Flexible Workshop (Customization)

Imagine HiDDEN is a recipe for a cake. The original recipe said, "Use flour, sugar, and eggs." But sometimes, you need a gluten-free cake, or maybe you want to use almond flour instead.

  • found is like a modular kitchen. It lets you swap out the ingredients (the mathematical methods) easily.
  • Do you want to use a different way to organize the data? Swap the "flour."
  • Do you want a different way to calculate the "suspicion score"? Swap the "sugar."
  • It allows scientists to tweak the recipe until it tastes perfect for their specific dataset, rather than being stuck with one rigid method.
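The "modular kitchen" idea can be sketched as a pipeline whose steps are plain functions that can be swapped independently. This is a hedged illustration, not found's actual API: every function name here is hypothetical, and the two "flour" and two "sugar" options are simple stand-ins chosen for brevity.

```python
import numpy as np


def embed_all_features(X):
    """'Flour' option A: use the data as-is."""
    return X


def embed_top_variance(X, k=2):
    """'Flour' option B: keep only the k highest-variance features."""
    top = np.argsort(X.var(axis=0))[-k:]
    return X[:, top]


def score_centroid_projection(Z, labels):
    """'Sugar' option A: project cells onto the control-to-case direction."""
    direction = Z[labels == 1].mean(axis=0) - Z[labels == 0].mean(axis=0)
    raw = Z @ direction
    return (raw - raw.min()) / (raw.max() - raw.min())  # rescale to [0, 1]


def score_neighbor_fraction(Z, labels, k=10):
    """'Sugar' option B: fraction of a cell's k nearest neighbors from the case sample."""
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a cell is not its own neighbor
    nearest = np.argsort(dists, axis=1)[:, :k]
    return labels[nearest].mean(axis=1)


def run_pipeline(X, labels, embed, score):
    """Compose any embedding step with any scoring step."""
    return score(embed(X), labels)


# Simulated data (an assumption): 50 control cells, 50 case cells,
# with only 10 case cells actually shifted.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
labels = np.repeat([0, 1], 50)
X[50:60, 0] += 4.0

# Try every combination of ingredients.
results = {
    (embed.__name__, score.__name__): run_pipeline(X, labels, embed, score)
    for embed in (embed_all_features, embed_top_variance)
    for score in (score_centroid_projection, score_neighbor_fraction)
}
for combo, scores in results.items():
    print(combo, "mean score of shifted cells:", round(scores[50:60].mean(), 2))
```

Because each step only has to accept and return arrays, adding a new ingredient means writing one more function rather than rewriting the recipe.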

2. The "Tuning" Knob (Hyperparameters)

When you drive a car, you have to adjust the seat, mirrors, and steering wheel to fit your body. If you don't, the car handles poorly.

  • In data science, these adjustments are called hyperparameters (like how many dimensions to look at, or how to group the cells).
  • found comes with an automatic adjuster. It has a special mode that tries out different "seat positions" on its own to find the one that gives the smoothest ride (the best results) for your specific data.
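A hyperparameter sweep of this kind can be sketched as a loop over a small grid of settings. The sketch below is an assumption-laden illustration, not found's tuning mode: the selection criterion found uses is not described in this summary, so the example benchmarks on simulated data where the truly affected cells are known. The function names, the grid, and the neighbor-based score are all hypothetical.

```python
from itertools import product

import numpy as np


def top_variance_features(X, n_features):
    """Knob 1: how many dimensions to look at."""
    keep = np.argsort(X.var(axis=0))[-n_features:]
    return X[:, keep]


def neighbor_fraction_score(Z, labels, n_neighbors):
    """Knob 2: how many neighbors to consult when scoring a cell."""
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a cell is not its own neighbor
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    return labels[nearest].mean(axis=1)


def auc(scores, truth):
    """Chance that a random affected cell outscores a random unaffected one."""
    pos, neg = scores[truth], scores[~truth]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties


# Simulated benchmark (an assumption): 150 control cells, 150 case cells,
# 30 of which respond, with signal in only 2 of 20 features.
rng = np.random.default_rng(2)
n = 150
X = rng.normal(size=(2 * n, 20))
labels = np.repeat([0, 1], n)
truth = np.zeros(2 * n, dtype=bool)
truth[n:n + 30] = True
X[truth, :2] += 2.5

grid = {"n_features": [2, 5, 10, 20], "n_neighbors": [5, 15]}
case = labels == 1
results = []
for n_features, n_neighbors in product(grid["n_features"], grid["n_neighbors"]):
    Z = top_variance_features(X, n_features)
    scores = neighbor_fraction_score(Z, labels, n_neighbors)
    # Evaluate only within the case sample, where the label noise lives.
    results.append(((n_features, n_neighbors), auc(scores[case], truth[case])))

best_setting, best_auc = max(results, key=lambda item: item[1])
print("best (n_features, n_neighbors):", best_setting, "AUC:", round(best_auc, 3))
```

On real data there is no ground truth to score against, so a practical tuning mode needs a label-free criterion or a simulation like this one to guide the choice.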

3. The Universal Translator (Python & R)

Scientists speak two main languages: Python and R. Often, a tool is built in one language, and scientists who speak the other have to struggle to use it, or worse, the tool behaves slightly differently in each version.

  • found is like a perfectly synchronized bilingual assistant. It was built in Python first, and then a bridge was built so the R version behaves exactly the same way. You can switch between the two without the tool changing its mind or breaking.

4. The "Truth Filter" (The Results)

The paper tested this toolkit on real data (like immune cells reacting to a virus).

  • Without found: It's like looking at a blurry photo of a crowd. You see a general blur of "sickness," but you can't tell who is actually sick.
  • With found: The toolkit sharpens the image. It filters out the healthy people who were just standing nearby. Suddenly, the "sick" people stand out clearly.
  • The Result: When scientists used this filtered list to find which genes were reacting to the virus, they found many more important clues (genes) than they did with the unfiltered labels. It's like searching for a needle in a haystack after first clearing away most of the hay.

Why Does This Matter?

In the past, if a drug only worked on 10% of the cells in a sample, scientists might have missed it entirely because the other 90% "drowned out" the signal.
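The arithmetic behind this "drowning out" is simple enough to check directly. The sketch below uses made-up numbers (1,000 cells per sample, a shift of 2.0 units in responding cells, no measurement noise) purely to show how averaging over a whole sample shrinks the apparent effect.

```python
import numpy as np

n_cells = 1000
fraction_affected = 0.10
effect = 2.0  # hypothetical expression shift in responding cells

control = np.zeros(n_cells)
treated = np.zeros(n_cells)
n_affected = round(n_cells * fraction_affected)
treated[:n_affected] = effect  # only 10% of treated cells respond

# Comparing whole samples averages the effect over all treated cells.
diluted_difference = treated.mean() - control.mean()

# Comparing only the responding cells recovers the full effect.
refined_difference = treated[:n_affected].mean() - control.mean()

print("whole-sample difference:", diluted_difference)
print("affected-only difference:", refined_difference)
```

With 10% of cells responding, the whole-sample comparison sees only 10% of the true shift (0.2 instead of 2.0). Once real measurement noise is added on top, a difference that small can easily go undetected.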

found changes the game by:

  1. Making it accessible: You don't need to be a math wizard to use it; the toolkit handles the heavy lifting.
  2. Making it robust: It helps scientists avoid "bad guesses" by testing different settings and showing them which one works best.
  3. Finding the hidden: It allows researchers to see subtle, weak, or rare biological effects that were previously invisible.

In a nutshell: found is a flexible, user-friendly, and powerful toolbox that helps scientists clean up the "static" in their data, allowing them to hear the quiet whispers of biological change that were previously lost in the noise.
