ESGI: Efficient splitting of generic indices in single-cell sequencing data

ESGI is a flexible, general-purpose framework that enables efficient demultiplexing and preprocessing of single-cell sequencing data with arbitrary and complex barcode architectures, overcoming the limitations of existing pipelines that are restricted to fixed designs and substitution-only error models.

Original authors: Stohn, T., van de Brug, N. D., Theodosiadou, A., Thijssen, B., Jastrzebski, K., Wessels, L. F. A., Bosdriesz, E.

Published 2026-03-06

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are running a massive, high-tech post office. Every day, millions of tiny letters (DNA/RNA sequences) arrive from thousands of different neighborhoods (cells). To sort these letters correctly, each one has a special barcode written on it. This barcode tells the post office exactly which house (cell) it belongs to and what kind of package it is (protein, RNA, etc.).

In the past, these barcodes were simple. They were always the same length, always in the same spot, and written perfectly. But recently, scientists have started inventing super-complex barcodes. Some are different lengths, some are in weird spots, and sometimes the handwriting gets a little messy (errors like missing letters or extra letters).

The old sorting machines (existing software) are rigid. They say, "If the barcode isn't exactly 10 letters long and in position 5, throw it away!" This means a lot of valuable data gets lost because the machine can't handle the messiness or the new designs.

Enter ESGI (Efficient Splitting of Generic Indices). Think of ESGI as a super-smart, flexible robot sorter that can handle any kind of barcode, no matter how messy or weird it is.

Here is how ESGI works, using simple analogies:

1. The "Shape-Shifting" Barcode Reader

Most old tools try to cut the barcode out of the letter using a cookie cutter. If the barcode is slightly longer or shorter than the cookie cutter, the cut is wrong, and the rest of the letter gets misaligned.

ESGI doesn't use a cookie cutter. Instead, it acts like a detective. It reads the letter from the beginning and says, "Okay, I see a barcode here. It should be 8 letters long, but it might be 7 because a letter got dropped, or 9 because an extra letter slipped in." It figures out exactly where the barcode ends and where the next part of the message begins, even if the spacing is off. In sequencing terms, these errors are insertions and deletions, which go beyond the simple letter swaps (substitutions) that older tools are limited to.

  • The Analogy: Imagine reading a sentence where someone accidentally deleted a word.
    • Old Tool: "This sentence is broken at word 5. I can't read the rest." (Gives up).
    • ESGI: "Ah, a word is missing here. I'll skip over the gap and continue reading the rest of the sentence correctly."
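For readers who like to see the idea in code, the "detective" behavior can be pictured as an alignment problem: compare each allowed barcode against the start of the read, permit a small number of edits, and report how many letters of the read the barcode actually used up. The sketch below is not ESGI's implementation. It assumes a plain edit-distance (Levenshtein) alignment, and the function names `align_prefix` and `match_barcode`, the `max_errors` parameter, and the toy sequences are all hypothetical.

```python
def align_prefix(barcode, read, max_errors=1):
    """Align `barcode` against the start of `read`, allowing substitutions,
    insertions and deletions. Returns (edit_distance, letters_consumed),
    or None if no alignment stays within `max_errors`."""
    n = len(barcode)
    window = read[: n + max_errors]       # the barcode can be at most this long in the read
    m = len(window)
    prev = list(range(m + 1))             # row for an empty barcode
    for i in range(1, n + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if barcode[i - 1] == window[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitution
                         prev[j] + 1,         # letter missing from the read (deletion)
                         cur[j - 1] + 1)      # extra letter in the read (insertion)
        prev = cur
    # Pick the read prefix with the fewest edits; on ties prefer the expected length.
    best_j = min(range(m + 1), key=lambda j: (prev[j], abs(j - n)))
    if prev[best_j] > max_errors:
        return None
    return prev[best_j], best_j


def match_barcode(read, whitelist, max_errors=1):
    """Return (barcode, edit_distance, letters_consumed) for the best hit, or None."""
    best = None
    for barcode in whitelist:
        hit = align_prefix(barcode, read, max_errors)
        if hit is not None and (best is None or hit[0] < best[1]):
            best = (barcode, hit[0], hit[1])
    return best


WHITELIST = ["ACGTAC", "TTGCAA"]          # toy barcodes, made up for this example

# The read lost the "G" of ACGTAC, so the barcode only occupies 5 letters.
read = "ACTACGGGG"
barcode, distance, consumed = match_barcode(read, WHITELIST)
rest_of_read = read[consumed:]            # "GGGG", still correctly aligned
```

The key detail is the second return value: because the matcher reports how many letters were consumed, whatever comes after the barcode is read from the right position even when a letter is missing or extra.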

2. The "Universal Translator" for New Experiments

Scientists are constantly inventing new ways to label cells. Some use a single barcode, some use three barcodes glued together, and some use barcodes that change length to help the machine read them better (like adding "staggered" steps).

Existing tools are like specialized translators that only speak one language. If a scientist invents a new language, they have to build a whole new translator from scratch.

ESGI is like a universal translator app. You just give it a "map" (a pattern file) that says, "First, look for a barcode here, then a fixed word, then another barcode of variable length." ESGI follows that map for any experiment. You don't need to build a new machine; you just update the map.
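To make the "map" idea concrete, here is a minimal sketch of a pattern-driven parser. ESGI's real pattern file syntax is not described in this summary, so everything here is an assumption for illustration: the `Element` class, its `kind` values, the `parse_read` function, and the toy sequences are hypothetical. To keep it short, this sketch only accepts exact matches and leaves out the error tolerance described in section 1.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str             # "barcode", "constant", or "umi" (hypothetical kinds)
    name: str
    options: tuple = ()   # allowed sequences for a barcode or constant
    length: int = 0       # number of letters for a free-form field such as a UMI


def parse_read(read, pattern):
    """Walk along `read` following `pattern`. Returns a dict of named fields
    plus the unparsed remainder, or None if any element does not fit."""
    pos = 0
    fields = {}
    for element in pattern:
        if element.kind == "umi":
            seq = read[pos:pos + element.length]
            if len(seq) < element.length:
                return None
        else:
            # Try longer options first so a short barcode cannot shadow a longer one.
            candidates = sorted(element.options, key=len, reverse=True)
            seq = next((o for o in candidates if read.startswith(o, pos)), None)
            if seq is None:
                return None
        fields[element.name] = seq
        pos += len(seq)       # the next element starts wherever this one ended
    fields["rest"] = read[pos:]
    return fields


# The "map": a variable-length barcode, a fixed word, a second barcode, then a UMI.
PATTERN = [
    Element("barcode", "stagger_bc", options=("ACG", "TACG", "GTACG")),
    Element("constant", "linker", options=("GATC",)),
    Element("barcode", "cell_bc", options=("AAAA", "CCCC")),
    Element("umi", "umi", length=4),
]

parsed = parse_read("TACG" + "GATC" + "CCCC" + "TGCA" + "GGGTTT", PATTERN)
```

Supporting a new experimental design in this sketch means editing `PATTERN`, not `parse_read`, which mirrors the point above: the map changes, the machine stays the same.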

3. The "Quality Control" Dashboard

When the old machines sort letters, they just give you the sorted pile. If they made a mistake, you don't know until it's too late.

ESGI comes with a detailed dashboard. It can tell you things like:

  • "Hey, 10% of the letters had a missing letter in the second barcode."
  • "This specific barcode seems to get messy more often than others."
  • "We found 15% more valid letters because we were flexible with the errors."

This is like a mechanic not just fixing your car, but giving you a report on why the engine was making noise, so you can prevent it next time.
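The kind of report described above boils down to counting outcomes per barcode position and per barcode. The sketch below shows one simple way to do that tally. It is an illustration only: the record format (position name, matched barcode, outcome), the outcome labels, and the `summarize` function are hypothetical, and the records are toy data invented for the example, not results from the preprint.

```python
from collections import Counter, defaultdict


def summarize(records):
    """Tally matching outcomes.

    `records` is an iterable of (position, barcode, outcome) tuples, where
    outcome is e.g. "exact", "substitution", "insertion", "deletion" or
    "unmatched", and barcode is None when nothing matched.
    Returns (fraction of each outcome per position, error rate per barcode)."""
    by_position = defaultdict(Counter)
    by_barcode = defaultdict(Counter)
    for position, barcode, outcome in records:
        by_position[position][outcome] += 1
        if barcode is not None:
            by_barcode[barcode][outcome] += 1

    fractions = {}
    for position, counts in by_position.items():
        total = sum(counts.values())
        fractions[position] = {k: v / total for k, v in counts.items()}

    # Share of reads assigned to each barcode that needed some correction.
    error_rate = {bc: 1 - c["exact"] / sum(c.values()) for bc, c in by_barcode.items()}
    return fractions, error_rate


# Toy records, invented purely to exercise the function.
records = [
    ("bc1", "AAAA", "exact"),
    ("bc2", "ACGT", "exact"),
    ("bc1", "AAAA", "exact"),
    ("bc2", "ACGT", "deletion"),
    ("bc1", "CCCC", "substitution"),
    ("bc2", "TTGG", "exact"),
    ("bc1", "CCCC", "exact"),
    ("bc2", None, "unmatched"),
]

fractions, error_rate = summarize(records)
```

In this toy data, `fractions["bc2"]["deletion"]` answers "how often did the second barcode lose a letter?", and `error_rate` shows which individual barcodes get messy more often than others.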

4. Why This Matters (The "Why Should I Care?")

Single-cell sequencing is like taking a census of a city, but instead of counting people, we are counting the molecules inside every single cell. This helps us understand diseases, how the brain works, and how proteins behave.

  • Without ESGI: We lose data because the sorting machine is too rigid. We miss rare cells or get confused by complex experiments.
  • With ESGI: We can use the newest, most complex experiments without waiting for a new software update. We get more accurate data, and we can fix the experiments faster because we know exactly where the errors are happening.

Summary

ESGI is a flexible, smart tool that sorts complex genetic data. Instead of forcing data to fit into a rigid box, it adapts to the data's shape. It handles messy barcodes, variable lengths, and new experimental designs, ensuring that scientists don't lose valuable information and can move faster in their discoveries. It turns a rigid, error-prone process into a fluid, intelligent one.
