EMITS: expectation-maximization abundance estimation for fungal ITS communities from long-read sequencing

EMITS is a Rust-based tool that uses an expectation-maximization algorithm to resolve ambiguous read mappings and improve species-level abundance estimates for fungal ITS communities sequenced with long reads. It outperforms naive counting methods in accuracy and error reduction.

O'Brien, A., Lagos, C., Fernandez, K., Ojeda, B., Parada, P.

Published 2026-04-02

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to work out who was in a crowded room from a set of blurry, overlapping footprints.

In the world of fungal biology, scientists use a specific DNA "barcode" called ITS (the internal transcribed spacer region) to identify different species of mushrooms and molds. Recently, scientists started using "long-read" sequencers (like Oxford Nanopore or PacBio), which are great at reading these barcodes in one go. However, there's a problem: many fungi are like cousins who look almost identical. Their DNA barcodes are so similar that a single read can match several species almost equally well, and the computer gets confused.

The Old Way: The "Best Guess" Mistake

Traditionally, when a computer sees a blurry footprint, it just picks the single best match it can find and says, "This is definitely Species A."

This is like a detective looking at a blurry footprint and saying, "That looks like John's, so it must be John," even though the print left by John's twin brother, Mike, looks 99% identical.

  • The Problem: If the computer is wrong, it miscounts the crowd. It might think there are 100 Johns when there are actually 50 Johns and 50 Mikes.
  • The Database Mess: Furthermore, the reference library (UNITE) has multiple entries for the same species (like having 20 different ID cards for John). The old method splits the count of "Johns" across all 20 cards, making it look like there are many different people when it's actually just one.
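
To make both problems concrete, here is a minimal Python sketch of "best guess" counting. Everything in it is an assumption for illustration: the reference names (john_ref1, john_ref2, mike_ref1), the scores, and the function best_hit_counts are invented, and this is not how EMITS or any particular aligner is implemented.

```python
from collections import Counter

# Hypothetical match scores: each read (footprint) matches several
# reference entries (ID cards). All names and numbers are invented.
reads = [
    {"john_ref1": 0.98, "mike_ref1": 0.97},
    {"john_ref2": 0.96, "mike_ref1": 0.95},
    {"john_ref1": 0.90, "mike_ref1": 0.99},
    {"john_ref2": 0.97, "mike_ref1": 0.96},
]

def best_hit_counts(reads):
    """Give each read entirely to its single highest-scoring reference."""
    counts = Counter()
    for hits in reads:
        best = max(hits, key=hits.get)  # winner takes all, even on a near-tie
        counts[best] += 1
    return counts

counts = best_hit_counts(reads)
for ref, n in sorted(counts.items()):
    print(ref, n)
```

Three of the four reads are near-ties, yet each one is handed wholesale to whichever entry scored a hair higher, and John's reads end up split across two separate ID cards.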

The New Solution: EMITS (The "Group Think" Detective)

The authors of this paper built a new tool called EMITS. Instead of making a snap judgment, EMITS uses a smart, iterative process called Expectation-Maximization (EM).

Think of EMITS as a detective who doesn't just pick one suspect. Instead, they:

  1. Make a guess: "Okay, let's assume 50% of these footprints belong to John and 50% to Mike."
  2. Check the evidence: They look at the footprints again. "If it's 50/50, does this blurry print look more like John or Mike?"
  3. Adjust the guess: "Actually, this print looks slightly more like John, so let's shift the odds to 60% John."
  4. Repeat: They do this over and over again, refining the numbers until the math settles on the most likely truth.

This allows the tool to say, "We can't be 100% sure on this specific footprint, but statistically, it's more likely to be John than Mike," and it distributes the "credit" fairly between the two.
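
The loop above can be sketched in a few lines of Python. This is a generic textbook-style EM for mixture proportions, written under stated assumptions, not the EMITS source code: the function name em_abundance, the likelihood table, and the stopping settings are all hypothetical, and every read is assumed to match at least one species.

```python
def em_abundance(likelihoods, n_iter=200, tol=1e-9):
    """Estimate species proportions from ambiguous reads.

    likelihoods[r][s] is an assumed probability of seeing read r if it
    came from species s. Returns a list of proportions that sums to 1.
    """
    n_reads = len(likelihoods)
    n_species = len(likelihoods[0])
    theta = [1.0 / n_species] * n_species  # step 1: start with an even guess
    for _ in range(n_iter):
        totals = [0.0] * n_species
        for row in likelihoods:
            # Step 2 (E-step): share this read between species in
            # proportion to current abundance times how well it matches.
            weights = [t * l for t, l in zip(theta, row)]
            norm = sum(weights)
            for s, w in enumerate(weights):
                totals[s] += w / norm
        # Step 3 (M-step): the new guess is the average share per species.
        new_theta = [t / n_reads for t in totals]
        change = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if change < tol:  # step 4: stop once the numbers settle
            break
    return theta

# Columns: [John, Mike]. Two clear John prints, one clear Mike print,
# and three prints that match both equally well.
likelihoods = [
    [1.0, 0.0], [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5], [0.5, 0.5], [0.5, 0.5],
]
theta = em_abundance(likelihoods)
print([round(t, 3) for t in theta])
```

In this toy case the three ambiguous prints are not forced onto one suspect. They end up shared about 2:1 in John's favour, following the ratio of the clear prints, so the estimate settles near two-thirds John and one-third Mike.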

How They Tested It

The team tested EMITS in three ways:

  1. The Simulation: They created a fake digital crowd where they knew exactly who was there. They added "noise" (blur) to the data. The old method got confused and made large mistakes, while EMITS stayed accurate, reducing errors by up to 92% compared with the old method.
  2. The "Mock" Community: They took a real jar of 10 known fungi, sequenced it, and let the tools guess. EMITS correctly identified the specific species in tricky groups (like Trichophyton and Penicillium), whereas the old method mixed them up.
  3. The Synthetic Crowd: They created a complex mix of 21 species. EMITS not only found the right species but also stopped the computer from inventing "ghost" species that weren't there (reducing false alarms by 54%).

Why This Matters

  • It's Fast: The tool is written in Rust, a programming language known for being incredibly fast and efficient.
  • It's Smart: It knows that different sequencers blur the evidence in different ways, so it has "presets" to adjust its detective work accordingly.
  • It Cleans Up the Mess: It automatically combines the multiple ID cards for the same species so you get a clear count of the actual species, not just the database entries.
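
The "combine the ID cards" step can be illustrated with a short Python sketch. It assumes a lookup table from reference entry to species already exists. The entry names, the mapping, and the function collapse_to_species are hypothetical and do not reflect the real UNITE format or the EMITS internals.

```python
from collections import defaultdict

def collapse_to_species(entry_abundance, entry_to_species):
    """Add up the abundance of every reference entry that shares a species."""
    totals = defaultdict(float)
    for entry, abundance in entry_abundance.items():
        totals[entry_to_species[entry]] += abundance
    return dict(totals)

# Hypothetical per-entry estimates: Species A has two ID cards.
entry_abundance = {"ref_A1": 0.25, "ref_A2": 0.25, "ref_B1": 0.5}
entry_to_species = {
    "ref_A1": "Species A",
    "ref_A2": "Species A",
    "ref_B1": "Species B",
}

species_abundance = collapse_to_species(entry_abundance, entry_to_species)
print(species_abundance)
```

Counted per entry, Species A looks like two minor members of the crowd. Counted per species, it is just as common as Species B.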

The Bottom Line

EMITS is a new, super-smart calculator for fungal DNA. It stops us from misidentifying look-alike fungi and stops us from double-counting the same species. By using a "group think" approach instead of a "best guess" approach, it gives scientists a much clearer, more accurate picture of the fungal world, which is crucial for medicine, agriculture, and ecology.

It works best when paired with another tool called ITSxRust, creating a complete, high-speed pipeline for analyzing fungal communities from long-read sequencing.
