LRSomatic: a highly scalable and robust pipeline for somatic variant calling in long-read sequencing data

LRSomatic is a highly scalable, Nextflow-based pipeline for comprehensive somatic variant calling (SNVs, indels, structural variants, and copy number changes) and epigenetic integration from long-read PacBio HiFi and ONT sequencing data. The authors report state-of-the-art performance in both benchmarking and clinical case studies.

Original authors: Forsyth, R. A., Harbers, L., Verhasselt, A., Iraizos, A.-L. R., Yang, S., Vande Velde, J., Davies, C., Pillay, N., Lambrechts, L., Demeulemeester, J.

Published 2026-02-28

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine your DNA is a massive, ancient library containing the instructions for building and running a human being. For decades, scientists have been trying to read this library, but they've been using a very specific, limited tool: short-read sequencing.

Think of short-read sequencing like trying to read a book by tearing out individual words and sentences, then trying to glue them back together. It works great for simple stories, but if the book has complex illustrations, repeated paragraphs, or pages stuck together (which happens often in cancer), this method gets confused. You might miss the big picture or misinterpret how the pages are connected.

Enter Long-Read Sequencing. This is like having a high-resolution camera that can photograph entire pages or even whole chapters in a single shot. It can see the complex twists, turns, and "stuck pages" that the old method missed.

However, there was a problem: We had the camera, but we didn't have the instruction manual on how to develop the photos. While there were tools to read the "normal" parts of the library, there was no reliable, all-in-one software to find the "typos" and "missing pages" that cause cancer (somatic variants) using these new, long photos.

Enter: LRSomatic (The New Librarian)

The paper introduces LRSomatic, a new software pipeline created by a team of scientists. You can think of LRSomatic as a super-smart, highly organized librarian designed specifically to work with these new, long-read photos.

Here is how LRSomatic works, broken down into simple concepts:

1. The "Universal Translator" (Platform Agnostic)

Whether you took the photos with a PacBio camera or an Oxford Nanopore camera, LRSomatic doesn't care. It speaks both languages fluently. It takes the raw data from either machine and organizes it into a format that scientists can actually use.

2. The "Detective Duo" (Tumor vs. Normal)

To find cancer mutations, you usually need two things: a picture of the patient's healthy cells (the "Normal") and a picture of their cancer cells (the "Tumor").

  • The Comparison: LRSomatic acts like a detective comparing two crime scene photos. It looks at the "Normal" photo to see what the original blueprint looked like, then scans the "Tumor" photo to find exactly what changed.
  • The Solo Detective: Sometimes, you only have the tumor photo (no healthy sample). LRSomatic has a special mode for this too, using advanced AI to guess what the "Normal" looked like and filter out the background noise, though it's slightly less precise than having both photos.
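
For readers who like to see the idea in code, the paired comparison can be pictured as a simple subtraction: anything seen in the tumor that is also in the normal is inherited, and what remains is a somatic candidate. The sketch below is only a toy illustration under that assumption. The `Variant` type, the `somatic_candidates` function, and the example calls are all hypothetical, and real somatic callers weigh read-level evidence statistically rather than comparing finished lists.

```python
from __future__ import annotations

from typing import NamedTuple


class Variant(NamedTuple):
    """A hypothetical, minimal variant record (not LRSomatic's data model)."""
    chrom: str
    pos: int  # 1-based position on the chromosome
    ref: str  # reference allele
    alt: str  # alternate allele


def somatic_candidates(tumor: set[Variant], normal: set[Variant]) -> list[Variant]:
    """Return variants seen in the tumor but absent from the matched normal."""
    return sorted(tumor - normal)


# Made-up example calls for illustration only.
tumor_calls = {
    Variant("chr1", 1000, "A", "G"),  # also present in the normal: inherited
    Variant("chr2", 5000, "C", "T"),  # tumor only: somatic candidate
}
normal_calls = {Variant("chr1", 1000, "A", "G")}

candidates = somatic_candidates(tumor_calls, normal_calls)
print(candidates)
```

In tumor-only mode there is no `normal_calls` set to subtract, which is why that mode has to estimate the inherited background some other way and ends up slightly less precise.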

3. The "All-Seeing Eye" (Finding Everything)

Short-read tools are good at finding single-letter typos (SNVs) or small missing or added words (indels). But cancer often involves massive structural changes, like entire paragraphs being deleted, swapped, or duplicated.

  • LRSomatic is built to find everything: tiny typos, missing words, and massive structural rearrangements. It can even spot fusion genes (where two different chapters get glued together incorrectly), which are common drivers of cancer.
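
The difference between these categories is largely one of size. As a rough sketch, a variant can be sorted into a class by comparing the lengths of the original and changed sequence. The 50-base cutoff used here is a common convention in the field and an assumption on our part, not a threshold taken from the paper, and the function name `classify_variant` is hypothetical. Rearrangements such as gene fusions join distant regions and cannot be described by a simple length comparison, so they are outside this toy example.

```python
def classify_variant(ref: str, alt: str, sv_min_length: int = 50) -> str:
    """Classify a sequence-resolved variant by the size of the change.

    sv_min_length is an assumed, conventional cutoff (50 bases) separating
    small indels from structural variants.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"  # a single-letter typo
    size = abs(len(ref) - len(alt))
    if size >= sv_min_length:
        return "structural variant"  # a paragraph-sized insertion or deletion
    if size > 0:
        return "indel"  # a few letters inserted or deleted
    return "multi-nucleotide substitution"  # several letters swapped, same length


print(classify_variant("A", "G"))             # SNV
print(classify_variant("ATG", "A"))           # indel
print(classify_variant("A", "A" + "T" * 60))  # structural variant
```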

4. The "Invisible Ink" Reader (Epigenetics)

This is the coolest part. Long-read sequencing doesn't just read the letters (A, C, T, G); it can also see "invisible ink" on the page.

  • Imagine the DNA library has sticky notes or highlighters attached to certain pages. These tell the cell which parts of the book are "open for business" (active) and which are "closed" (silent).
  • LRSomatic can read these Fiber-seq signals. It can tell you not just what the mutation is, but how it changes the way the cell reads the book. For example, in a specific sarcoma case study, LRSomatic showed that the cancer cells were reading the "wrong" copy of a gene because the "good" copy was covered in sticky notes (methylation) and silenced.

The Real-World Test: The Clear Cell Sarcoma Case

To prove it works, the team tested LRSomatic on a real patient with a rare cancer called Clear Cell Sarcoma.

  • The Gold Standard: They first ran the patient's data through the best existing tool for short-read sequencing (Oncoanalyser). It found the main culprit: a fusion between two genes (EWSR1 and ATF1).
  • The LRSomatic Result: LRSomatic found that same fusion, plus all the other major changes, using the long-read data.
  • The Bonus: Because it used the "invisible ink" reader, it also figured out which of the two inherited copies of the gene (one from each parent) was active, giving doctors a much deeper understanding of the tumor's behavior.

Why Does This Matter?

Before LRSomatic, if you wanted to use these powerful new long-read cameras to study cancer, you had to build your own software from scratch, which was hard, slow, and prone to errors.

LRSomatic is the "plug-and-play" solution. It's free, open-source, and designed to be used by any lab, anywhere in the world. It ensures that when scientists use long-read sequencing, they get a complete, accurate picture of the cancer, not just a blurry snapshot.

In short: LRSomatic turns the powerful new technology of long-read sequencing into a reliable, everyday tool for studying cancer genomes.
