Facilitating genome annotation using ANNEXA and long-read RNA sequencing

This study introduces an updated, open-source ANNEXA pipeline that leverages long-read RNA sequencing data, integrates multiple reconstruction tools and deep learning models, and provides rigorous quality control to enhance genome annotation and identify novel coding and non-coding transcripts across species.

Original authors: Hoffmann, N., Besson, A., Cadieu, E., Lorthiois, M., Le Bars, V., Houel, A., Hitte, C., Andre, C., Hedan, B., Derrien, T.

Published 2026-03-13

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine a genome (an organism's DNA) as a massive, ancient library containing the instructions for building and running a human or a dog. For a long time, this library was a bit messy: some books were missing pages, some chapters were out of order, and some stories had never been catalogued at all.

Genome annotation is the process of organizing this library, labeling every book, and writing summaries so scientists can actually understand what the instructions mean.

Recently, scientists got a new, super-powerful tool called Long-Read RNA Sequencing. Think of this as a high-speed camera that can photograph entire books from cover to cover in one shot, rather than taking blurry snapshots of just a few sentences. This helps us see the full stories (transcripts) much better.

However, there's a catch: even with this amazing camera, the photos can still be a bit blurry, cut off at the edges, or sometimes the camera accidentally invents a story that doesn't exist. We need a way to check the photos and make sure the library catalog is accurate.

Enter ANNEXA: The Super Librarian

This paper introduces an updated version of ANNEXA, an open-source computer program (a pipeline) designed to be the ultimate "Super Librarian" for this new type of data. Here is how it works, using some everyday analogies:

1. The Double-Check System (Two Tools in One)

ANNEXA doesn't just trust one method. It uses two different "detectives" to read the library: StringTie2 and Bambu.

  • StringTie2 is like a creative writer who looks at the raw photos and tries to piece together every possible story, even the weird ones. It finds a lot of new stories but might sometimes invent a few fake ones.
  • Bambu is like a strict editor who only writes down stories that fit perfectly with the known rules of the library. It finds fewer new stories, but the ones it finds are very likely to be real.
  • ANNEXA runs both detectives, compares their notes, and combines the best parts of both lists.
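The "compare their notes" step can be pictured with a small sketch. This is not ANNEXA's actual code: it assumes each tool's output has already been parsed into simple records, and it treats two stories as the same when they share the same splice junctions (their "intron chain"), even if the ends differ slightly. The `Transcript` type, the function names, and the example IDs and coordinates are all made up for illustration.

```python
from typing import NamedTuple


class Transcript(NamedTuple):
    tx_id: str
    chrom: str
    strand: str
    exons: tuple  # sorted (start, end) pairs; coordinates are invented


def intron_chain(tx):
    """Identify a transcript by its splice junctions, ignoring its outer ends."""
    introns = tuple(
        (left[1], right[0]) for left, right in zip(tx.exons, tx.exons[1:])
    )
    return (tx.chrom, tx.strand, introns)


def compare_calls(calls_a, calls_b):
    """Split multi-exon transcripts into shared and tool-specific groups."""
    # Single-exon transcripts have no junctions, so this sketch skips them.
    keys_a = {intron_chain(t): t.tx_id for t in calls_a if len(t.exons) > 1}
    keys_b = {intron_chain(t): t.tx_id for t in calls_b if len(t.exons) > 1}
    shared = keys_a.keys() & keys_b.keys()
    return {
        "both": sorted(keys_a[k] for k in shared),
        "only_a": sorted(keys_a[k] for k in keys_a.keys() - shared),
        "only_b": sorted(keys_b[k] for k in keys_b.keys() - shared),
    }


# Hypothetical calls from the two tools.
stringtie_calls = [
    Transcript("STRG.1.1", "chr1", "+", ((100, 200), (300, 400))),
    Transcript("STRG.2.1", "chr1", "+", ((1000, 1100), (1500, 1600), (2000, 2100))),
]
bambu_calls = [
    # Same junctions as STRG.1.1, but slightly different start and end.
    Transcript("BambuTx1", "chr1", "+", ((90, 200), (300, 420))),
]

result = compare_calls(stringtie_calls, bambu_calls)
print(result)
```

Matching on junctions rather than exact coordinates is a design choice for this sketch: the ends of long reads are often ragged, so exact matching would miss stories that both tools really agree on.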

2. The "Start-Button" Test (Quality Control)

One common problem with these photos is that they often get cut off at the beginning (the start of the story).

  • ANNEXA uses a special Deep Learning AI (a "smart robot") to look at the very first few words of every new story.
  • If the robot sees that the story starts in a weird, unnatural place (like a sentence starting in the middle of a word), it flags it as a "cut-off photo" and throws it out. This ensures that only complete, full-length stories make it into the final catalog.
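As a rough sketch of the flag-and-discard idea, suppose the "robot" has already given every new story a score between 0 and 1 for how convincing its starting point looks. The scores, the transcript IDs, and the 0.5 cutoff below are assumptions for illustration only; the article does not say how ANNEXA scores or thresholds transcripts.

```python
def filter_by_start_score(scores, threshold=0.5):
    """Split transcripts into kept and flagged lists by their start-site score.

    `scores` maps a transcript ID to a hypothetical model score in [0, 1].
    """
    kept, flagged = [], []
    for tx_id, score in scores.items():
        if score >= threshold:
            kept.append(tx_id)
        else:
            flagged.append(tx_id)  # likely cut off at the beginning
    return kept, flagged


# Invented scores standing in for the deep learning model's output.
start_scores = {"tx_new_1": 0.92, "tx_new_2": 0.13, "tx_new_3": 0.61}

kept, flagged = filter_by_start_score(start_scores)
print("kept:", kept)
print("flagged:", flagged)
```

Raising the threshold would make the librarian stricter: fewer cut-off stories slip through, but more genuine ones risk being thrown out.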

3. The "Story Type" Sorter (Coding vs. Non-Coding)

Not all stories in the library are the same. Some are instruction manuals for building proteins (the "machines" of the cell), and others are "long non-coding RNAs" (lncRNAs), which are more like the library's management notes or regulatory signs.

  • ANNEXA has a special sorter (called FEELnc) that looks at every new story and asks: "Is this a machine manual or a management note?"
  • This is crucial because the "management notes" are harder to spot than machine manuals, so they are easy to miss. ANNEXA is designed to find and label them correctly.
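To give a feel for the kind of clue a sorter can use, here is a deliberately simple stand-in. It is not FEELnc's method. It only measures how much of a sequence is taken up by its longest open reading frame (a stretch that starts with ATG and runs to a stop codon), on the assumption that machine manuals tend to have long ones. The function names, the 0.5 cutoff, and the toy sequences are all invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_length(seq):
    """Length in bases of the longest ATG-to-stop ORF on the forward strand."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    # ORFs that never reach a stop codon are ignored in this sketch.
    return best


def orf_coverage(seq):
    """Fraction of the sequence covered by its longest ORF."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0


def classify(seq, min_coverage=0.5):
    """Toy rule: a high ORF coverage suggests a protein-coding transcript."""
    return "coding-like" if orf_coverage(seq) >= min_coverage else "noncoding-like"


manual = "CCATGAAATTTGGGTAACC"  # contains ATG AAA TTT GGG TAA
note = "CCTTGGAACCTTGGAACC"     # no start codon at all

print(classify(manual), classify(note))
```

A real sorter combines several clues and learns its cutoffs from known examples; a single hand-picked threshold like this one would misfile many real transcripts.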

4. The "Dog vs. Human" Connection (Comparative Oncology)

To prove this new librarian works, the authors tested it on a real-world mystery: Cancer.

  • They looked at cancer cells from 8 dogs and 2 humans.
  • Dogs and humans are like cousins; they often get the same types of cancer (like melanoma and bone cancer), but their "libraries" are written in slightly different languages.
  • ANNEXA found brand new stories (genes) in both species. Even better, it found 5 specific stories that were new in dogs and new in humans, and they were located in the same spot in the library.
  • This is a promising result. It suggests these new stories might be important for understanding cancer in both species, although that still needs to be tested.
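One way to picture "the same spot in the library" is to ask whether a new story sits between the same two known neighbors in both species. The article does not say how the authors matched locations, so the sketch below is only an assumed approach, and every gene name in it is a placeholder.

```python
def matching_neighborhoods(dog_novel, human_novel):
    """Pair novel genes that are flanked by the same known genes in both species.

    Each input maps a novel gene ID to its (left, right) known neighbors.
    Neighbor order is ignored, since a region may be flipped between species.
    """
    by_flanks = {}
    for human_id, flanks in human_novel.items():
        by_flanks.setdefault(frozenset(flanks), []).append(human_id)

    pairs = []
    for dog_id, flanks in dog_novel.items():
        for human_id in by_flanks.get(frozenset(flanks), []):
            pairs.append((dog_id, human_id))
    return sorted(pairs)


# Placeholder IDs and neighbors, not real genes.
dog_novel = {
    "dog_new_1": ("GENE_A", "GENE_B"),
    "dog_new_2": ("GENE_C", "GENE_D"),
}
human_novel = {
    "human_new_1": ("GENE_B", "GENE_A"),  # same neighbors, opposite order
    "human_new_2": ("GENE_E", "GENE_F"),
}

pairs = matching_neighborhoods(dog_novel, human_novel)
print(pairs)
```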

Why Does This Matter?

Without a pipeline like ANNEXA, scientists would have to check thousands of these "photos" by hand to see if they were real, which is slow and prone to error.

ANNEXA is like an automated quality-control factory:

  1. It takes the raw photos.
  2. It runs them through two different assembly lines.
  3. It uses a smart robot to flag and discard cut-off photos.
  4. It sorts the stories into the right bins.
  5. It prints a final, clean report with a visual map so scientists can see exactly what changed.
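The factory idea, where each station takes a list of stories and hands a cleaner list to the next, can be sketched as a chain of small functions followed by a tally for the report. The stage names, record fields, and values are hypothetical, and the real pipeline works on sequencing and annotation files rather than Python dictionaries.

```python
from collections import Counter


def drop_duplicates(records):
    """Keep the first record seen for each transcript structure."""
    seen, unique = set(), []
    for rec in records:
        if rec["structure"] not in seen:
            seen.add(rec["structure"])
            unique.append(rec)
    return unique


def drop_truncated(records, min_score=0.5):
    """Remove records whose start-site score falls below an assumed cutoff."""
    return [rec for rec in records if rec["start_score"] >= min_score]


def run_stages(records, stages):
    """Pass the records through each stage in order."""
    for stage in stages:
        records = stage(records)
    return records


def summarize(records):
    """Count the surviving records per category for the final report."""
    return Counter(rec["biotype"] for rec in records)


# Invented records standing in for assembled transcripts.
records = [
    {"tx_id": "tx1", "structure": "A", "start_score": 0.9, "biotype": "protein_coding"},
    {"tx_id": "tx2", "structure": "A", "start_score": 0.8, "biotype": "protein_coding"},
    {"tx_id": "tx3", "structure": "B", "start_score": 0.2, "biotype": "lncRNA"},
    {"tx_id": "tx4", "structure": "C", "start_score": 0.7, "biotype": "lncRNA"},
]

final = run_stages(records, [drop_duplicates, drop_truncated])
report = summarize(final)
print(report)
```

Keeping each station as its own function makes it easy to reorder, swap, or skip a step without touching the others.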

The Bottom Line

This paper gives scientists a free, open-source tool (ANNEXA) that aims to make the messy job of organizing the genome's library faster and more accurate. By making sure the right "books" are on the shelf, researchers can better study how diseases like cancer work in both humans and our furry friends, which could eventually help lead to better treatments.
