New genetic codes in bacteria and archaea identified with a fast k-mer based algorithm

The paper introduces a fast k-mer based algorithm that accelerates genetic code inference by over 100-fold, enabling the analysis of thousands of bacterial and archaeal genomes to identify new code variations, including the first documented sense codon reassignment in archaea.

Original authors: Melnykov, A. V.

Published 2026-04-06

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine that every living thing on Earth, from the bacteria in your gut to the archaea in deep-sea vents, speaks the same language, but a few of them speak it in their own dialect. This language is the genetic code.

Usually, we think of this code as a universal dictionary where three letters (a "codon") always mean the same thing. For example, the code "ACA" usually means "Threonine" (an amino acid building block). But, just like human languages have regional slang or typos that become permanent, some bacteria and archaea have rewritten their dictionaries. They might decide that "ACA" actually means "Aspartate" instead.
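To make the "dictionary" idea concrete, here is a minimal Python sketch. It builds the standard genetic code from its conventional compact listing and then translates the same short DNA string twice: once with the standard dictionary and once with a variant in which "ACA" means Aspartate. The example gene and the helper name translate are made up for illustration; they do not come from the paper.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, table):
    """Translate an in-frame DNA string codon by codon with the given table."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


# Hypothetical variant dictionary: identical to the standard one except ACA.
variant = {**STANDARD, "ACA": "D"}

gene = "ATGACAGGTACATAA"  # made-up example sequence
print(translate(gene, STANDARD))  # MTGT*
print(translate(gene, variant))   # MDGD*
```

One changed entry in the dictionary changes every protein position where that codon appears, which is why picking the right dictionary matters.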

The problem is that we have discovered millions of new bacterial and archaeal species just by sequencing their DNA from the environment (like soil or water), but we've never grown them in a lab. We have their "books" (genomes), but we don't know which "dictionary" they are using to read them.

The Old Way: The Slow, Exhaustive Translator

Previously, scientists used a tool called Codetta to figure out these dictionaries. Think of Codetta as a very thorough, highly educated translator who reads every single sentence of a book, compares it to a master library of known texts, and checks every word for consistency.

  • The Good: It's incredibly accurate.
  • The Bad: It's painfully slow. Checking millions of books this way would tie up a large computing cluster for a long time, so it can't keep up with the explosion of new data we have today.

The New Way: KACI (The Speedy Pattern Matcher)

The author, Artem Melnykov, invented a new tool called KACI (K-mer Assisted Code Inference).

The Analogy:
Imagine you are trying to guess the language of a stranger by looking at a few pages of their book.

  • Codetta reads the whole book, analyzes the grammar, and compares every sentence structure to known languages.
  • KACI is like a detective who only looks for specific, famous fingerprints (short patterns of letters) that are unique to certain languages.

Instead of reading the whole book, KACI scans the text for short, recognizable patterns (called k-mers) and checks them against a giant reference card deck of these patterns. If a codon that normally spells "Threonine" keeps turning up in spots where the reference patterns expect a different amino acid, KACI quickly realizes, "Aha! This book uses a different dictionary!"
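The pattern-matching idea can be sketched as a toy voting scheme. This is not KACI's actual algorithm, whose details are not given here. It is a minimal illustration that assumes a tiny reference protein, a window of k = 4 codons, and a simple rule: hide one position in the window, translate the rest with the standard code, and let any matching reference pattern vote on what the hidden codon means. All names and sequences below are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def build_masked_index(reference_proteins, k):
    """Map each k-mer with one hidden position to the amino acids seen there."""
    index = defaultdict(Counter)
    for protein in reference_proteins:
        for start in range(len(protein) - k + 1):
            kmer = protein[start:start + k]
            for j in range(k):
                index[kmer[:j] + "_" + kmer[j + 1:]][kmer[j]] += 1
    return index


def collect_votes(codons, index, k):
    """For every codon, tally the amino acids that matching patterns expect."""
    votes = defaultdict(Counter)
    prelim = [STANDARD.get(c, "X") for c in codons]  # first-pass translation
    for start in range(len(codons) - k + 1):
        window = prelim[start:start + k]
        for j in range(k):
            masked = "".join(window[:j]) + "_" + "".join(window[j + 1:])
            if masked in index:
                votes[codons[start + j]].update(index[masked])
    return votes


# Hypothetical reference protein and a genome that spells Aspartate (D) as ACA.
reference = ["MKDLGAWDEL"]
codons = ["ATG", "AAA", "ACA", "CTG", "GGT", "GCT", "TGG", "ACA", "GAA", "CTG"]

votes = collect_votes(codons, build_masked_index(reference, k=4), k=4)
best_guess, _ = votes["ACA"].most_common(1)[0]
print("standard meaning of ACA:", STANDARD["ACA"])
print("meaning suggested by the votes:", best_guess)
```

In this toy input every vote cast for "ACA" points to Aspartate (D) rather than the standard Threonine (T). A real tool would need far more data and a statistical threshold before calling a reassignment.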

The Result:
KACI is 144 times faster than Codetta. It's like swapping a slow, manual typewriter for a high-speed laser printer. At that ratio, a job that would take Codetta a full day finishes in about ten minutes, which puts very large genome collections within practical reach.

What Did They Find?

Using this super-fast tool, the author scanned about 2.7 million bacterial and archaeal genomes and found some exciting new "dialects":

  1. Bacteria (The "ACA" Switch): In a group of bacteria found in soil and mines, the code "ACA" (usually Threonine) was found to actually mean Aspartate. It's like a group of people suddenly deciding that the word "Apple" now means "Banana."
  2. Bacteria (The "CGG" Switch): In bacteria found in human and pig guts, the code "CGG" (usually Arginine) seems to mean Alanine.
  3. Archaea (The Big Discovery): In some archaea from deep-sea vents, the code "CGG" (usually Arginine) was found to mean Tryptophan. This is a huge deal because it's the first documented case of a "sense codon" (a word that stands for an amino acid rather than a stop signal) being reassigned in the domain of Archaea.
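The three findings above can be written down as a small lookup table, and a few lines of Python can then estimate how many positions in a gene would be misread under the standard code. This is only an illustrative sketch: the group labels paraphrase the list above, and the function name and example sequence are made up.

```python
# Reassignments as described in the list above:
# group label -> (codon, standard meaning, reported meaning)
REPORTED = {
    "bacteria, soil and mines": ("ACA", "Thr", "Asp"),
    "bacteria, human and pig guts": ("CGG", "Arg", "Ala"),
    "archaea, deep-sea vents": ("CGG", "Arg", "Trp"),
}


def count_in_frame(cds, codon):
    """Count how often `codon` occurs in reading frame 0 of a coding sequence."""
    return sum(1 for i in range(0, len(cds) - 2, 3) if cds[i:i + 3] == codon)


cds = "ATGCGGACAGGTCGGTAA"  # made-up coding sequence

for group, (codon, standard_aa, reported_aa) in REPORTED.items():
    n = count_in_frame(cds, codon)
    print(f"{group}: {n} position(s) read as {standard_aa} "
          f"by the standard code but reported as {reported_aa}")
```

Only in-frame occurrences are counted, because a codon that appears across the boundary between two neighboring codons is never read by the ribosome as that codon.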

Why Does This Matter?

If you are trying to translate a genome into a protein list (which scientists do to understand how an organism works), and you use the wrong dictionary, you will get gibberish. You might think a protein is one thing when it's actually something else entirely.

By finding these new "dialects," scientists can:

  • Fix the translation errors: Make sure our protein databases are accurate.
  • Understand evolution: See how life changes its rules over time.
  • Speed up discovery: Now that we have KACI, checking the genetic code of a newly sequenced species becomes a quick, routine step rather than a long, compute-heavy job.

The Bottom Line

The author built a "speed dial" for decoding life's language. While the old method was like reading a dictionary cover-to-cover, the new method is like using a smart search engine to find the specific words that tell you which dictionary is being used. It's faster, almost as accurate, and it's already helping us discover that life is even more creative with its rules than we thought.
