CAPHEINE, or everything and the kitchen sink: a workflow for automating selection analyses using HyPhy

CAPHEINE is a portable, open-source computational workflow that automates the end-to-end process of evolutionary selection analysis using HyPhy, transforming raw unaligned pathogen sequences and a reference genome into comprehensive results for site-level, gene-level, and lineage-specific selection studies.

Original authors: Verdonk, H. E., Callan, D., Kosakovsky Pond, S. L.

Published 2026-03-05

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

Imagine you are a detective trying to solve a mystery: How do viruses change, survive, and sometimes jump from one host to another?

Viruses are like master shapeshifters. They mutate rapidly, creating thousands of slightly different versions of themselves. Some of these changes help them hide from our immune systems, while others might make them better at infecting new animals (like jumping from wild birds to cattle).

The problem is that scientists have so much data, millions of viral genetic sequences, that finding the "smoking gun" (the specific mutations that matter) is like searching for a needle in a haystack while blindfolded.

Enter CAPHEINE.

What is CAPHEINE?

Think of CAPHEINE as an automated, high-tech detective squad for viral evolution. It's a computer program (a workflow) built by researchers at Temple University that does all the heavy lifting for you.

Instead of a scientist spending weeks writing custom code to clean data, align sequences, and run complex statistical tests, they can just feed CAPHEINE two things:

  1. The Suspects: A pile of unorganized viral genetic sequences (the "unaligned" data).
  2. The Blueprint: A reference genome (a standard version of the virus that everything else is compared against).

CAPHEINE then takes over, running a full investigation from start to finish.

How Does It Work? (The Kitchen Sink Analogy)

The authors jokingly call this workflow "everything and the kitchen sink." Here is what happens inside the machine, translated into everyday terms:

  1. The Cleanup Crew: First, CAPHEINE sweeps the floor. It removes messy data (like sequences with too many gaps or errors) and aligns all the viral sequences so they line up perfectly, like soldiers in a parade.
  2. The Family Tree Builder: It builds a family tree (phylogeny) to show how all these viral strains are related to each other.
  3. The Stress Test (Selection Analysis): This is the core magic. CAPHEINE runs six different "stress tests" (statistical methods from the HyPhy software package). The five described here each answer a specific question:
    • FEL & MEME: "Are there specific spots on the virus that are changing faster than expected?" (If so, the virus may be evolving to escape our defenses.)
    • BUSTED: "Is the whole gene under pressure to change?"
    • RELAX & Contrast-FEL: "Is the virus evolving differently in one group compared to another?" (e.g., Is the virus changing faster in cattle than in wild birds?)
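To make the idea behind these tests a little more concrete, here is a minimal Python sketch of the distinction they build on: a mutation is "synonymous" if it leaves the amino acid unchanged and "nonsynonymous" if it alters it. This is only a naive count over two toy sequences, not what HyPhy does (its methods fit statistical models along the family tree). The function name and the sequences are made up for illustration.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(reference: str, sample: str) -> tuple[int, int]:
    """Return (synonymous, nonsynonymous) codon differences.

    Both inputs are assumed to be aligned, in-frame coding sequences of
    equal length. Codons containing gaps or ambiguous bases are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")

    synonymous = nonsynonymous = 0
    for start in range(0, len(reference) - 2, 3):
        ref_codon = reference[start:start + 3].upper()
        sample_codon = sample[start:start + 3].upper()
        if ref_codon == sample_codon:
            continue
        if ref_codon not in CODON_TABLE or sample_codon not in CODON_TABLE:
            continue  # gap or ambiguous base: make no call
        if CODON_TABLE[ref_codon] == CODON_TABLE[sample_codon]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Toy example (hypothetical sequences, not data from the preprint).
reference = "ATGGATCTG"  # Met, Asp, Leu
sample = "ATGAATCTA"     # Met, Asn, Leu
print(count_codon_changes(reference, sample))  # prints (1, 1)
```

Here the change from GAT to AAT swaps Aspartate for Asparagine (nonsynonymous), while CTG to CTA still codes for Leucine (synonymous). Roughly speaking, the selection tests ask whether amino-acid-changing mutations pile up more or less often than the silent ones would lead you to expect.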

The Real-World Test: The H5N1 Bird Flu Mystery

To prove it works, the team used CAPHEINE to investigate the H5N1 bird flu virus. They had a massive dataset:

  • The Background: Thousands of strains from wild birds (the virus's natural home).
  • The Foreground: Thousands of strains from a recent outbreak in cattle.

They wanted to know: Is the virus changing as it jumps from birds to cows?

The Results:

  • The General Rule: Most of the virus stays the same (it's under "purifying selection," meaning nature keeps it stable because changing too much breaks it).
  • The Exceptions: CAPHEINE found specific "hotspots" where the virus was changing rapidly.
  • The Smoking Gun: They found a specific spot (Site 88) in a gene called M2.
    • In wild birds, this spot usually has one type of amino acid (Aspartate).
    • In cattle, the virus switched to a different amino acid (Asparagine).
    • Why it matters: This gene helps the virus package its genetic material. The change suggests the virus may be adapting its "packing tape" to work better inside a cow's body.
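A comparison like the one at Site 88 starts with a simple tally of which amino acid each host group carries at a given position. The Python sketch below does that for a tiny made-up protein alignment. The records, host labels, and function name are hypothetical, and the toy site is position 3, not the real Site 88.

```python
from collections import Counter, defaultdict


def residues_by_group(records, site):
    """Tally the amino acid at a 1-based alignment position per host group.

    `records` is an iterable of (host_group, aligned_protein_sequence).
    Gap characters ("-") are ignored.
    """
    tallies = defaultdict(Counter)
    for group, sequence in records:
        residue = sequence[site - 1]
        if residue != "-":
            tallies[group][residue] += 1
    return {group: dict(counts) for group, counts in tallies.items()}


# Hypothetical toy alignment (not data from the preprint).
records = [
    ("wild_bird", "MKDL"),
    ("wild_bird", "MKDL"),
    ("wild_bird", "MK-L"),
    ("cattle", "MKNL"),
    ("cattle", "MKNL"),
    ("cattle", "MKDL"),
]
print(residues_by_group(records, site=3))
# prints {'wild_bird': {'D': 2}, 'cattle': {'N': 2, 'D': 1}}
```

A tally like this only describes the pattern. Whether the difference reflects selection, rather than chance or shared ancestry, is the question the statistical tests are designed to address.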

Why Should You Care?

Before tools like CAPHEINE, doing this analysis required a PhD in computer science just to set up the software. It was slow, prone to errors, and hard to repeat.

CAPHEINE is like a self-driving car for evolutionary biology:

  • It's Portable: It runs on Mac, Windows, or Linux (even on supercomputers).
  • It's Reproducible: If you run it today and I run it tomorrow, we get the exact same results.
  • It's Fast: It turns weeks of work into hours.

The Bottom Line

CAPHEINE doesn't just tell us that a virus is evolving; it tells us where, and offers clues about why. It helps scientists spot the exact mutations that might make a virus more dangerous, better at spreading, or resistant to vaccines.

By automating the boring stuff, CAPHEINE lets researchers focus on the big picture: How do we stop the next pandemic before it starts? It's a powerful new lens that helps us see the invisible battle between viruses and the hosts they infect.
