Sequence Design and Phylogenetic Inference with Generative Flow Networks

This is an AI-generated explanation of a preprint that has not been peer-reviewed. It is not medical advice. Do not make health decisions based on this content.

The Big Problem: The "Family Tree" Puzzle

Imagine you are trying to draw a family tree for 1,000 different species of birds. You have their DNA sequences, but you don't know who is related to whom.

Traditionally, scientists do this by lining up all the DNA sequences side-by-side (like aligning paragraphs of text) to see where they match. This is called a Multiple Sequence Alignment (MSA).

The Problem: This is incredibly slow, expensive, and prone to errors. If the alignment is slightly wrong, the whole family tree is wrong. It's like trying to solve a massive jigsaw puzzle where the pieces are all the same color and shape, and you have to force them together before you can see the picture.

The New Idea: The "Flow Network" Factory

The authors propose a new method called AncestorGFN. Instead of lining up the DNA first, they use a type of AI called a Generative Flow Network (GFlowNet).

Think of a GFlowNet not as a puzzle solver, but as a massive, magical factory that builds DNA sequences from scratch.

1. The Factory Floor (The DAG)

Imagine a factory floor where workers build RNA sequences (a cousin of DNA) one letter at a time.

The Start: You start with an empty table (no letters).
The Process: At each step, a worker adds a letter (A, U, G, or C) or swaps one out.
The Goal: The factory has a "Target List" of specific, important RNA sequences (like the let-7 microRNA, which is crucial for life). The factory wants to produce these specific sequences more often than others.

In this factory, the path from "empty table" to "finished product" is called a trajectory. The AI learns the "flow" of traffic: which paths are taken most often to get to the good products.

2. The "Flow" as a Family Tree

Here is the magic trick: The authors realized that the path the factory takes to build a sequence can be read as a kind of evolutionary history.

Traditional View: We look at two finished birds and guess they are related because they look alike.
AncestorGFN View: We look at the factory floor. If two different finished sequences (Bird A and Bird B) both passed through the exact same intermediate station (e.g., a table with the letters "A-G-G-U" on it), the AI infers that they share a "common ancestor" at that station.

The "flow" (how much traffic goes through a specific path) tells us how likely that evolutionary path is. High traffic = a strong evolutionary link.

How They Tested It: The "Let-7" Challenge

To test this, they used the let-7 microRNA family. These are tiny, vital genetic switches found in almost every animal, from worms to humans. Because they are so important, they haven't changed much over millions of years, making them perfect for testing family trees.

The Experiment:

The Setup: They gave the AI a list of 58 different versions of the let-7 sequence found in nature.
The Training: The AI tried to generate these sequences. It didn't just guess; it learned a "reward system." If it got close to a target sequence, it got a small "high five" (partial reward). If it hit the exact target, it got a big "high five."
The Result:
- The Factory Map: When they looked at the map of paths the AI took, they saw clusters. Sequences that are biologically related (like those found in similar species) naturally grouped together in the factory's traffic flow.
- The Ancestors: The AI identified "intermediate stations" that acted as possible common ancestors. For example, it suggested that many different modern sequences could all be traced back through the same specific 4-letter pattern.
- Inventing New Things: When they asked the AI to "beam search" (look for the best paths), it didn't just copy the targets. It invented new, novel sequences that were very close to the real ones. This suggests the AI learned the "rules of the neighborhood" and could design new, functional genetic parts.

Why This Matters (The "Aha!" Moment)

No Alignment Needed: You don't need to line up the DNA first. The AI figures out the relationships while it is building the sequences. It's like learning to speak a language by listening to conversations, rather than memorizing a dictionary first.
Finding the "Missing Links": Traditional trees only show the start (ancestors) and end (modern species). This method shows the middle steps. It visualizes the "stepping stones" evolution took to get from point A to point B.
Designing New Life: Because the AI understands the "flow" of what works, it can suggest new genetic sequences that nature hasn't made yet but might work well. This is huge for drug design and synthetic biology.

The Catch (Limitations)

The authors are honest about the hurdles:

Short Sequences: They tested this on short snippets (10 letters). Real genes are much longer. It's like building a model of a house with Lego bricks; it works for a small shed, but building a skyscraper is much harder.
Qualitative vs. Quantitative: They showed that the AI's map looks like it captures family trees, but they haven't yet measured with numbers how well it matches known trees or the established methods.
Reward Bias: The "family tree" the AI draws is influenced by how they set up the rewards. It's possible the AI is just following the rules they gave it, rather than discovering true biological history.

Summary Analogy

Imagine trying to figure out how a city's subway system evolved.

Old Way: You take a snapshot of every train station today, measure the distance between them, and guess which lines were built first.
AncestorGFN Way: You watch the trains being built in a factory. You see that Train A and Train B both passed through the same assembly line station 50 years ago. You conclude, "Aha! They are related because they share that specific manufacturing history."

This paper suggests that by watching how AI builds genetic sequences, we can accidentally (or intentionally) discover the evolutionary history of life itself, without needing the messy, slow process of aligning DNA first.

1. Problem Statement

Phylogenetic inference, the reconstruction of evolutionary relationships from molecular sequences, faces two primary challenges:

Computational Complexity: The number of candidate tree topologies grows super-exponentially with the number of taxa ($(2n-5)!!$ unrooted binary trees for $n$ species), making exhaustive search infeasible.
Dependence on MSAs: Standard methods (parsimony, maximum-likelihood, Bayesian) rely heavily on Multiple Sequence Alignments (MSAs). MSAs are computationally expensive to generate and prone to errors, which propagate to the final inferred trees.
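To give a feel for how quickly the $(2n-5)!!$ count grows, the short sketch below evaluates it for a few values of $n$. The function name is illustrative and not taken from the paper or its code.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Count unrooted, fully bifurcating topologies on n labelled taxa: (2n-5)!!."""
    if n < 3:
        raise ValueError("the (2n-5)!! formula applies for n >= 3")
    count = 1
    # Double factorial: multiply the odd numbers 3, 5, ..., (2n-5).
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Four taxa give 3 topologies and five give 15, but ten already give more than two million, which is why exhaustive enumeration is ruled out for realistic datasets.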

The authors propose a novel approach to perform simultaneous sequence generation and phylogenetic exploration without requiring explicit MSAs, leveraging the structural properties of Generative Flow Networks (GFlowNets).

2. Methodology: AncestorGFN

The core of the method is AncestorGFN, which adapts GFlowNets to model the generation of RNA sequences as a flow through a Directed Acyclic Graph (DAG).

2.1 State and Action Space

States: Represent RNA sequences. The initial state is an empty sequence ($\epsilon$), and terminal states are complete sequences.
Actions: The model can perform three types of transitions:
1. Insertions: Adding a nucleotide (A, U, G, C).
2. Substitutions: Replacing an existing nucleotide.
3. Deletions: Removing a nucleotide.
  Note: In the 10bp experiments, the action space was restricted to insertions only to improve computational efficiency.
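The state and action space above can be sketched as a function that lists the states reachable in one step. This is a minimal illustration, not the authors' implementation: the name `children`, the `max_len` cap, and the choice to append insertions at the end of the sequence are all assumptions made for the example.

```python
NUCLEOTIDES = "AUGC"


def children(seq: str, max_len: int, insert_only: bool = True) -> set:
    """Return the set of states reachable from `seq` with a single action."""
    out = set()
    if len(seq) < max_len:
        # Insertion: assumed here to append one nucleotide at the end.
        out.update(seq + nt for nt in NUCLEOTIDES)
    if not insert_only:
        # Substitution: replace one existing nucleotide with a different one.
        for i, old in enumerate(seq):
            out.update(seq[:i] + nt + seq[i + 1:] for nt in NUCLEOTIDES if nt != old)
        # Deletion: remove one existing nucleotide.
        out.update(seq[:i] + seq[i + 1:] for i in range(len(seq)))
    return out


print(sorted(children("", max_len=4)))           # the four single-letter states
print(len(children("AU", max_len=4, insert_only=False)))
```

With `insert_only=True` every action lengthens the sequence, so the transition graph is acyclic by construction. The sketch makes no attempt to enforce acyclicity once substitutions and deletions are enabled.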

2.2 Training Objectives

The paper compares three GFlowNet training objectives to learn a policy that samples sequences with probability proportional to a reward function $R(x)$:

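Trajectory Balance (TB): Matches the product of forward transition probabilities along a complete trajectory (scaled by a learned total flow $Z$) to the terminal reward times the product of backward probabilities. Because the only learning signal is the terminal reward, it struggles when rewards are sparse and trajectories are long.
Detailed Balance (DB): Enforces flow consistency on every individual transition rather than on whole trajectories.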
Forward-Looking Detailed Balance (FL-DB): The primary method used. It reparameterizes the flow function to incorporate intermediate energy/reward signals.
- Key Mechanism: Instead of waiting for a terminal reward, FL-DB defines an intermediate energy function $E(s)$ based on sequence similarity to targets. This provides a "partial reward" at every step, enabling better credit assignment and faster convergence in large search spaces.
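The three objectives can be written as squared residuals in log space. The sketch below uses the standard forms from the GFlowNet literature for a single trajectory or transition; it is not taken from the authors' code, and all function and argument names are illustrative. For FL-DB it assumes the usual reparameterization $F(s) = e^{-E(s)}\tilde{F}(s)$.

```python
import math


def tb_loss(log_z, log_pf, log_pb, reward):
    """Trajectory balance: Z * prod(P_F) should equal R(x) * prod(P_B)."""
    lhs = log_z + sum(log_pf)
    rhs = math.log(reward) + sum(log_pb)
    return (lhs - rhs) ** 2


def db_loss(log_f_s, log_pf, log_f_next, log_pb):
    """Detailed balance: F(s) * P_F(s'|s) should equal F(s') * P_B(s|s').

    At a terminal state x, log_f_next is replaced by log R(x).
    """
    return (log_f_s + log_pf - log_f_next - log_pb) ** 2


def fl_db_loss(log_ft_s, log_pf, log_ft_next, log_pb, e_s, e_next):
    """Forward-looking DB: substitute log F(s) = log F~(s) - E(s) into DB.

    The energy difference e_next - e_s is the per-step "partial reward" signal.
    """
    return (log_ft_s + log_pf - log_ft_next - log_pb + e_next - e_s) ** 2


# FL-DB and DB agree once the flow is rewritten through the energy function.
fl = fl_db_loss(0.3, -1.1, -0.2, -0.7, e_s=1.0, e_next=0.4)
db = db_loss(0.3 - 1.0, -1.1, -0.2 - 0.4, -0.7)
print(fl, db)
```

The comparison at the end shows that FL-DB is the same constraint as DB, only with the intermediate energy made explicit so that each transition carries its own learning signal.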

2.3 Phylogenetic Inference via Flow Traceback

Unlike traditional methods that output a tree, AncestorGFN infers phylogeny from the learned flow structure:

Flow Calculation: Edge flows are computed by forward-propagating from the source.
Greedy Traceback: Starting from a target terminal sequence, the algorithm traces back to the root by iteratively selecting the parent node with the maximum incoming flow.
Ancestry Inference: Intersection points of these maximum-flow trajectories represent shared intermediate states, which are interpreted as putative common ancestors.
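The traceback and intersection steps above can be sketched on a toy graph. The edge flows below are invented for illustration and do not come from the paper. The sketch also assumes that "maximum incoming flow" means the flow on the edge from each candidate parent into the current state.

```python
from collections import defaultdict


def greedy_traceback(target, edge_flow, root=""):
    """Walk from `target` back to `root`, always taking the highest-flow parent edge."""
    parents = defaultdict(list)
    for (parent, child), flow in edge_flow.items():
        parents[child].append((flow, parent))
    path = [target]
    while path[-1] != root:
        candidates = parents[path[-1]]
        if not candidates:
            raise ValueError(f"state {path[-1]!r} has no parent in the graph")
        _, best_parent = max(candidates)
        path.append(best_parent)
    return path


def shared_states(path_a, path_b):
    """States on both tracebacks, ordered from most recent to the root."""
    in_b = set(path_b)
    return [s for s in path_a if s in in_b]


# Toy flows (hypothetical): "CG" can be reached from "C" or from "G".
toy_flow = {
    ("", "C"): 6.0, ("", "G"): 4.0,
    ("C", "CG"): 5.0, ("G", "CG"): 3.0,
    ("CG", "CGA"): 4.5, ("CG", "CGU"): 3.5,
}
path_a = greedy_traceback("CGA", toy_flow)
path_b = greedy_traceback("CGU", toy_flow)
print(path_a, path_b, shared_states(path_a, path_b))
```

Here both targets trace back through `CG`, which would be read as their most recent putative common ancestor under this scheme.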

2.4 Reward Design

To guide the model toward biologically relevant sequences, several reward strategies were employed:

Hamming/Alignment Rewards: Based on sequence similarity to target motifs.
Conservation Weighting: For the let-7 microRNA case, rewards were weighted by the number of species a sequence appears in (reflecting purifying selection).
Progressive Rewards: Intermediate rewards decay or adapt dynamically to prevent mode collapse and encourage exploration of diverse sequences.
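One way a similarity reward with conservation weighting could be combined is sketched below. The exact functional form used in the paper is not given here, so the multiplicative weighting, the normalization by the largest species count, the small `eps` floor, and the toy targets are all assumptions. Progressive (decaying) rewards are not covered.

```python
def hamming_similarity(seq, target):
    """Fraction of target positions matched; missing positions count as mismatches."""
    matches = sum(a == b for a, b in zip(seq, target))
    return matches / len(target)


def reward(seq, targets, eps=1e-3):
    """Best conservation-weighted similarity to any target.

    `targets` maps each target sequence to the number of species it appears in.
    `eps` keeps the reward strictly positive, as GFlowNet objectives require.
    """
    max_count = max(targets.values())
    best = max(
        hamming_similarity(seq, t) * (count / max_count)
        for t, count in targets.items()
    )
    return best + eps


# Hypothetical targets with made-up species counts.
toy_targets = {"AUGC": 10, "GGGG": 5}
for s in ("AUGC", "GGGG", "AU"):
    print(s, reward(s, toy_targets))
```

In this toy setup an exact hit on the widely conserved target scores highest, an exact hit on the less conserved target scores about half as much, and a partial sequence such as `AU` still earns a non-zero reward.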

3. Key Contributions

Reframing GFlowNets for Phylogeny: The authors propose viewing GFlowNet flow trajectories as a lens for qualitative phylogenetic analysis, where shared intermediate states suggest common ancestry without explicit tree topology optimization.
FL-DB with Intermediate Rewards: Demonstrated that Forward-Looking Detailed Balance with carefully designed intermediate rewards (e.g., similarity-based) enables effective exploration of large sequence spaces where sparse rewards fail.
Alignment-Free Inference: Established a proof-of-concept for inferring evolutionary relationships directly from generative trajectories, bypassing the need for error-prone MSAs.
De Novo Sequence Design: Showed that beam search at inference time can discover novel sequences clustering near known functional targets, bridging generative modeling with sequence design.

4. Experimental Results

Case Study 1: Short RNA Sequences (4bp)

Objective: Compare TB, DB, and FL-DB on a small state space with 13 target motifs.
Findings:
- Convergence: FL-DB and DB converged faster than TB due to localized gradient signals.
- Reward: FL-DB achieved higher mean rewards than TB/DB because partial rewards guided exploration more effectively.
- Phylogeny: Greedy traceback revealed that distinct target sequences (e.g., CCCA and GGGG) shared common ancestral states (e.g., CC or GG), qualitatively matching evolutionary intuition.

Case Study 2: Long Sequences and let-7 MicroRNA (10bp)

Scale: Expanded to $4^{10}$ ($\approx 1$ million) possible sequences.
Coverage: On 100 random targets, FL-DB found 10/100 unique targets, while TB found only 2/100, highlighting the necessity of intermediate rewards in sparse landscapes.
let-7 Family:
- Trained on 58 unique, highly variable 10bp regions from 612 let-7 sequences across 107 species.
- Coverage: Achieved 74.1% (43/58) coverage of unique targets within 500 iterations.
- Correlation: A significant positive correlation ($\rho = 0.509$) was found between sampling frequency and species conservation count, indicating the model learned evolutionary constraints.
Structural Comparison:
- Traditional (UPGMA): Clusters terminal sequences by Hamming distance (Figure 3).
- GFlowNet (DAG): Visualizes the state-space DAG (Figure 4). Unlike trees, the DAG reveals shared intermediate states (ancestors) learned during generation, offering a generative view of relationships.
Novel Sequence Design:
- Beam search ($k=20$) at inference time yielded 5 known targets and 15 novel sequences.
- Novel sequences clustered near known targets (1–2 Hamming distance), suggesting the model learned meaningful sequence neighborhoods rather than random patterns.
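The beam search and the Hamming-distance check on its output can be sketched as follows. This is a toy illustration under stated assumptions: the `score` function stands in for whatever the trained model would use to rank partial sequences, the targets are invented, and insertions are assumed to append at the end.

```python
NUCLEOTIDES = "AUGC"


def beam_search(score, length, k):
    """Grow sequences one nucleotide at a time, keeping the k best prefixes."""
    beam = [""]
    for _ in range(length):
        candidates = [prefix + nt for prefix in beam for nt in NUCLEOTIDES]
        candidates.sort(key=score, reverse=True)
        beam = candidates[:k]
    return beam


def min_hamming(seq, targets):
    """Hamming distance from `seq` to its nearest target of the same length."""
    return min(sum(a != b for a, b in zip(seq, t)) for t in targets)


# Hypothetical targets and a stand-in score: fewer mismatches is better.
toy_targets = ["AUGCA", "AUGGA"]


def toy_score(prefix):
    return -min(sum(a != b for a, b in zip(prefix, t)) for t in toy_targets)


beam = beam_search(toy_score, length=5, k=4)
known = [s for s in beam if s in toy_targets]
novel = [s for s in beam if s not in toy_targets]
print(known, novel, [min_hamming(s, toy_targets) for s in novel])
```

With this toy score the beam recovers both targets and fills the remaining slots with sequences one mismatch away, which mirrors the pattern reported above without reproducing the actual experiment.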

5. Significance and Limitations

Significance:

New Paradigm: Offers a "generative" approach to phylogenetics, where the evolutionary tree is an emergent property of the generation process rather than a separate optimization target.
Scalability: Demonstrates that partial reward signals are critical for scaling generative models to larger discrete spaces.
Design Tool: The ability to generate novel sequences that sit close to known functional targets via beam search opens avenues for de novo RNA design.

Limitations:

Sequence Length: Experiments were limited to 10bp; scaling to full-length miRNAs (22bp) or proteins remains computationally challenging.
Qualitative Evaluation: Phylogenetic validation is currently qualitative. There is no quantitative benchmark against ground-truth trees (e.g., using Robinson-Foulds distance) or standard tools (RAxML, MrBayes).
Reward Bias: The inferred "ancestors" may reflect the geometry of the reward function rather than true evolutionary history.
Data Preparation: While the inference is alignment-free, the let-7 dataset preparation relied on positionally indexed sequences from MirGeneDB, implying some implicit alignment assumption.

Future Directions:
The authors suggest defining explicit procedures to extract tree-like objects (e.g., maximum-flow arborescence) for quantitative evaluation, incorporating phylogenetic likelihood models as rewards, and testing on simulated datasets with known ground-truth trees.

Reproducibility:
Code is available at https://github.com/qhuang20/gflownet-seq-gen, and all experimental data is provided.