Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training

This paper introduces RapTB, a Rooted absorbed prefix Trajectory Balance objective that sharpens credit assignment for early prefixes, and SubM, a submodular replay strategy that mitigates distribution shift. Together, they address mode collapse and improve diversity in GFlowNet-based LLM fine-tuning.

Xi Wang, Wenbo Lu, Shengjie Wang

Published 2026-03-03

Imagine you are teaching a very talented but slightly confused robot chef to invent new recipes. Your goal isn't just to find one perfect dish; you want the robot to explore a huge variety of delicious, unique, and valid recipes, giving more attention to the ones that taste better.

This is what GFlowNets do: they are AI systems designed to generate diverse, high-quality solutions (like new molecules or sentences) rather than just picking the single "best" one.

However, the paper explains that these robot chefs often get stuck in a rut, in three ways:

  1. Stop too early: They decide the first few words of a sentence are the whole story (Prefix Collapse).
  2. Get obsessed with length: They only write very short or very long sentences, ignoring the middle ground (Length Bias).
  3. Only learn from their favorites: If they accidentally make one great dish, they keep making that exact same dish over and over, forgetting to try new things (Replay Bias).

The authors propose two new tools to fix this: RapTB and SubM.

1. RapTB: The "Rooted Guide" (Fixing the Learning Signals)

The Problem:
Imagine the robot chef is building a tower of blocks. The reward (a gold star) only comes at the very top when the tower is finished. If the tower falls, the robot gets no feedback on which block it placed wrong. It's like guessing in the dark. The robot learns to just copy the first few blocks of the few towers that happened to stand, leading to "mode collapse" (everyone building the same short tower).

The Solution (RapTB):
RapTB is like a wise mentor who doesn't wait until the tower is finished to give feedback.

  • Rooted: The mentor checks the tower starting from the very bottom (the root) every time.
  • Absorbed: If the robot builds a great top section, the mentor "absorbs" that success and says, "Hey, the bottom part you built earlier was also good because it led to this great top!"
  • Trajectory Balance: It ensures that every step the robot takes is consistent with the final goal, but it does so in a way that doesn't confuse the robot about when to stop building.

In Simple Terms: Instead of waiting for the final grade to tell the student they did well, RapTB gives them a "partial credit" score at every step, based on how well that step could lead to a great ending. This stops the robot from just copying the first few words of a lucky guess.
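To make this concrete, here is a minimal sketch of the standard Trajectory Balance (TB) loss that RapTB builds on. This is the vanilla objective, not the paper's RapTB modification; the function and its toy inputs are illustrative assumptions. TB asks that the forward policy's probability of building a trajectory, scaled by a learned partition estimate Z, match the reward of the finished object:

```python
import math

def trajectory_balance_loss(log_Z, log_pf_steps, log_pb_steps, log_reward):
    """Standard Trajectory Balance loss for one complete trajectory.

    log_Z        : learned log-partition estimate
    log_pf_steps : log P_F(s_t -> s_{t+1}) for each forward step
    log_pb_steps : log P_B(s_t | s_{t+1}) for each backward step
    log_reward   : log R(x) of the terminal object
    """
    lhs = log_Z + sum(log_pf_steps)       # flow claimed by the forward policy
    rhs = log_reward + sum(log_pb_steps)  # flow implied by the reward
    return (lhs - rhs) ** 2               # squared mismatch: zero when balanced

# A perfectly balanced toy trajectory has zero loss: the forward policy
# assigns probability 0.25 to a trajectory whose terminal reward is 0.25.
loss = trajectory_balance_loss(
    log_Z=0.0,
    log_pf_steps=[math.log(0.5), math.log(0.5)],  # P_F product = 0.25
    log_pb_steps=[0.0, 0.0],                      # deterministic backward policy
    log_reward=math.log(0.25),
)
print(round(loss, 6))  # 0.0
```

Because the squared error is computed only over whole trajectories, every intermediate step receives the same blunt signal, which is exactly the "guessing in the dark" problem RapTB's per-prefix credit is designed to fix.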

2. SubM: The "Curated Library" (Fixing the Memory)

The Problem:
Imagine the robot has a notebook (a replay buffer) where it writes down the recipes it tried. Usually, it just writes down the "best" recipes. But if it finds one amazing chocolate cake, it writes that down 100 times and forgets about the lasagna or the salad. The robot stops learning because its notebook is full of duplicates.

The Solution (SubM):
SubM is a smart librarian who curates the notebook.

  • Submodular: This is a fancy math word for "diminishing returns." It means the librarian knows that having 100 copies of the same chocolate cake adds zero value.
  • The Strategy: The librarian looks at the new recipes and asks: "Is this new cake different from what we already have? Does it have a good score? Is it a different length?"
  • The Result: The librarian keeps a mix of high-scoring recipes, but ensures they are all different from each other. If the notebook is full of short cakes, the librarian makes room for a long lasagna, even if the cake was slightly better.
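The "smart librarian" can be sketched as a greedy selection with diminishing returns. This is a generic facility-location-style greedy, not the paper's exact SubM objective; the `similarity` function and the score-plus-novelty gain are illustrative assumptions:

```python
def greedy_submodular_select(candidates, k, similarity):
    """Greedily pick k items, trading off score against redundancy.

    An item's marginal gain shrinks as similar items enter the buffer
    (diminishing returns), so near-duplicates add almost nothing.
    """
    selected = []
    while len(selected) < min(k, len(candidates)):
        def gain(item):
            # reward term + novelty term: how much new ground the item covers
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return item["score"] + (1.0 - redundancy)
        best = max((c for c in candidates if c not in selected), key=gain)
        selected.append(best)
    return selected

# Toy buffer: two near-duplicate cakes and one very different lasagna.
recipes = [
    {"name": "cake A", "score": 0.90},
    {"name": "cake B", "score": 0.89},  # almost identical to cake A
    {"name": "lasagna", "score": 0.70}, # lower score, but novel
]

def sim(a, b):
    return 0.95 if "cake" in a["name"] and "cake" in b["name"] else 0.1

chosen = greedy_submodular_select(recipes, k=2, similarity=sim)
print([r["name"] for r in chosen])  # ['cake A', 'lasagna']
```

Note the result: the second cake scores higher than the lasagna on its own, but once one cake is in the buffer its marginal gain collapses, so the librarian shelves the lasagna instead.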

In Simple Terms: SubM forces the robot's memory to be diverse. It prevents the robot from getting stuck in a loop of repeating the same few "winning" ideas, ensuring it explores the whole kitchen.

The Big Picture: What Happens When You Combine Them?

When you use RapTB (the smart mentor) and SubM (the diverse librarian) together:

  • The Robot Stops Collapsing: It stops making the same short, boring sentences over and over.
  • It Explores More: It tries longer, more complex structures (like long molecules or full stories) without getting confused.
  • It Finds Better Solutions: Because it's exploring more ground, it finds more unique, high-quality recipes that it would have missed otherwise.

The Analogy Summary:
Think of training an AI like training a jazz band.

  • Old Way: The band only plays the one song that got the biggest applause last night, over and over again, getting worse and worse because they aren't practicing anything new.
  • RapTB: The conductor gives feedback on every single note played, not just at the end of the song, helping the musicians understand why a specific note worked.
  • SubM: The setlist is managed by a DJ who ensures the band plays a mix of fast, slow, loud, and quiet songs, rather than just playing the same hit song 50 times in a row.

The result? A band that can improvise, play complex solos, and keep the audience entertained with a fresh, diverse set of music.
