Evaluating GFlowNet from partial episodes for stable and flexible policy-based training

Imagine you are an architect surveying every neighborhood a new city could contain. Each possible neighborhood is a "combinatorial candidate," and some neighborhoods are much more desirable than others (they have a high "reward" or score).

The challenge is that the number of possible neighborhoods is so huge (like the number of grains of sand on all the beaches in the world) that you can't just draw them all and pick the best ones. You need a smart guide, a GFlowNet, to help you explore this vast landscape and visit neighborhoods in proportion to how desirable they are.

This paper introduces a new, smarter way to train this guide. Here is the breakdown using simple analogies.

1. The Problem: The "Blind Guide" vs. The "Map"

In the world of GFlowNets, there are two main ways to train the guide (the policy):

The "Map" Approach (Value-Based): Imagine you are trying to draw a map of water flow. You want the water to flow from the start of the city to the best neighborhoods in proportion to how good they are. You check if the water flow matches the "ideal" flow at every intersection. This is reliable, but it's like trying to balance a giant, complex water system; it can be rigid and hard to tweak.
The "Guide" Approach (Policy-Based): Instead of drawing a map, you train a tour guide. The guide learns by walking through the city, making mistakes, and getting corrected. The problem with this method is: How do you know if the guide is doing a good job?
- In the past, the "scorecard" used to judge the guide was often shaky or unreliable. It was like asking the guide, "How far are we from the destination?" and getting a vague, noisy answer. This made the guide's training unstable and slow.

2. The Solution: The "Sub-EB" Scorecard

The authors of this paper realized that the "Map" approach (checking water flow) and the "Guide" approach (checking the tour guide) are actually two sides of the same coin.

They invented a new scorecard called Sub-EB (Subtrajectory Evaluation Balance).

The Analogy:
Imagine you are training a hiker to find the best scenic spots in a massive forest.

Old Method: You ask the hiker, "How good is this spot?" and they guess based on a shaky compass. Sometimes they guess right, sometimes wrong. You have to restart the training often.
New Method (Sub-EB): Instead of just guessing the final score, you check the hiker's path step-by-step. You ask: "If you walked from point A to point B, does the 'flow' of your journey match the 'flow' of the ideal path?"

The magic of Sub-EB is that it uses the same mathematical rules that make the "Map" approach work (flow balance) to create a principled scorecard for the "Guide."

It tells the guide exactly how far off they are, not just at the end of the trip, but at every single turn along the way.
This makes the training stable (the guide doesn't get confused) and flexible (the guide can learn from different types of data).

3. Why This Matters: Three Superpowers

The paper shows that using this new scorecard gives the training process three major superpowers:

A. It's More Stable (No More Wobbly Legs)

Think of the old training method as a tightrope walker on a windy day. They might make it across, but they wobble a lot. The new method is like a tightrope walker with a balancing pole on a calm day. The guide learns faster and doesn't crash as often. In the experiments, the new method converged (finished learning) more quickly and more reliably than the old ways.

B. It Can Learn "Backwards" (The Time Traveler)

Usually, you can only train a guide by watching them walk forward. But with Sub-EB, you can also train the "backward policy" (imagine a guide who knows how to walk backward from the destination to the start).

Analogy: It's like teaching a driver not just how to drive forward, but also how to reverse perfectly. This helps the system understand the structure of the city better. The old methods struggled to do this without breaking, but Sub-EB handles it smoothly.

C. It Can Use "Old Maps" (Offline Learning)

Usually, the guide has to explore the forest while you are training them (Online). If you want to use a map drawn by someone else (Offline data), the old methods got confused.

Analogy: Imagine you are training a new chef. The old way required the chef to taste every dish while cooking it. The new way (Sub-EB) allows you to say, "Here is a list of dishes a famous chef made last year. Learn from that list, and then go cook."
This is huge because it means you can use existing data to speed up training without needing to generate new data for every single step.

4. The Results: Proving it Works

The authors tested this new method on three very different "forests":

Hypergrids: A giant, abstract grid of numbers. (Like a massive maze).
Sequence Design: Designing DNA strands or chemical molecules. (Like writing a perfect sentence or building a specific Lego structure).
Bayesian Networks: Figuring out how different variables in a system connect. (Like solving a complex mystery where clues are linked).

Across these tasks, the new method (Sub-EB) generally found better solutions and found them faster than the previous best methods, while keeping the variety of good solutions competitive.

Summary

The Paper in a Nutshell:
The authors found a way to use the "physics of flow" (how water moves through a pipe) to create a principled "report card" for training AI guides. This new report card (Sub-EB) makes the AI learn faster and more stably, and it lets the AI use old data and learn backwards, two things that were previously too messy or difficult to handle.

It's like upgrading from a compass that spins in the wind to a GPS that knows exactly where you are at every second of your journey.

1. Problem Statement

Generative Flow Networks (GFlowNets) are designed to sample combinatorial objects $x$ from a space $\mathcal{X}$ with probability proportional to a reward function $R(x)$. Training GFlowNets typically involves two paradigms:

Value-Based Methods: These enforce flow balance conditions (e.g., Sub-Trajectory Balance, Sub-TB) to match forward and backward flows. While robust, they often rely on off-policy data collection and may struggle with deep exploration.
Policy-Based Methods: These utilize an Actor-Critic framework where a "Critic" (evaluation function $V(s)$) estimates the Kullback-Leibler (KL) divergence between forward and backward sub-trajectories to guide the "Actor" (forward policy $\pi_F$).
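
For reference, the sampling target and the Sub-TB condition named above are commonly written as follows in the GFlowNet literature. The normalizing constant $Z$ and the sub-trajectory notation $\tau_{i:j}$ (the segment from state $s_i$ to state $s_j$) are assumed here to match the formulas used later in this summary; they are not quoted from the paper.
$P(x) = \frac{R(x)}{Z}, \qquad Z = \sum_{x' \in \mathcal{X}} R(x')$
$F(s_i) \, P_F(\tau_{i:j}|s_i) = F(s_j) \, P_B(\tau_{i:j}|s_j)$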

The Core Challenge: In policy-based training, reliably learning the evaluation function $V(s)$ is difficult. Existing methods (e.g., using $\lambda$-Temporal Difference objectives) often suffer from high variance or bias, require fixed backward policies ($\pi_B$), and struggle to integrate offline data collection techniques effectively. The paper argues that the relationship between the state flow function $F(s)$ (used in value-based methods) and the evaluation function $V(s)$ (used in policy-based methods) has not been fully exploited to create a stable learning objective for $V$.

2. Methodology: Subtrajectory Evaluation Balance (Sub-EB)

The authors bridge the gap between value-based and policy-based perspectives by deriving a new condition called Subtrajectory Evaluation Balance (Sub-EB).

Theoretical Foundation

Connection between Flow and Evaluation: The paper proves that for a fixed forward policy $\pi_F$, the state flow function $F(s)$ that satisfies the flow balance condition coincides with the exact KL divergence between forward and backward sub-trajectories starting from $s$.
The Sub-EB Condition: This leads to a balance condition for the evaluation function $V(s)$ :
$\mathbb{E}_{P_F(\tau_{i:j})} \left[ \log \left( P_F(\tau_{i:j}|s_i) \exp V(s_i) \right) \right] = \mathbb{E}_{P_F(\tau_{i:j})} \left[ \log \left( P_B(\tau_{i:j}|s_j) \exp V(s_j) \right) \right]$
This condition implies that the expected difference in learned evaluations, $V(s_i) - V(s_j)$, must match the expected log-ratio of backward to forward probabilities over the sub-trajectory between states $s_i$ and $s_j$.
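
Splitting the logarithms in the Sub-EB condition and collecting the $V$ terms on one side makes this explicit. The step below is plain algebra on the condition as stated in this summary, not an additional result from the paper:
$\mathbb{E}_{P_F(\tau_{i:j})} \left[ V(s_i) - V(s_j) \right] = \mathbb{E}_{P_F(\tau_{i:j})} \left[ \log \frac{P_B(\tau_{i:j}|s_j)}{P_F(\tau_{i:j}|s_i)} \right]$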

The Sub-EB Objective

Based on the condition, the authors propose a new loss function to learn $V$ :
$L_V(\phi) = \mathbb{E}_{P_F(\tau)} \left[ \sum_{\tau_{i:j}} w_{j-i} \left( \delta_V(\tau_{i:j}; \phi) \right)^2 \right]$
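Here $\phi$ denotes the parameters of $V$, $w_{j-i}$ is a weight that depends on the sub-trajectory length $j-i$, and $\delta_V$ is the log-ratio mismatch of the sub-trajectory flows weighted by the evaluation function.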
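
This summary does not write $\delta_V$ out. One reading that is consistent with the Sub-EB condition above, assumed here for illustration rather than quoted from the paper, is the log-ratio of the two sides of that condition:
$\delta_V(\tau_{i:j}; \phi) = \log \frac{P_F(\tau_{i:j}|s_i) \exp V(s_i; \phi)}{P_B(\tau_{i:j}|s_j) \exp V(s_j; \phi)}$
Under this reading, $\delta_V = 0$ on every sampled sub-trajectory is sufficient for the condition to hold in expectation.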

Key Difference from $\lambda$-TD: Unlike traditional $\lambda$-TD objectives that focus on edge-wise mismatches and events starting at a specific step, Sub-EB utilizes sub-trajectory-wise mismatches. It incorporates information from events both before and after a state $s$, leading to a more balanced and stable estimation of $V$.
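
To make the sub-trajectory-wise mismatch concrete, here is a minimal Python sketch that evaluates a weighted squared mismatch over all sub-trajectories of one sampled trajectory. Several things in it are assumptions and not taken from the paper: the mismatch is read as the log-ratio of the two sides of the Sub-EB condition, the weights are geometric (`lam ** (j - i)`, normalized, borrowed from common Sub-TB practice), and the function name `sub_eb_loss` and its arguments are hypothetical. Boundary conditions at terminal states and gradient updates are left out.

```python
import math

def sub_eb_loss(log_pf, log_pb, values, lam=0.9):
    """Weighted squared sub-trajectory mismatch for one sampled trajectory.

    log_pf[k]: log pi_F(s_{k+1} | s_k) for step k = 0..n-1
    log_pb[k]: log pi_B(s_k | s_{k+1}) for the same step
    values[k]: V(s_k) for k = 0..n
    lam:       hypothetical geometric weight, w_{j-i} proportional to lam**(j-i)
    """
    n = len(log_pf)
    assert len(log_pb) == n and len(values) == n + 1
    # Prefix sums turn each sub-trajectory log-probability into a difference.
    cum_f, cum_b = [0.0], [0.0]
    for f, b in zip(log_pf, log_pb):
        cum_f.append(cum_f[-1] + f)
        cum_b.append(cum_b[-1] + b)
    total, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            # Assumed mismatch: log of (P_F * exp V(s_i)) / (P_B * exp V(s_j)).
            forward_side = values[i] + (cum_f[j] - cum_f[i])
            backward_side = values[j] + (cum_b[j] - cum_b[i])
            delta = forward_side - backward_side
            w = lam ** (j - i)
            total += w * delta ** 2
            weight_sum += w
    return total / weight_sum

# Toy two-step trajectory with made-up probabilities.
log_pf = [math.log(0.5), math.log(0.25)]
log_pb = [0.0, math.log(0.5)]
print(sub_eb_loss(log_pf, log_pb, [0.0, 0.0, 0.0]))
```

Because every pair $(i, j)$ contributes, each $V(s_k)$ appears both as the start of some sub-trajectories and as the end of others, which is the "before and after" coupling described above; an edge-wise objective would only use the pairs with $j = i + 1$.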

Extensions

Parameterized Backward Policy ($\pi_B$): The Sub-EB objective allows $\pi_B$ to be parameterized and updated jointly with $V$ and $\pi_F$ within a single phase. This removes the need for complex two-phase algorithms or fixed backward policies required by previous policy-based methods.
Offline Policy-Based Training: The authors extend the framework to an offline setting (Algorithm 2). By defining a backward evaluation function $W$ , they enable the use of a data-collection policy $\pi_D$ distinct from $\pi_F$ . This allows the integration of offline data and advanced exploration techniques (like local search) into policy-based training without breaking the theoretical guarantees.

3. Key Contributions

Theoretical Bridge: Established a rigorous mathematical connection between the state flow function $F$ (value-based) and the evaluation function $V$ (policy-based), proving that flow balance conditions yield a principled policy evaluator.
Sub-EB Objective: Proposed a new training objective that learns $V$ using sub-trajectory balance, offering superior stability and reliability compared to $\lambda$-TD.
Flexibility: Demonstrated that Sub-EB naturally supports parameterized backward policies and offline data collection, overcoming major limitations of existing policy-based GFlowNet methods.
Comprehensive Evaluation: Validated the method across synthetic (Hypergrids) and real-world tasks (Biological sequence design, Molecular graph design, Bayesian Network structure learning).

4. Experimental Results

The authors compared Sub-EB against state-of-the-art baselines:

Baselines: Value-based (Sub-TB, Q-Much), Policy-based (RL with $\lambda$-TD, CV), and variants with local search (Sub-TB-B, Sub-EB-B).
Metrics: Total Variation Distance ($D_{TV}$), Jensen-Shannon Divergence ($D_{JSD}$), Average Reward, Diversity, and Mode Accuracy.
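
The two distribution metrics can be illustrated on a toy space small enough to enumerate. The sketch below uses the standard textbook definitions of total variation distance and Jensen-Shannon divergence (natural logarithm); whether the paper uses exactly these conventions, for example a different log base, is an assumption. The helper names and the toy reward values are made up for illustration.

```python
import math

def normalize(weights):
    """Turn non-negative rewards or counts into a probability vector."""
    total = sum(weights)
    return [w / total for w in weights]

def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    """KL(p || q), skipping terms where p is zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Target distribution proportional to a made-up reward over three objects.
target = normalize([1.0, 2.0, 1.0])
# Hypothetical distribution produced by a trained sampler.
learned = [0.25, 0.25, 0.5]
print(total_variation(target, learned), jensen_shannon(target, learned))
```

Both quantities are zero when the sampler matches the reward-proportional target exactly, so lower values indicate better distribution modeling.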

Key Findings:

Stability & Convergence: On Hypergrid tasks ($256 \times 256$, $128 \times 128 \times 128$), Sub-EB significantly outperformed the standard RL ($\lambda$-TD) method in terms of convergence speed and training stability, achieving lower $D_{TV}$ and $D_{JSD}$.
Parameterized $\pi_B$: In ablation studies, Sub-EB with a parameterized backward policy (Sub-EB-P) achieved the best performance and stability, confirming its ability to jointly optimize $\pi_B$ and $V$.
Offline Learning: The offline Sub-EB variant (Sub-EB-B) successfully integrated local search techniques. In molecular and sequence design, it discovered more high-reward modes (better mode discovery) than standard methods, though with a slight trade-off in distribution modeling accuracy (expected due to the focus on high-reward regions).
Scalability: In Bayesian Network structure learning (up to 15 nodes, $10^{35}$ states) and molecular graph design, Sub-EB achieved the highest average rewards and fastest convergence while maintaining competitive diversity. It outperformed Sub-TB and Q-Much in large-scale combinatorial spaces.

5. Significance

This work addresses a critical bottleneck in policy-based GFlowNet training: the reliable estimation of the evaluation function $V$. By deriving the Sub-EB condition, the authors provide a theoretically grounded objective that:

Stabilizes Training: Reduces the variance and bias issues inherent in previous policy-gradient estimators.
Unifies Frameworks: Allows for the seamless integration of offline data, parameterized backward policies, and advanced exploration strategies within a single, coherent policy-based framework.
Enables Real-World Application: Demonstrates that policy-based methods, when equipped with Sub-EB, can scale effectively to massive combinatorial spaces (e.g., drug discovery, network structure learning) where value-based methods might struggle with exploration or where off-policy techniques are necessary.

The paper concludes that Sub-EB represents a significant step forward, making policy-based GFlowNet training more robust, flexible, and applicable to complex, real-world generative tasks.