Learning Shortest Paths with Generative Flow Networks

This paper proposes a novel framework that leverages Generative Flow Networks with flow regularization to solve shortest path problems in non-acyclic graphs, demonstrating competitive performance in permutation environments and Rubik's Cube solving with reduced test-time search budgets.

Nikita Morozov, Ian Maksimov, Daniil Tiapkin, Sergey Samsonov

Published 2026-03-03

Imagine you are standing in a massive, dark, and confusing maze. Your goal is to find the quickest way out. Usually, to solve this, you might try to memorize the entire map, or you might use a flashlight to check every single path one by one until you find the exit. This is how traditional computer algorithms work: they are like careful, slow explorers checking every turn.

But what if you could train a super-intelligent guide who doesn't just know the map, but intuitively knows the shortest route to the exit from anywhere in the maze, without ever needing to check the wrong turns?

That is exactly what this paper proposes using a new AI tool called GFlowNets (Generative Flow Networks). Here is the breakdown of their idea using simple analogies.

1. The Problem: The "Infinite Loop" Trap

In many complex puzzles (like a Rubik's Cube), you can make a move, then undo it, then make it again. This creates a "cycle."

  • Older AI methods often assume the puzzle looks like a one-way map (a tree or a directed acyclic graph), where you can never return to a state you have already visited. They struggle when you can go in circles.
  • The New Idea: The authors realized that if you teach an AI to be "lazy" about how long its journey takes, it will naturally learn to take the shortest path.

2. The Core Insight: The "Lazy Traveler"

Imagine you are training a robot to walk from the start of a maze to the finish.

  • Standard Training: You tell the robot, "Go to the finish!" It might wander around, take a detour, go in a circle, and eventually get there. It learns a way, but maybe not the best way.
  • The Paper's Trick: The authors add a "penalty" for walking too far. They tell the robot: "You get a reward for reaching the goal, but you lose points for every single step you take."

The Magic:
If the robot wants to maximize its score (get the reward while losing the fewest points), it is forced to stop wandering. It realizes that the only way to win is to never take a step that doesn't bring it closer to the goal.

  • If it takes a wrong turn, it wastes energy.
  • If it loops back, it wastes energy.
  • Result: The robot's brain (the AI policy) eventually learns to assign zero probability to any path that isn't the shortest one. It literally stops "thinking" about the long, winding roads.
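The "lazy traveler" effect can be seen in a toy sketch. This is not the paper's GFlowNet training procedure; it is a hypothetical five-node maze where every path from start to goal is weighted by a per-step penalty, mimicking a terminal reward that shrinks with every extra step. As the penalty grows, the probability mass concentrates on the shortest route.

```python
import math

# Hypothetical toy maze as an adjacency list (names are made up for illustration).
# Short route: S -> A -> G (2 steps). Long route: S -> B -> C -> G (3 steps).
graph = {
    "S": ["A", "B"],
    "A": ["G"],
    "B": ["C"],
    "C": ["G"],
    "G": [],
}

def all_paths(node, goal, path=None):
    """Enumerate all simple paths from node to goal (skipping cycles)."""
    path = (path or []) + [node]
    if node == goal:
        yield path
        return
    for nxt in graph[node]:
        if nxt not in path:
            yield from all_paths(nxt, goal, path)

def path_probs(step_penalty):
    """Weight each path by exp(-penalty * steps), then normalize.

    This mimics a terminal reward that loses points for every step taken."""
    paths = list(all_paths("S", "G"))
    weights = [math.exp(-step_penalty * (len(p) - 1)) for p in paths]
    total = sum(weights)
    return {tuple(p): w / total for p, w in zip(paths, weights)}

for lam in [0.1, 1.0, 5.0]:
    probs = path_probs(lam)
    best = max(probs, key=probs.get)
    print(lam, best, round(probs[best], 3))
```

With a tiny penalty the two routes get nearly equal probability; with a large penalty the shortest route takes almost all of it, which is the sense in which the policy "stops thinking" about the detours.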

3. The "Backwards" Approach

Here is the cleverest part of their method.
Usually, when you solve a maze, you start at the beginning and try to find the exit.

  • The Paper's Strategy: They teach the AI to work backwards.
    • Imagine the "Exit" is a special "Sink" (a black hole) that sucks everything in.
    • The AI learns to start at the Exit and walk backwards to the Start.
    • Because the AI is penalized for taking too many steps, it learns the most direct route backwards.
    • Once it learns this, you just flip the map over. Now, when you are at the Start, the AI knows exactly which way to go to reach the Exit in the fewest steps.
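The backwards idea can be illustrated with a minimal sketch (again, not the paper's learned backward policy): run a breadth-first search outward from the goal over a small hypothetical cyclic state graph to get every state's distance-to-goal, then "flip the map" with a forward greedy policy that always steps to a neighbor closer to the goal.

```python
from collections import deque

# Hypothetical cyclic state graph (every move is reversible, like a puzzle).
edges = {
    0: [1, 2],
    1: [0, 3],
    2: [0, 3],
    3: [1, 2, 4],
    4: [3],
}
GOAL = 4

def dist_to_goal():
    """BFS outward from the goal: 'walking backwards from the exit'."""
    dist = {GOAL: 0}
    queue = deque([GOAL])
    while queue:
        s = queue.popleft()
        for nxt in edges[s]:
            if nxt not in dist:
                dist[nxt] = dist[s] + 1
                queue.append(nxt)
    return dist

def greedy_path(start):
    """Flip the map: from any state, step to the neighbor closest to the goal."""
    dist = dist_to_goal()
    path = [start]
    while path[-1] != GOAL:
        path.append(min(edges[path[-1]], key=dist.get))
    return path

print(greedy_path(0))  # prints [0, 1, 3, 4]
```

In the paper the exact distances are not computed by brute force; the point of the sketch is only the direction of learning: knowledge gathered outward from the goal turns into a forward policy that never wastes a step.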

4. Real-World Tests: Rubik's Cubes and Swapping

The authors tested this on two very hard problems:

  1. The Swap Puzzle: Imagine a line of people in random order. You can only swap neighbors. How do you get them in order? The AI learned to sort them in close to the fewest possible swaps.
  2. The Rubik's Cube: This is the ultimate maze. There are billions of possible positions.
    • The Competition: Other AI methods (like DeepCubeA) use a "search" strategy. They look ahead at many possibilities, like a chess player calculating 10 moves ahead. This takes a lot of computer power and time.
    • The GFlowNet Result: Their AI learned the "intuition" of the shortest path.
    • The Win: When solving a Rubik's Cube, their method found solutions just as short as the best existing methods, but with a much smaller test-time search budget. It was like comparing a hiker who checks every trail (other AIs) to a hiker who has a GPS that only shows the direct path (this new AI).
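For the swap puzzle there is a classical baseline worth knowing: the minimum number of adjacent swaps needed to sort a sequence equals its inversion count, and a bubble-sort-style policy achieves exactly that number, since each swap removes exactly one inversion. A small sketch (this is the textbook result, not the paper's learned policy):

```python
def inversions(seq):
    """Minimum number of adjacent swaps to sort = number of inversions."""
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )

def sort_by_swaps(seq):
    """Greedy 'shortest path' policy: swap any adjacent out-of-order pair.

    Each swap removes exactly one inversion, so the route is optimal."""
    seq = list(seq)
    moves = 0
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(seq) - 1):
            if seq[i] > seq[i + 1]:
                seq[i], seq[i + 1] = seq[i + 1], seq[i]
                moves += 1
                swapped = True
    return seq, moves

line = [3, 1, 4, 2]
print(inversions(line))     # prints 3 (inversions: (3,1), (3,2), (4,2))
print(sort_by_swaps(line))  # prints ([1, 2, 3, 4], 3)
```

This gives a concrete yardstick for what "perfect strategy" means in the swap environment: a learned policy is optimal when its trajectory lengths match the inversion counts.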

5. Why This Matters

This paper changes how we think about finding the shortest path.

  • Before: We thought we needed complex math or massive search trees to find the best route.
  • Now: We can just train a probabilistic model to be "efficient" (minimize steps), and the math guarantees it will find the shortest path automatically.

In a nutshell:
The authors discovered that if you teach an AI to hate wasting time, it becomes a master of finding the shortest path. They turned a complex navigation problem into a simple lesson on efficiency, proving that sometimes the best way to get somewhere is to stop taking detours.
