Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training

This paper introduces RapTB, a Rooted absorbed prefix Trajectory Balance objective that sharpens credit assignment for early prefixes, and SubM, a submodular replay strategy that mitigates distribution shift. Together, they address mode collapse and improve diversity in GFlowNet-based LLM fine-tuning.

Xi Wang, Wenbo Lu, Shengjie Wang

Published 2026-03-03

Imagine you are teaching a very talented but slightly confused robot chef to invent new recipes. Your goal isn't just to find one perfect dish; you want the robot to explore a huge variety of delicious, unique, and valid recipes, giving more attention to the ones that taste better.

This is what GFlowNets do: they are AI systems designed to generate diverse, high-quality solutions (like new molecules or sentences) rather than just picking the single "best" one.

However, the paper explains that these robot chefs often get stuck in a rut, in three ways:

  1. Stop too early: They decide the first few words of a sentence are the whole story (Prefix Collapse).
  2. Get obsessed with length: They only write very short or very long sentences, ignoring the middle ground (Length Bias).
  3. Only learn from their favorites: If they accidentally make one great dish, they keep making that exact same dish over and over, forgetting to try new things (Replay Bias).

The authors propose two new tools to fix this: RapTB and SubM.

1. RapTB: The "Rooted Guide" (Fixing the Learning Signals)

The Problem:
Imagine the robot chef is building a tower of blocks. The reward (a gold star) only comes at the very top when the tower is finished. If the tower falls, the robot gets no feedback on which block it placed wrong. It's like guessing in the dark. The robot learns to just copy the first few blocks of the few towers that happened to stand, leading to "mode collapse" (everyone building the same short tower).

The Solution (RapTB):
RapTB is like a wise mentor who doesn't wait until the tower is finished to give feedback.

  • Rooted: The mentor checks the tower starting from the very bottom (the root) every time.
  • Absorbed: If the robot builds a great top section, the mentor "absorbs" that success and says, "Hey, the bottom part you built earlier was also good because it led to this great top!"
  • Trajectory Balance: It ensures that every step the robot takes is consistent with the final goal, but it does so in a way that doesn't confuse the robot about when to stop building.

In Simple Terms: Instead of waiting for the final grade to tell the student they did well, RapTB gives them a "partial credit" score at every step, based on how well that step could lead to a great ending. This stops the robot from just copying the first few words of a lucky guess.
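To make this concrete, here is a minimal sketch of the standard Trajectory Balance (TB) loss that RapTB builds on. This is the vanilla objective, not the paper's RapTB modification; the function and its toy inputs are illustrative assumptions. TB asks that the forward policy's probability of building a trajectory, scaled by a learned partition estimate Z, match the reward of the finished object:

```python
import math

def trajectory_balance_loss(log_Z, log_pf_steps, log_pb_steps, log_reward):
    """Standard Trajectory Balance loss for one complete trajectory.

    log_Z        : learned log-partition estimate
    log_pf_steps : log P_F(s_t -> s_{t+1}) for each forward step
    log_pb_steps : log P_B(s_t | s_{t+1}) for each backward step
    log_reward   : log R(x) of the terminal object
    """
    lhs = log_Z + sum(log_pf_steps)       # flow claimed by the forward policy
    rhs = log_reward + sum(log_pb_steps)  # flow implied by the reward
    return (lhs - rhs) ** 2               # squared mismatch: zero when balanced

# A perfectly balanced toy trajectory has zero loss: the forward policy
# assigns probability 0.25 to a trajectory whose terminal reward is 0.25.
loss = trajectory_balance_loss(
    log_Z=0.0,
    log_pf_steps=[math.log(0.5), math.log(0.5)],  # P_F product = 0.25
    log_pb_steps=[0.0, 0.0],                      # deterministic backward policy
    log_reward=math.log(0.25),
)
print(round(loss, 6))  # 0.0
```

Because the squared error is computed only over whole trajectories, every intermediate step receives the same blunt signal, which is exactly the "guessing in the dark" problem RapTB's per-prefix credit is designed to fix.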

2. SubM: The "Curated Library" (Fixing the Memory)

The Problem:
Imagine the robot has a notebook (a replay buffer) where it writes down the recipes it tried. Usually, it just writes down the "best" recipes. But if it finds one amazing chocolate cake, it writes that down 100 times and forgets about the lasagna or the salad. The robot stops learning because its notebook is full of duplicates.

The Solution (SubM):
SubM is a smart librarian who curates the notebook.

  • Submodular: This is a fancy math word for "diminishing returns." It means the librarian knows that having 100 copies of the same chocolate cake adds zero value.
  • The Strategy: The librarian looks at the new recipes and asks: "Is this new cake different from what we already have? Does it have a good score? Is it a different length?"
  • The Result: The librarian keeps a mix of high-scoring recipes, but ensures they are all different from each other. If the notebook is full of short cakes, the librarian makes room for a long lasagna, even if the cake was slightly better.
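The "smart librarian" can be sketched as a greedy selection with diminishing returns. This is a generic facility-location-style greedy, not the paper's exact SubM objective; the `similarity` function and the score-plus-novelty gain are illustrative assumptions:

```python
def greedy_submodular_select(candidates, k, similarity):
    """Greedily pick k items, trading off score against redundancy.

    An item's marginal gain shrinks as similar items enter the buffer
    (diminishing returns), so near-duplicates add almost nothing.
    """
    selected = []
    while len(selected) < min(k, len(candidates)):
        def gain(item):
            # reward term + novelty term: how much new ground the item covers
            redundancy = max((similarity(item, s) for s in selected), default=0.0)
            return item["score"] + (1.0 - redundancy)
        best = max((c for c in candidates if c not in selected), key=gain)
        selected.append(best)
    return selected

# Toy buffer: two near-duplicate cakes and one very different lasagna.
recipes = [
    {"name": "cake A", "score": 0.90},
    {"name": "cake B", "score": 0.89},  # almost identical to cake A
    {"name": "lasagna", "score": 0.70}, # lower score, but novel
]

def sim(a, b):
    return 0.95 if "cake" in a["name"] and "cake" in b["name"] else 0.1

chosen = greedy_submodular_select(recipes, k=2, similarity=sim)
print([r["name"] for r in chosen])  # ['cake A', 'lasagna']
```

Note the result: the second cake scores higher than the lasagna on its own, but once one cake is in the buffer its marginal gain collapses, so the librarian shelves the lasagna instead.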

In Simple Terms: SubM forces the robot's memory to be diverse. It prevents the robot from getting stuck in a loop of repeating the same few "winning" ideas, ensuring it explores the whole kitchen.

The Big Picture: What Happens When You Combine Them?

When you use RapTB (the smart mentor) and SubM (the diverse librarian) together:

  • The Robot Stops Collapsing: It stops making the same short, boring sentences over and over.
  • It Explores More: It tries longer, more complex structures (like long molecules or full stories) without getting confused.
  • It Finds Better Solutions: Because it's exploring more ground, it finds more unique, high-quality recipes that it would have missed otherwise.

The Analogy Summary:
Think of training an AI like training a jazz band.

  • Old Way: The band only plays the one song that got the biggest applause last night, over and over again, getting worse and worse because they aren't practicing anything new.
  • RapTB: The conductor gives feedback on every single note played, not just at the end of the song, helping the musicians understand why a specific note worked.
  • SubM: The setlist is managed by a DJ who ensures the band plays a mix of fast, slow, loud, and quiet songs, rather than just playing the same hit song 50 times in a row.

The result? A band that can improvise, play complex solos, and keep the audience entertained with a fresh, diverse set of music.
