Imagine you are trying to solve a very difficult puzzle, like a complex math problem or a tricky coding challenge. You have a smart assistant (the AI) who is trying to figure it out.
The Problem: "Overthinking" and Getting Stuck
Usually, when we ask an AI to solve hard problems, we tell it to "think harder" by letting it generate more text. But there's a catch: if you just let it ramble on, it often gets stuck in a loop. It might start down a wrong path and keep digging deeper into that mistake, a phenomenon the authors call "overthinking."
To get a better answer, we usually try to make the AI generate many different attempts at once (like asking a room full of people to solve the puzzle and picking the best answer). But here's the problem: if the AI isn't trained well, all those attempts end up looking the same. They all take the exact same wrong path. It's like asking 100 people to solve a maze, but they all start walking in the exact same direction and hit the same dead end.
The Solution: The "Magic Switches" (Global Forking Tokens)
The authors of this paper came up with a clever way to force the AI to explore different paths without it getting confused. They introduced something they call "Global Forking Tokens."
Think of these as special "Magic Switches" or "Command Buttons" that you press before the AI starts thinking.
- Button A tells the AI: "Think like a strict mathematician."
- Button B tells the AI: "Think like a creative artist."
- Button C tells the AI: "Think like a cautious engineer."
In the past, the AI had to randomly guess which path to take, often failing to find the right "button" deep inside its thought process. This new method teaches the AI that Button A always leads to Strategy A, Button B always leads to Strategy B, and so on.
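The "button" is just a special token placed at the very start of the prompt. A minimal sketch of the idea (the token names and the example question here are invented for illustration; the paper's actual token vocabulary will differ):

```python
# Illustrative only: these token strings are made up for this explanation,
# not taken from the paper's implementation.
FORKING_TOKENS = ["<fork_1>", "<fork_2>", "<fork_3>"]

def build_prompt(question: str, token: str) -> str:
    """Prepend a global forking token so the model commits to one
    reasoning style before it writes its first word."""
    return f"{token} {question}"

# One prompt per button: same question, three different "switches".
prompts = [build_prompt("Solve: 2x + 3 = 11", t) for t in FORKING_TOKENS]
for p in prompts:
    print(p)
```

Because the switch appears at position zero, the fork between strategies happens globally, before generation starts, rather than the model stumbling onto a fork somewhere deep in its chain of thought.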
How They Taught the AI: Set Supervised Fine-Tuning (SSFT)
How do you teach an AI to respect these buttons? You can't just show it one answer. You have to show it a whole set of different, correct solutions.
Imagine you are a teacher with a class of students (the AI) and a stack of 4 different, correct ways to solve a math problem (the "traces"). You also have 6 different colored pens (the "buttons").
- The Old Way (Standard Training): You show the students the 4 solutions and say, "Here are some answers." The students get confused. They might mix up the styles, or they might all decide that "Red Pen" is the best way to solve everything, ignoring the other colors. They collapse into one boring style.
- The New Way (SSFT): The teacher uses a special matching game.
- The teacher looks at the 4 solutions and the 6 pens.
- They figure out the perfect match: "Solution 1 goes with the Red Pen," "Solution 2 goes with the Blue Pen," etc.
- They then teach the AI: "When you see the Red Pen, you must write like Solution 1. When you see the Blue Pen, you must write like Solution 2."
- Crucially, the teacher considers every possible pairing of solutions and pens before settling on the best overall match, ensuring the AI learns that different buttons trigger different, unique thinking styles.
This process is called Set Supervised Fine-Tuning (SSFT). It forces the AI to learn that "Button A" and "Button B" are not just random words; they are distinct keys that unlock completely different rooms in the AI's brain.
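The "matching game" above is a bipartite matching problem: pair each solution with the pen it fits best, with no pen used twice. Here is a toy sketch of that step. The cost numbers are invented stand-ins for how badly each trace fits each token (in the real method, something like the model's loss would play this role), and the brute-force search is just for clarity; production code would use the Hungarian algorithm:

```python
# Toy sketch of SSFT's matching step: rows are solution traces, columns are
# forking tokens ("pens"). Costs below are invented for illustration.
from itertools import permutations

def best_assignment(cost):
    """Try every injective trace -> token assignment and return the token
    index chosen for each trace, minimizing total cost."""
    n_traces, n_tokens = len(cost), len(cost[0])
    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n_tokens), n_traces):
        total = sum(cost[i][perm[i]] for i in range(n_traces))
        if total < best_total:
            best_total, best_perm = total, perm
    return list(best_perm)

# 2 traces x 3 tokens: trace 0 fits token 2 best, trace 1 fits token 0 best.
cost = [[5.0, 4.0, 1.0],
        [2.0, 6.0, 7.0]]
print(best_assignment(cost))  # → [2, 0]
```

Once the match is found, each trace is trained only behind its assigned token, which is what keeps "Red Pen" and "Blue Pen" from collapsing into one style.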
The Result: A Super-Organized Brain
Once the AI is trained with these "Magic Switches":
- Diversity: If you press "Button A," the AI thinks in a long, detailed way. If you press "Button B," it thinks in a short, punchy way. They don't look alike anymore.
- Accuracy: Because the AI isn't guessing which path to take, it can reliably access the "best" way to solve a problem for that specific question.
- Efficiency: You don't need to wait for the AI to "overthink" and wander off. You just press the right button, and it goes straight to the right strategy.
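At inference time, these properties turn best-of-N sampling into "one attempt per button" instead of N near-identical attempts. A hedged sketch, using a stub in place of a real model (the `generate` and `score` functions here are hypothetical placeholders, not the paper's API):

```python
def diverse_best_of_n(question, tokens, generate, score):
    """Generate one attempt per forking token, then keep the
    highest-scoring answer."""
    attempts = [generate(f"{t} {question}") for t in tokens]
    return max(attempts, key=score)

# Stub "model": each button deterministically yields a different style.
stub_outputs = {"<fork_1>": "short answer",
                "<fork_2>": "long detailed answer"}
generate = lambda prompt: stub_outputs[prompt.split()[0]]
score = len  # toy scorer: prefer the longer answer

print(diverse_best_of_n("Q?", ["<fork_1>", "<fork_2>"], generate, score))
```

The point of the sketch: diversity comes from the tokens themselves, not from sampling temperature, so the attempts are guaranteed not to all walk down the same maze corridor.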
Global Forking Policy Optimization (GFPO)
Finally, the authors added a short extra training stage (like a coach giving a pep talk) that teaches the AI which button to press for a specific kind of problem.
- If the problem is a geometry puzzle, the coach says, "Press Button 3!"
- If it's a logic riddle, the coach says, "Press Button 1!"
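The coach's job can be caricatured as a tiny learning loop: track which button earns the best reward on each kind of problem, then press that one. This is an illustrative bandit-style toy, not the paper's actual optimization objective, and all the names (`ButtonPicker`, the problem types, the buttons) are invented for this sketch:

```python
# Toy caricature of the "coach" stage: keep a running average reward for each
# (problem type, button) pair and pick the button with the best average.
from collections import defaultdict

class ButtonPicker:
    def __init__(self):
        self.totals = defaultdict(float)  # summed reward per (type, button)
        self.counts = defaultdict(int)    # number of tries per (type, button)

    def update(self, problem_type, button, reward):
        self.totals[(problem_type, button)] += reward
        self.counts[(problem_type, button)] += 1

    def pick(self, problem_type, buttons):
        def avg(b):
            c = self.counts[(problem_type, b)]
            return self.totals[(problem_type, b)] / c if c else 0.0
        return max(buttons, key=avg)

picker = ButtonPicker()
picker.update("geometry", "button_3", 1.0)  # button 3 solved the geometry puzzle
picker.update("geometry", "button_1", 0.0)  # button 1 did not
print(picker.pick("geometry", ["button_1", "button_3"]))  # → button_3
```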
In Summary
This paper is about teaching AI models to stop guessing and start intentionally choosing different thinking styles. Instead of hoping the AI randomly finds a good way to solve a problem, they gave it a remote control with specific buttons, each one guaranteed to trigger a unique, high-quality reasoning style. This makes the AI smarter, more diverse in its thinking, and much better at solving hard problems without getting stuck.