Imagine you are trying to teach a brilliant but inexperienced student (the Strong Model) how to solve complex puzzles, like navigating a virtual house or shopping online. Usually, you would need a world-class expert (a human) to show the student exactly what to do. But what if the expert is too busy, too expensive, or simply doesn't exist for these new, super-hard tasks?
This paper introduces a clever solution called Weak-to-Strong Generalization (W2SG). Instead of waiting for a human expert, we let a less capable student (the Weak Model) try to solve the problems first. Then, we teach the brilliant student by analyzing everything the weaker student did—both the times they succeeded and, crucially, the times they failed.
Here is the breakdown of their method using simple analogies:
1. The Problem: The "Expert" is Missing
In the past, to train AI, we needed humans to label data or give feedback. But as AI gets smarter than humans in some areas, we can't rely on humans to supervise them anymore. We need a way for a "strong" AI to learn from a "weak" AI without human help.
2. The Solution: Learning from the "Clumsy" Apprentice
The authors propose a three-step process:
Step A: The "Messy" Exploration (Trajectory Exploration)
Imagine the Weak Model is a clumsy apprentice sent into a giant maze (the environment).
- The apprentice tries to find the exit.
- Sometimes they find the door (Success).
- Sometimes they run into a wall, get lost, or pick up the wrong key (Failure).
- Because the apprentice isn't perfect, they take many different paths, creating a huge pile of "trial-and-error" logs.
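The exploration step can be sketched with toy stand-ins. Everything here is illustrative, not from the paper: the "maze" is solved by one exact action sequence (GOAL), and weak_policy is the clumsy apprentice who is only right some of the time.

```python
import random

# Hypothetical toy environment: the maze's exit requires exactly this sequence.
GOAL = ["forward", "left"]
ACTIONS = ["forward", "left", "right"]

def weak_policy(step_index, rng):
    # The clumsy apprentice: correct 60% of the time, random otherwise.
    if rng.random() < 0.6:
        return GOAL[step_index]
    return rng.choice(ACTIONS)

def explore(num_episodes=50, seed=0):
    """Collect the weak model's trial-and-error logs, successes and failures alike."""
    rng = random.Random(seed)
    trajectories = []
    for _ in range(num_episodes):
        actions = [weak_policy(i, rng) for i in range(len(GOAL))]
        trajectories.append({"actions": actions, "success": actions == GOAL})
    return trajectories

logs = explore()
```

The point is the shape of the output: a large, messy pile of logged attempts, deliberately including the failures.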
Step B: Building the "Tree of Mistakes and Wins" (Trajectory Trees)
This is the paper's biggest innovation. Instead of just looking at the final result (Did they win or lose?), they organize all the apprentice's attempts into a Tree.
- The Analogy: Imagine a family tree, but instead of ancestors, it's a map of decisions.
- The Magic: The tree merges paths that look the same. If the apprentice walked down the hallway and turned left in 10 different attempts, the tree records that as one branch.
- The Divergence: The tree highlights exactly where the paths split. For example, "In 5 attempts, the apprentice turned left and found a treasure. In 5 other attempts, they turned right and hit a wall."
- Why it matters: This structure captures the relationship between actions. It shows the strong model: "Turning left after seeing the red door is good. Turning right after seeing the red door is bad." It's much smarter than just saying "Win/Lose."
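A minimal sketch of the tree-building idea, assuming the simple trajectory format above (a list of actions plus a success flag; the node layout is my assumption, not the paper's data structure). Attempts that share a prefix merge into one branch, and each node counts how many attempts passed through it and how many of those eventually won, so divergence points stand out.

```python
def build_tree(trajectories):
    """Merge trajectories that share a prefix of actions into one tree.

    Each node records visits (attempts passing through) and wins
    (those attempts that eventually succeeded).
    """
    root = {"children": {}, "visits": 0, "wins": 0}
    for traj in trajectories:
        node = root
        node["visits"] += 1
        node["wins"] += traj["success"]
        for action in traj["actions"]:
            node = node["children"].setdefault(
                action, {"children": {}, "visits": 0, "wins": 0})
            node["visits"] += 1
            node["wins"] += traj["success"]
    return root

# Toy logs: "L" then "A" succeeded twice; "L" then "B" failed once.
logs = [
    {"actions": ["L", "A"], "success": True},
    {"actions": ["L", "B"], "success": False},
    {"actions": ["L", "A"], "success": True},
]
tree = build_tree(logs)
```

The shared prefix "L" merges into a single branch (3 visits), and the divergence at the second step is explicit: "A" won 2 of 2, "B" won 0 of 1, which is exactly the "where did the paths split?" signal the paper cares about.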
Step C: The "Smart Coach" (MCTS & Fine-Tuning)
Now, the Strong Model (the brilliant student) looks at this Tree.
- The Coach's Tool (MCTS): The authors use a technique called Monte Carlo Tree Search. Think of this as a super-efficient coach who scans the entire Tree of the apprentice's attempts. The coach doesn't just pick the "best" path; they calculate the probability of success for every single branch.
- The Lesson: The coach tells the Strong Model: "Don't just copy the winning path. Learn why the winning path worked and why the losing path failed. Notice that the only difference between the win and the loss was one specific action at step 3."
- The Result: The Strong Model learns to avoid the specific mistakes the Weak Model made and adopts the successful strategies, effectively "distilling" the wisdom from the weak attempts.
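To make the "smart coach" concrete, here is a sketch of the MCTS selection rule (UCB1) applied to one node of such a tree. This is only the scoring-and-selection step; a full MCTS also expands and simulates, but here the weak model's rollouts already fill the tree. The node layout and the example numbers are illustrative assumptions.

```python
import math

def best_action(node, c=1.0):
    """Score each branch with UCB1: observed success rate plus an
    exploration bonus for rarely-tried branches, then pick the best."""
    def ucb(child):
        exploit = child["wins"] / child["visits"]            # success rate
        explore = c * math.sqrt(math.log(node["visits"]) / child["visits"])
        return exploit + explore
    return max(node["children"], key=lambda a: ucb(node["children"][a]))

# Hand-built node: after the red door, "left" won 5/5 and "right" won 0/5.
red_door = {"visits": 10, "wins": 5, "children": {
    "left":  {"visits": 5, "wins": 5, "children": {}},
    "right": {"visits": 5, "wins": 0, "children": {}},
}}
choice = best_action(red_door)  # -> "left"
```

Walking the tree with this rule yields the preferred path; the (state, action) pairs along it, and the low-scoring branches they were preferred over, become the material for fine-tuning the strong model.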
3. The Surprising Outcome
Usually, you would expect a student to learn little from a teacher who is weaker than they are. But here, in some cases the Strong Model actually performed better than it would have if trained directly on expert human demonstrations!
Why?
- Human experts often only show the "perfect" path. They hide their mistakes.
- The Weak Model shows everything: the dead ends, the wrong turns, and the confusion.
- By studying the Failure Trajectories (the dead ends), the Strong Model learns what not to do, which is often more valuable than just knowing what to do. It's like learning to drive by watching a video of every car crash in the city, not just the videos of people driving perfectly.
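One way to see how failures become training signal: walk the tree and label every branch by its observed success rate, keeping the dead ends as explicit negative examples rather than discarding them. The node layout and the 0.5 threshold are illustrative assumptions, not details from the paper.

```python
def extract_lessons(node, prefix=(), threshold=0.5):
    """Label each branch "good" or "bad" from its observed success rate.

    The "bad" entries are kept on purpose: they teach the strong model
    what NOT to do after a given prefix of actions.
    """
    lessons = []
    for action, child in node["children"].items():
        rate = child["wins"] / child["visits"]
        label = "good" if rate >= threshold else "bad"
        lessons.append((prefix, action, label))
        lessons.extend(extract_lessons(child, prefix + (action,), threshold))
    return lessons

# Hand-built tree: "left" won 2/2 attempts, "right" won 0/1.
tree = {"visits": 3, "wins": 2, "children": {
    "left":  {"visits": 2, "wins": 2, "children": {}},
    "right": {"visits": 1, "wins": 0, "children": {}},
}}
lessons = extract_lessons(tree)
# -> [((), 'left', 'good'), ((), 'right', 'bad')]
```

A human demonstration would contain only the first tuple; the weak model's logs also supply the second, which is the "every car crash in the city" part of the lesson.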
Summary
The paper argues that to build super-intelligent AI, we don't need to wait for human teachers. We can use a "weak" AI to generate a massive library of attempts (a Trajectory Tree). By analyzing the structure of these attempts—specifically where the good paths and bad paths diverge—we can train a "strong" AI to be smarter than the human experts who originally trained the weak AI.
In a nutshell: It's about turning a pile of "failed attempts" by a novice into a structured textbook that teaches a genius how to succeed.