Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory?

This paper investigates whether large language models can collaborate on a shared reasoning trajectory, a setting the authors call "off-trajectory reasoning." It finds that even high-performing models struggle to recover from misleading traces or to build on guidance from stronger collaborators, and it identifies specific post-training factors, such as teacher model quality and data selection, that critically shape these collaborative capabilities.

Aochong Oliver Li, Tanya Goyal

Published 2026-03-04

Imagine you are teaching a group of students how to solve a very difficult math problem. In the traditional classroom (which is how most AI models are trained today), each student works alone. They are given a problem, they think through it step-by-step on their own, and then they write down the answer. If they get it right, they get a gold star.

This paper asks a different question: What happens if we put these students in a group project where they have to share their thought process in real-time?

The researchers call this "Off-Trajectory Reasoning." It's like asking: Can a smart student look at a friend's messy, confusing, or even wrong notes, ignore the bad parts, and still solve the problem? Or, can a struggling student look at a genius's notes and actually learn from them to solve a problem they couldn't do alone?

To test this, the researchers created two fun "games" (tests) to see how well 15 different AI models (ranging from small, simple ones to massive, complex ones) handle these group dynamics.

The Two Games

1. The "Red Herring" Game (Recoverability)

Imagine you are solving a math problem. You are doing great! Then, suddenly, a classmate whispers a completely wrong idea into your ear, like, "Wait, actually, the answer is 350 years old because of carbon dating!" (even though you are solving an algebra equation).

  • The Test: Can you ignore that weird distraction, realize it doesn't make sense, and go back to your original correct path?
  • The Shocking Result: The researchers found that the smartest students (the biggest AI models) were actually the most fragile. When a smart model got distracted, it often got confused and gave up or followed the wrong path. The "smaller," less famous models were surprisingly better at shaking off the distraction and staying on track. It's like a genius student getting so flustered by a silly comment that they forget how to do basic math, while a regular student just shrugs it off and keeps working.

2. The "Mentor" Game (Guidability)

Now, imagine you are stuck on a problem you can't solve. A genius student hands you their first few steps of the solution. They show you the right way to start.

  • The Test: Can you take those correct first steps and finish the rest of the problem on your own?
  • The Shocking Result: Almost none of the models could do this effectively. Even when the "genius" gave them the perfect start, the struggling models couldn't build on it. They seemed to hit a "ceiling." It's like giving a student the first three lines of a poem written by Shakespeare; they still can't finish the poem in Shakespeare's style. They just couldn't leverage the help to go beyond their own limits.

The "Why" Behind the Failure

The researchers didn't just stop at the results; they played detective to find out why this was happening. They looked at how these AI models were trained.

  1. The "Bad Teacher" Effect: When a small model is trained by copying a "Teacher" model (a process called distillation), it doesn't just copy the answers; it copies the habits. If the Teacher model is easily distracted or fragile, the Student model inherits those bad habits, even if the Student is only shown the correct answers during training.
  2. The "Practice Makes Perfect" vs. "Practice Makes Robust" Problem: Most AI models are trained using a method called Supervised Fine-Tuning (SFT), which is like showing them a textbook of perfect solutions. This makes them great at solo tests. But the researchers found that using Reinforcement Learning (RL)—which is like letting the model try, fail, get corrected, and try again—made them much better at recovering from mistakes. It taught them how to fix errors, not just what the right answer looks like.
  3. The "Less is More" Trap: Some recent trends suggest that training AI on a tiny amount of "super high-quality" data is better than using a lot of data. The researchers found that while this made the models good at solo tests, it made them very unstable in group settings. They became unpredictable; sometimes they worked great, sometimes they failed miserably.

The Big Takeaway

The paper concludes that being good at taking a solo test does not mean you are good at collaborating.

Currently, we are building AI models that are incredible at working alone but poor at working with others, whether other AI models or humans. They are easily confused by wrong turns and cannot really build on a partner's help.

The Lesson for the Future:
If we want AI to work in teams—where a human and an AI, or a big AI and a small AI, work together to solve problems—we can't just train them to be "smart" in isolation. We need to specifically train them to be resilient (able to bounce back from distractions) and coachable (able to learn from others' hints). We need to teach them that it's okay to get confused, and that the best way to learn is not just by memorizing the right answer, but by practicing how to recover when things go wrong.
