The Big Picture: The "Super-Student" vs. The "Real World"
Imagine you have a brilliant student (the AI Agent) who is being trained to solve complex problems. To make them smarter, you use a method called Reinforcement Fine-Tuning (RFT). Think of RFT as a strict coach who gives the student immediate feedback: "Good job!" (a reward) when they solve a puzzle correctly, and "Try again" (a penalty) when they fail.
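That "immediate feedback" boils down to a single scalar reward per attempt. Here is a minimal sketch of an outcome-style reward; the paper's exact reward design may differ, and this function is purely illustrative:

```python
def outcome_reward(solved: bool) -> float:
    """Minimal sketch of an RFT-style outcome reward:
    the 'coach' gives +1 for a correct solution and -1 otherwise.
    (Hypothetical values; the actual reward shaping is the paper's choice.)"""
    return 1.0 if solved else -1.0
```

During fine-tuning, this scalar is essentially all the policy update sees: behaviors that lead to +1 get reinforced, and the rest get discouraged.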
The big question this paper asks is: If we train this student to be a master at solving puzzles in a specific classroom, will they still be smart enough to walk into a completely different, unfamiliar classroom and solve puzzles there?
Most previous studies only tested the student in the same classroom they practiced in. This paper is the first to send the student out into the wild to see if they can actually generalize their skills.
The Three Experiments (The "Axes")
The researchers tested the AI agents in five different "worlds" (environments), ranging from a virtual web store to a text-based video game. They looked at three specific scenarios:
1. The "Easy vs. Hard" Test (Within-Environment)
- The Analogy: Imagine training a chess player on easy puzzles, then testing them on hard puzzles.
- The Finding: It works great! If you train the agent on a mix of easy and hard tasks within the same environment, it learns the "rules of the game" so well that it can handle harder versions of the same task.
- The Secret Sauce: They found that Curriculum Learning works best. This means teaching the student the easy stuff first, then moving to the hard stuff. It's like learning to ride a bike with training wheels before trying to ride on a mountain trail. This method made the agents significantly smarter and faster.
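The curriculum idea can be sketched in a few lines. The task names, difficulty scores, and `train_on` stub below are hypothetical stand-ins, not details from the paper:

```python
# Hypothetical task pool with hand-assigned difficulty scores
# (illustrative names and values only).
tasks = [
    {"name": "multi-step-checkout", "difficulty": 3},
    {"name": "find-item", "difficulty": 1},
    {"name": "compare-prices", "difficulty": 2},
]

def train_on(task):
    """Stand-in for one RFT training stage on a single task."""
    print(f"training on {task['name']}")

# Curriculum learning: present tasks in order of increasing difficulty
# instead of sampling them uniformly at random.
for task in sorted(tasks, key=lambda t: t["difficulty"]):
    train_on(task)
```

The only real machinery here is the sort: the agent sees "find-item" (training wheels) before "multi-step-checkout" (the mountain trail).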
2. The "New City" Test (Cross-Environment)
- The Analogy: You train a taxi driver to navigate the streets of New York City perfectly. Then, you drop them off in Tokyo. Do they know how to drive there?
- The Finding: It's a mixed bag.
- Success: If the new city is similar (e.g., training on a "Search" task and testing on a "Web Shopping" task), the agent does well. They learned the concept of searching, not just the specific buttons.
- Failure: If the new city is too different (e.g., training on a game where you are handed a list of allowed moves, then testing on a game where you must come up with moves yourself), the agent gets confused.
- The Trap: Some agents became "lazy." In one environment (BabyAI), the agent was given a list of valid moves at every step. It learned to just pick from the list. When dropped into a new environment where it had to think of the moves itself, it crashed because it never learned to think independently.
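The difference between the two settings is easiest to see in how the agent's prompt is built. This is a hypothetical sketch of the two prompt formats, not the paper's actual prompts:

```python
def build_prompt(observation, valid_actions=None):
    """Build one step's prompt for the agent.

    With valid_actions (BabyAI-style), the model only has to pick from
    a list. With None, it must generate an action from scratch -- the
    setting where list-trained agents broke down.
    """
    prompt = f"Observation: {observation}\n"
    if valid_actions is not None:
        prompt += "Valid actions: " + ", ".join(valid_actions) + "\n"
    prompt += "Action:"
    return prompt

# Training-time prompt: effectively multiple choice.
print(build_prompt("a red door ahead", ["go forward", "turn left", "open door"]))

# Test-time prompt in a new environment: no cheat sheet.
print(build_prompt("a locked chest in the corner"))
```

An agent trained only on the first format can learn to copy an item from the "Valid actions" line; when that line disappears, the shortcut disappears with it.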
3. The "Back-to-School" Test (Sequential Training)
- The Analogy: You train the student in New York, then immediately send them to Tokyo, then to Paris. Do they remember how to drive in New York while learning Tokyo?
- The Finding: This is the most exciting part.
- No Amnesia: Unlike agents trained with traditional methods, where learning a new skill often wipes out the old one (Catastrophic Forgetting), these RL-trained agents didn't forget. They could learn a new environment and still remember how to handle the old one.
- The Order Matters: Just like in the "Easy vs. Hard" test, the order matters. If you train them on the "easy" environment first, then the "hard" one, they generalize better. If you start with the hard one, they struggle to adapt later.
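The "no amnesia" check amounts to re-evaluating the agent on every environment it has seen after each new training stage. The sketch below uses toy stand-ins for the agent, training, and evaluation (the environment names are real benchmarks mentioned in this summary, but the scores are illustrative, not the paper's results):

```python
def sequential_training(agent, envs, train_fn, eval_fn):
    """Train on each environment in order, re-scoring all seen so far."""
    history, seen = [], []
    for env in envs:
        train_fn(agent, env)  # one RFT stage on the new environment
        seen.append(env)
        # A drop in an earlier environment's score here would be
        # catastrophic forgetting.
        history.append({e: eval_fn(agent, e) for e in seen})
    return history

# Toy stand-ins, just to make the sketch executable: "training" adds a
# skill, and evaluation scores 1.0 if the skill was ever learned.
agent = {"skills": set()}
train = lambda a, env: a["skills"].add(env)
evaluate = lambda a, env: 1.0 if env in a["skills"] else 0.0

history = sequential_training(agent, ["webshop", "babyai", "textworld"], train, evaluate)
```

In this toy model earlier skills trivially persist; the paper measures the same quantity with real benchmarks, where persistence is the surprising finding.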
Why Do Agents Fail? (The "Failure Modes")
The researchers looked closely at why the agents failed when they moved to new environments. They found a few common personality flaws:
- The "Overconfident Expert" (Confirmation Bias): Once the agent thinks it has the answer, it stops checking. It's like a student who guesses "A" on a test and refuses to double-check their math, even when the teacher hints they are wrong.
- The "Gambler" (Guessing/Fabrication): Instead of using tools (like a search engine) to find the truth, the agent just makes things up because it's too confident in its own memory.
- The "Memory Loss" (State Inconsistency): In complex, long tasks, the agent forgets what it did five steps ago. It's like trying to solve a maze but forgetting you already turned left three times.
The Takeaway for the Future
This paper tells us that Reinforcement Learning is a powerful tool for making AI agents more adaptable, but we have to be careful about how we train them.
- Don't just train on one thing: To build a truly general AI, we need to train it on a mix of environments.
- Teach them to think, not just follow: If we give the AI a "cheat sheet" of allowed moves during training, it might fail when the cheat sheet is gone.
- Start easy, then get hard: A step-by-step curriculum is the best way to build a robust, general-purpose agent.
In short: RL can turn a narrow specialist into a generalist, but only if we train them like a human student—starting with the basics, mixing up the challenges, and teaching them to verify their own work rather than just guessing.