GEM: A Gym for Agentic LLMs

Imagine you want to teach a robot how to play chess, solve math problems, or write computer code. In the past, we taught robots by showing them thousands of static examples, like flashcards. But the real world isn't a stack of flashcards; it's a dynamic, messy playground where you have to try things, fail, learn, and try again.

This paper introduces GEM (General Experience Maker), a new "playground" designed specifically to help Large Language Models (LLMs) learn by doing, rather than just by reading.

Here is a breakdown of the paper using simple analogies:

1. The Problem: The "Flashcard" Trap

Currently, most AI training is like studying for a test using only flashcards. You see a question, you give an answer, and you get a score. It's a single-turn interaction.

The Limitation: Real life is a multi-turn conversation. If you are playing a game of chess, you make a move, the opponent responds, you think, you make another move. If you are debugging code, you write a line, it crashes, you fix it, and run it again.
The Issue: Many current AI training methods (like GRPO) are great at flashcards but terrible at long, complex games. They struggle to figure out which specific move in a long chain of events was the "good" one and which was the "bad" one.

2. The Solution: GEM (The Gym)

The authors built GEM, which aims to be for agentic LLMs what OpenAI Gym was for traditional reinforcement learning.

The Analogy: Think of OpenAI Gym as a standardized gym with treadmills, weights, and punching bags that all look the same so you can test any robot on them. GEM is that same gym, but for "Agent" AIs (AIs that can act, not just talk).
What's inside? GEM comes pre-loaded with over 100 different "exercises":
- Games: Like Guess the Number or Sudoku, where the AI has to think step-by-step.
- Tools: The AI can use a calculator (Python), search the web, or type commands into a fake computer terminal.
- Reasoning: Math problems and logic puzzles.

3. The Secret Sauce: "Return Batch Normalization" (ReBN)

The paper introduces a new way to teach the AI, which they call REINFORCE with ReBN.

The Analogy: Imagine a student taking a long exam.
- Old Way (GRPO): The teacher waits until the very end of the exam to give a single grade (A, B, or C). The student doesn't know which specific answer was wrong.
- New Way (ReBN): The teacher gives feedback after every question. But here's the trick: instead of just saying "Right" or "Wrong," the teacher looks at how the student did on all the questions in the whole batch and says, "You did better than average on this one, but worse than average on that one."
Why it matters: This helps the AI understand which step in a long chain of reasoning was the key to success. In the paper's experiments, this approach matches or beats previous methods (PPO and GRPO) on multi-step tasks.

4. The "Discount Factor" (The Patience Meter)

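The paper shows that you can tune how "patient" the AI is using a setting called the discount factor ($\gamma$), a number between 0 and 1 that controls how much a future reward is worth compared with an immediate one.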

The Analogy: Imagine you are playing a game of "Guess the Number."
- If you set the AI to be impatient ($\gamma$ below 1, so later rewards count for less), it learns to solve the puzzle in the fewest moves possible (using a smart strategy called "binary search"). It realizes, "I want the reward now, so I'll guess efficiently."
- If you set the AI to be patient ($\gamma$ equal to 1, so a late reward is worth as much as an early one), it doesn't care how many turns it takes, as long as it eventually wins. It might guess randomly and take 50 turns to get the answer.
The Discovery: By adjusting this "patience meter," the authors could steer the AI toward the most efficient strategies, something that methods with a fixed $\gamma = 1$, such as GRPO, give it no incentive to learn.
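To see why impatience rewards efficiency, here is a small sketch. It assumes a number range of 1 to 50, a reward of 1 for the correct guess and 0 otherwise, and a made-up "slow" episode of 20 turns; none of these values are taken from the paper.

```python
def binary_search_turns(secret, low=1, high=50):
    """Count the guesses a bisection strategy needs to find `secret` in [low, high]."""
    turns = 0
    while low <= high:
        turns += 1
        guess = (low + high) // 2
        if guess == secret:
            return turns
        if guess < secret:
            low = guess + 1   # secret is in the upper half
        else:
            high = guess - 1  # secret is in the lower half
    raise ValueError("secret is outside the search range")


def discounted_return(turns, gamma, final_reward=1.0):
    """Value of the episode seen from turn 1 when the only reward arrives on the last turn."""
    return gamma ** (turns - 1) * final_reward


fast_turns = binary_search_turns(37)  # bisection strategy
slow_turns = 20                       # hypothetical inefficient episode

for gamma in (1.0, 0.9):
    fast = discounted_return(fast_turns, gamma)
    slow = discounted_return(slow_turns, gamma)
    print(f"gamma={gamma}: fast={fast:.3f}, slow={slow:.3f}")
```

With $\gamma = 1$ both episodes are worth the same, so nothing pushes the agent toward the faster strategy. With $\gamma = 0.9$ the faster episode is worth more, and that gap is the learning signal.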

5. Why This is a Big Deal

It's a Universal Adapter: GEM works with five different major AI training frameworks. It's like a universal power plug that fits into any wall socket. Researchers don't have to rebuild their tools every time they want to test a new idea.
It's a Benchmark: Before this, everyone tested their AI on different, custom-made games, making it impossible to compare who was actually the best. GEM provides a standard "scoreboard" so we can finally see which algorithms are truly superior.
It's Ready for the Real World: By testing the AI on things like using a database, searching the web, and writing code, GEM prepares AI for the kind of complex, multi-step jobs we actually want them to do in the future.

Summary

GEM is a standardized, open-source playground that lets AI agents learn by interacting with complex, multi-step environments. It introduces a smarter teaching method (ReBN) that helps AI figure out which specific actions led to success in long chains of events. By providing a common ground for researchers to test and compare their ideas, GEM aims to accelerate the development of truly autonomous, intelligent agents that can plan, reason, and use tools just like humans do.

1. Problem Statement

The current paradigm for training Large Language Models (LLMs) is shifting from static datasets to experience-based learning via Reinforcement Learning (RL). However, existing research and infrastructure face significant limitations:

Oversimplification of Interactions: Most RL for LLMs focuses on single-turn tasks (e.g., solving a math problem in one go). This fails to capture the complexity of multi-turn, long-horizon interactions required for true agentic behavior (e.g., iterative debugging, tool use, strategic planning).
Algorithmic Mismatch: Popular algorithms like GRPO (Group Relative Policy Optimization) are highly effective for single-turn tasks but are fundamentally incompatible with full multi-turn RL settings. GRPO relies on trajectory-level rewards and assumes a discount factor of $\gamma=1$, so it gives agents no incentive to finish in fewer turns and provides no fine-grained credit assignment for intermediate steps.
Lack of Standardized Infrastructure: Unlike traditional RL, which benefited from OpenAI Gym, agentic LLM research has no unified, open-source framework. Researchers often build bespoke environments that are tightly coupled with specific training code, making fair comparison and reproducibility difficult.
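The algorithmic mismatch described above can be made concrete with a small sketch. It shows only the group-relative advantage step of GRPO (no clipping or KL term), and the rewards, turn counts, and `eps` constant are illustrative assumptions rather than values from the paper.

```python
from statistics import mean, pstdev


def grpo_turn_advantages(episode_rewards, turns_per_episode, eps=1e-8):
    """Group-relative advantage per episode, copied unchanged to every turn of that episode."""
    mu = mean(episode_rewards)
    sigma = pstdev(episode_rewards)
    advantages = []
    for reward, n_turns in zip(episode_rewards, turns_per_episode):
        a = (reward - mu) / (sigma + eps)
        # One scalar per trajectory: every turn receives the same credit.
        advantages.append([a] * n_turns)
    return advantages


# Four rollouts of the same task: two succeed (in 3 and 9 turns), two fail.
rewards = [1.0, 1.0, 0.0, 0.0]
turns = [3, 9, 9, 4]
adv = grpo_turn_advantages(rewards, turns)

print(adv[0][0], adv[1][0])  # identical, although one success was three times faster
```

Because the advantage depends only on the final trajectory reward, the 3-turn and 9-turn successes are reinforced equally, and no individual turn is singled out as better or worse than another.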

2. Methodology: The GEM Framework

The authors introduce GEM (General Experience Maker), an open-source environment simulator designed to bridge the gap between traditional RL infrastructure and modern agentic LLMs.

A. Environment Design & Interface

Standardized API: GEM mimics the OpenAI Gym interface (reset(), step()) but is optimized for LLMs.
Diverse Task Suite: It includes over 100 environments across 7 categories, including:
- Games: Multi-turn text games (e.g., Minesweeper, Sudoku, Hangman).
- ReasoningGym: Single-turn verifiable logic tasks.
- Math & Code: Problems requiring Chain-of-Thought (CoT) and code generation.
- QA: Knowledge-intensive questions requiring search tools.
- Terminal: Containerized environments for executing shell commands.
Tool Integration: GEM supports modular tools (Python execution, Search, MCP-compatible external tools). Crucially, it treats tool usage as part of the multi-turn interaction loop, allowing agents to learn when and how to call tools.
Vectorized Execution: Supports asynchronous parallel execution with autoreset mechanisms. This streamlines data collection by automatically resetting terminated episodes, allowing for continuous batch generation without complex user-side logic.
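The interface and autoreset ideas above can be sketched with a toy environment. `ToyGuessEnv` and `collect` are hypothetical stand-ins written for this illustration, not GEM's actual classes, and the loop below is synchronous for simplicity, whereas GEM is described as supporting asynchronous execution.

```python
import random


class ToyGuessEnv:
    """Hypothetical text environment with a Gym-style reset()/step() interface."""

    def __init__(self, low=1, high=20, max_turns=8, seed=0):
        self.low, self.high, self.max_turns = low, high, max_turns
        self.rng = random.Random(seed)

    def reset(self):
        self.secret = self.rng.randint(self.low, self.high)
        self.turn = 0
        return f"Guess a number between {self.low} and {self.high}.", {}

    def step(self, action):
        # The action is the model's whole text response for this turn.
        self.turn += 1
        try:
            guess = int(action.strip())
        except ValueError:
            guess = None
        terminated = guess == self.secret
        truncated = not terminated and self.turn >= self.max_turns
        if guess is None:
            obs = "Please answer with an integer."
        elif terminated:
            obs = "Correct!"
        elif guess < self.secret:
            obs = "Too low."
        else:
            obs = "Too high."
        reward = 1.0 if terminated else 0.0
        return obs, reward, terminated, truncated, {}


def collect(envs, policy, num_steps):
    """Step several environments in lockstep, resetting any that finish (autoreset)."""
    observations = [env.reset()[0] for env in envs]
    transitions = []
    for _ in range(num_steps):
        for i, env in enumerate(envs):
            action = policy(observations[i])
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            transitions.append((i, observations[i], action, reward, done))
            # Autoreset: start a fresh episode so the batch never stalls.
            observations[i] = env.reset()[0] if done else obs
    return transitions


policy_rng = random.Random(0)
envs = [ToyGuessEnv(seed=s) for s in range(3)]
batch = collect(envs, lambda obs: str(policy_rng.randint(1, 20)), num_steps=10)
print(len(batch), "transitions collected")
```

The caller never checks for finished episodes: the collection loop keeps every slot busy, which is the convenience the autoreset mechanism is meant to provide.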

B. Algorithmic Innovation: REINFORCE + ReBN

The paper proposes a baseline algorithm that addresses the limitations of GRPO in multi-turn settings:

Action Definition: Treats a full response (sequence of tokens) as a single action, rather than individual tokens. This avoids the "context explosion" of token-level RL while maintaining multi-turn capability.
Return Batch Normalization (ReBN):
- Standard REINFORCE can suffer from high variance and poor convergence in multi-turn settings.
- The authors introduce ReBN, which normalizes the returns ($G_t$) across the entire batch of transitions before computing the policy gradient.
- Formula: $A_{\text{ReBN},t} = \frac{G_t - \text{mean}(G)}{\text{std}(G)}$, where the mean and standard deviation are taken over all returns in the batch.
- Advantage: Unlike GRPO, ReBN is compatible with dense per-turn rewards and arbitrary discount factors (including $\gamma < 1$). This allows the agent to learn not just what to do, but how efficiently to do it (e.g., solving a problem in fewer turns).
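The normalization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the use of the population standard deviation, the `eps` stabilizer, and the example rewards are assumptions.

```python
from statistics import mean, pstdev


def discounted_returns(rewards, gamma):
    """Per-turn returns G_t = r_t + gamma * G_{t+1} for one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def rebn_advantages(batch_rewards, gamma=0.9, eps=1e-8):
    """Normalize returns over every transition in the batch, keeping the episode grouping."""
    per_episode = [discounted_returns(r, gamma) for r in batch_rewards]
    flat = [g for episode in per_episode for g in episode]
    mu, sigma = mean(flat), pstdev(flat)
    return [[(g - mu) / (sigma + eps) for g in episode] for episode in per_episode]


# Per-turn rewards for three episodes: a fast success, a slow success, a failure.
batch = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
advantages = rebn_advantages(batch, gamma=0.9)

for episode in advantages:
    print([round(a, 2) for a in episode])
```

Each turn receives its own advantage, and with $\gamma < 1$ the first turn of the fast success scores higher than the first turn of the slow one. In a REINFORCE update, each advantage would weight the log-probability of the full response produced at that turn.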

3. Key Contributions

GEM Framework: A unified, decoupled library providing a standardized interface for agentic LLMs, supporting asynchronous vectorization, modular tool wrappers, and compatibility with five major RL training frameworks (Oat, Verl, OpenRLHF, ROLL, RL2).
Algorithmic Baseline: Introduction of REINFORCE with ReBN, a simple yet robust algorithm that outperforms or matches PPO and GRPO in multi-turn settings without requiring a learned critic (value function) or expensive tree-based sampling.
Comprehensive Benchmarking: The first "apples-to-apples" comparison of PPO, GRPO, and REINFORCE variants across 24 diverse environments, revealing that GRPO fails in dense-reward multi-turn scenarios while REINFORCE+ReBN excels.
Evaluation Toolkit: GEM serves as a unified evaluation platform for strong LLMs (GPT-5, Gemini-2.5-Pro, Claude-Sonnet-4) on complex tasks like MCP database operations and Terminal-Bench interactions.

4. Experimental Results

The authors conducted extensive experiments using Qwen3-based models (1.7B and 4B parameters):

Algorithm Comparison (Figure 4):
- Single-turn tasks: GRPO performs well.
- Multi-turn tasks (Games, Tool-use): GRPO struggles because it assigns the same advantage to every step of a trajectory. PPO performs well but requires learning a critic. REINFORCE + ReBN consistently achieves the best or comparable performance across all environments, offering a favorable trade-off between performance and computational cost.
Impact of Discount Factor ( $\gamma$ ) (Figure 5a):
- In the GuessTheNumber task, setting $\gamma < 1$ (e.g., 0.9) successfully incentivizes the agent to learn binary search (optimal strategy with ~5.6 turns).
- With $\gamma \approx 1$ (GRPO style), agents fail to minimize turns, often exhausting the trial budget. This indicates that $\gamma < 1$ is critical for learning efficiency in agentic tasks.
Tool Integration (Tables 1 & 2):
- RL training significantly improves performance on Math and QA tasks compared to base models.
- Agents equipped with tools (Python for Math, Search for QA) achieve the highest accuracy, demonstrating GEM's ability to facilitate tool-augmented learning.
Generalization (Figure 6): Agents trained on Sudoku showed positive transfer to other ReasoningGym tasks (e.g., Circuit Logic, Needle-in-Haystack), indicating cross-task generalization capabilities.
Multi-Agent Evaluation (Figure 12): In TAU-bench retail simulations, stronger user agents (simulated by stronger LLMs) consistently improved the success rates of assistant agents, highlighting the importance of realistic user simulation in multi-agent RL.

5. Significance and Future Impact

Paradigm Shift: GEM facilitates the transition from "static dataset" training to "experience-based" agentic learning, enabling LLMs to master long-horizon planning and iterative refinement.
Democratization: By providing a decoupled, open-source framework with single-file integration scripts, GEM lowers the barrier to entry for researchers, allowing them to focus on algorithmic innovation rather than environment engineering.
Algorithmic Insight: The paper's results indicate that dense rewards and discount factors below 1 are important for complex agentic tasks. It challenges the dominance of GRPO in multi-turn settings and promotes REINFORCE-based approaches with normalization techniques.
Standardization: GEM aims to become the "OpenAI Gym" for the era of agentic LLMs, ensuring that future research can be compared fairly and reproducibly across diverse, complex environments.