Imagine you are trying to teach a brilliant but inexperienced intern how to run a busy customer support desk for a computer parts store.
The Old Way (Current AI Training):
Most researchers train these AI interns on the equivalent of "textbooks" or simple "video games." They hand the intern a list of 1,000 fake scenarios like "If a customer asks for a refund, say yes," and the intern memorizes the answers to those specific questions.
- The Problem: When the intern gets to the real job, a customer asks a weird, multi-step question that wasn't in the textbook. The intern freezes, because they only learned to follow a script, not how to think or adapt. They're like a student who memorized the answers to a practice test and then fails the real exam because the questions are slightly different.
The New Way (This Paper's Approach):
The authors of this paper built a hyper-realistic simulation called Corecraft. Instead of a textbook, they built a fully functioning digital universe that acts exactly like a real computer parts company.
- It has 2,500 fake customers, real order histories, inventory systems, and even messy data (like missing receipts or conflicting dates).
- The AI agent isn't just reading a prompt; it's actually logging into a fake database, searching for orders, checking warranty policies, and writing emails to customers.
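To make that concrete, here is a minimal sketch of the kind of tool-calling loop such an agent runs: look up an order, check its warranty, then draft a reply. Every name here (`find_order`, `check_warranty`, the order data) is invented for illustration; it is not the paper's actual Corecraft API.

```python
from datetime import date

# A tiny stand-in for the simulated company database (invented data).
ORDERS = {
    "A-1001": {"item": "GPU", "purchased": date(2024, 1, 5), "warranty_months": 12},
}

def find_order(order_id):
    """Tool 1: look up an order in the mock database."""
    return ORDERS.get(order_id)

def check_warranty(order, today):
    """Tool 2: is the purchase still inside its warranty window?"""
    months = ((today.year - order["purchased"].year) * 12
              + (today.month - order["purchased"].month))
    return months < order["warranty_months"]

def handle_ticket(order_id, today):
    """One mini 'episode': chain tool calls, then write the customer reply."""
    order = find_order(order_id)
    if order is None:
        return "I couldn't locate that order. Could you confirm the order number?"
    if check_warranty(order, today):
        return f"Good news: your {order['item']} is still under warranty."
    return f"Unfortunately, the warranty on your {order['item']} has expired."

print(handle_ticket("A-1001", date(2024, 6, 1)))
```

The point of the simulation is that the agent has to get every step of a chain like this right, against messy data, before the final email to the customer is even correct.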
The Experiment: One Day of "On-the-Job" Training
The researchers took a smart AI model (GLM 4.6) and put it through a single day of training in this realistic simulation. They didn't just tell it "do better." They graded it against a strict checklist (a rubric) written by human experts:
- Did you find the right order?
- Did you check the warranty dates correctly?
- Did you write the email politely?
If the AI missed a step, it got a "bad grade." If it nailed every detail, it got a "good grade."
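In code, that grading scheme boils down to something like the sketch below: each rubric item carries some points, and the agent's score is the sum of the items it satisfied. The specific items and point values here are invented for illustration; the paper's real rubrics are written by human experts.

```python
# Illustrative rubric: (item, points). Weights are made up for this example.
RUBRIC = [
    ("found the right order", 4),
    ("checked warranty dates", 4),
    ("wrote a polite email", 2),
]

def grade(checks):
    """Sum the points of every rubric item the agent's transcript satisfied."""
    return sum(points for item, points in RUBRIC if checks.get(item, False))

# Nailing every detail earns a perfect score of 10...
perfect = grade({"found the right order": True,
                 "checked warranty dates": True,
                 "wrote a polite email": True})

# ...while skipping the warranty check costs exactly that item's points.
sloppy = grade({"found the right order": True,
                "checked warranty dates": False,
                "wrote a polite email": True})
```

A graded score like this gives the training loop a much richer signal than a bare pass/fail: the model learns *which* step it botched, not just that it failed.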
The Results: From "Intern" to "Pro"
After just one day of this realistic training, something magical happened:
- It Got Much Better at the Simulation: The AI's success rate jumped from 25% to 36%. That sounds small, but in the world of AI, that's a massive leap. It went from being worse than a human intern to beating some of the world's most advanced AI models (like Claude Opus 4.5) on these specific tasks.
- The "Superpower" Transfer: This is the most important part. The AI was never trained on other tests. But when they took this newly trained AI and put it into completely different jobs—like calling APIs for a travel app or managing a database for a school—it got better at those too!
- It improved by 7.4% on a customer service test it had never seen.
- It improved by 6.8% on a complex tool-use test.
The Analogy: The Chess Player vs. The General
Think of the old training method like teaching a chess player only the opening moves of one specific game. They might win that one game, but if you change the board or the rules, they lose.
The Corecraft method is like dropping that chess player into a chaotic battlefield (the realistic simulation), where they have to:
- Navigate confusing terrain (messy data).
- Talk to different people (customers).
- Solve problems they've never seen before.
By surviving the chaos of the "real" simulation, the AI learned general survival skills:
- How to plan ahead (Multi-step workflows).
- How to handle rules (Constraint handling).
- How to communicate clearly (Response quality).
Because it learned how to think rather than what to say, it could apply those skills to any new job, whether it was selling computer parts or managing a school's grading system.
The Big Takeaway
The paper argues that the quality of the training environment matters more than just the size of the AI.
If you want an AI that can actually work in the real world, don't just feed it millions of fake, simple questions. Put it in a messy, realistic, high-fidelity simulation where it has to do the actual work, make mistakes, and learn from expert feedback. That's how you turn a brittle robot into a reliable employee.