AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

The paper proposes AceGRPO, an adaptive curriculum-enhanced Group Relative Policy Optimization framework featuring an Evolving Data Buffer and Learnability Potential-guided sampling, which enables a 30B-parameter model to achieve sustained iterative optimization in autonomous Machine Learning Engineering, outperforming larger open-source baselines and approaching proprietary frontier models.

Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen

Published 2026-03-03

Imagine you are trying to teach a brilliant but inexperienced apprentice how to win a complex cooking competition (like a high-stakes MasterChef). The goal isn't just to cook one dish; it's to keep tweaking the recipe, fixing mistakes, and improving the flavor over and over again until it's perfect.

This is what Autonomous Machine Learning Engineering (MLE) is: an AI agent building and improving machine learning models by writing code, running tests, seeing what fails, and trying again.

The paper introduces a new method called AceGRPO to teach these AI agents how to learn faster and better. Here is the breakdown using simple analogies:

The Problem: The "Frozen Brain" and the "Endless Wait"

Currently, most AI agents are like students who have read a textbook but can't take notes.

  1. The Frozen Brain: If an AI tries a recipe and it burns, it remembers the failure for that specific conversation, but once the chat ends, it forgets everything. It doesn't actually learn how to avoid burning the next time. It just keeps making the same mistakes over and over.
  2. The Endless Wait: In real-world coding, running a test isn't instant. It might take hours to see if the code works. If you try to teach the AI by making it run a full 10-hour experiment, fail, and then start over, you'd never finish. It's too slow and expensive.

The Solution: AceGRPO

The authors created AceGRPO, which acts like a genius coach with two superpowers:

1. The "Recycling Bin" (Evolving Data Buffer)

Imagine a chef who, instead of throwing away a burnt cake, cuts off the burnt part and saves the good middle layer to use as a base for a new cake.

  • How it works: When the AI tries to write code and fails (or gets a mediocre score), AceGRPO doesn't throw that attempt away. It takes that "failed" moment, saves it as a new starting point, and turns it into a mini-lesson.
  • The Benefit: Instead of needing brand new, expensive experiments every time, the AI recycles its own past mistakes and partial successes to create a massive library of practice problems. It turns "wasted time" into "training data."
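To make the "Recycling Bin" idea concrete, here is a minimal Python sketch of what an evolving buffer of recycled attempts could look like. The class and threshold below are illustrative assumptions, not the paper's actual implementation: the key idea is simply that an imperfect attempt is saved as a new starting point instead of being discarded.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    """A practice problem: a saved state the agent can optimize from."""
    state: str          # e.g. a snapshot of the partially working code
    best_score: float   # best score reached from this state so far

class EvolvingDataBuffer:
    """Illustrative sketch: recycle imperfect attempts as new tasks."""

    def __init__(self, solved_threshold: float = 0.95):
        self.tasks: list[Task] = []
        self.solved_threshold = solved_threshold  # assumed cutoff for "good enough"

    def add_attempt(self, state: str, score: float) -> None:
        # A failed or mediocre attempt becomes a fresh starting point,
        # turning a "wasted" rollout into new training data.
        if score < self.solved_threshold:
            self.tasks.append(Task(state=state, best_score=score))

    def sample(self) -> Task:
        # Hand the agent one recycled starting point to improve on.
        return random.choice(self.tasks)
```

For example, `buf.add_attempt("train.py v1", 0.40)` would save a mediocre attempt as a new practice problem, while an attempt scoring 0.99 would be skipped because there is little left to learn from it.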

2. The "Smart Coach" (Adaptive Sampling)

Imagine a coach who has a huge pile of practice drills.

  • The Bad Coach: Picks drills randomly. Sometimes they pick drills the student has already mastered (boring, no learning) and sometimes drills that are impossible (frustrating, no learning).
  • The AceGRPO Coach: Uses a special radar called Learnability Potential. This radar scans the practice drills and asks: "Which of these is just right for the student right now?"
    • It ignores the easy stuff (the student already knows it).
    • It ignores the impossible stuff (the student isn't ready).
    • It focuses entirely on the "Goldilocks Zone": The tasks that are hard enough to be challenging but solvable enough to teach something new.

By only spending time on these "Goldilocks" tasks, the AI learns much faster because it isn't wasting energy on things it already knows or things it can't do yet.
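The "Goldilocks Zone" intuition can be sketched in a few lines of Python. The formula below is a hypothetical proxy (success rate times failure rate, which peaks at 50%), chosen here only to illustrate the idea of Learnability Potential; the paper's actual scoring function may differ.

```python
import random

def learnability_potential(success_rate: float) -> float:
    """Hypothetical proxy: p * (1 - p) peaks at p = 0.5.
    Tasks always solved (p = 1) or never solved (p = 0) score zero."""
    return success_rate * (1.0 - success_rate)

def sample_task(tasks: dict[str, float]) -> str:
    """Pick a task, weighted by how much the agent can learn from it."""
    names = list(tasks)
    weights = [learnability_potential(tasks[n]) for n in names]
    if sum(weights) == 0:
        return random.choice(names)  # everything mastered or impossible
    return random.choices(names, weights=weights, k=1)[0]
```

With `tasks = {"mastered": 1.0, "impossible": 0.0, "goldilocks": 0.5}`, the mastered and impossible drills both get zero weight, so the sampler always picks the task in the middle, exactly the coaching behavior described above.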

The Results: A Small Model Beating Giants

The team trained a model called Ace-30B (about 30 billion parameters, the internal "dials" a model tunes as it learns).

  • The Competition: They pitted it against massive, expensive, closed-source models (like GPT-5 or Claude-4.5) and other huge open-source models.
  • The Outcome: Even though Ace-30B is smaller than some of its rivals, it performed better.
    • It managed to submit valid solutions 100% of the time (it never gave up or crashed).
    • It won more "medals" (top rankings) than the much larger models.
    • It kept improving over time, whereas the others hit a wall and stopped getting better.

The Big Picture

Think of AceGRPO as the difference between a student who just reads a book and a student who has a personal tutor.

  • The tutor (AceGRPO) takes every mistake the student makes, saves it, and creates a custom practice plan that focuses only on the specific things the student needs to learn next.
  • This allows a smaller, more efficient AI to outperform massive, expensive giants by learning how to learn, rather than just memorizing answers.

In short: AceGRPO teaches AI agents to stop repeating mistakes and start focusing on the exact right challenges to solve, turning a slow, expensive process into a fast, self-improving engine.
