AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

The paper proposes AceGRPO, an adaptive curriculum-enhanced Group Relative Policy Optimization framework featuring an Evolving Data Buffer and Learnability Potential-guided sampling, which enables a 30B-parameter model to achieve sustained iterative optimization in autonomous Machine Learning Engineering, outperforming larger open-source baselines and approaching proprietary frontier models.

Yuzhu Cai, Zexi Liu, Xinyu Zhu, Cheng Wang, Siheng Chen

Published 2026-03-03

Imagine you are trying to teach a brilliant but inexperienced apprentice how to win a complex cooking competition (like a high-stakes MasterChef). The goal isn't just to cook one dish; it's to keep tweaking the recipe, fixing mistakes, and improving the flavor over and over again until it's perfect.

This is what Autonomous Machine Learning Engineering (MLE) is: an AI agent building and improving machine learning models by writing code, running tests, seeing what fails, and trying again.

The paper introduces a new method called AceGRPO to teach these AI agents how to learn faster and better. Here is the breakdown using simple analogies:

The Problem: The "Frozen Brain" and the "Endless Wait"

Currently, most AI agents are like students who have read a textbook but can't take notes.

  1. The Frozen Brain: If an AI tries a recipe and it burns, it remembers the failure for that specific conversation, but once the chat ends, it forgets everything. It doesn't actually learn how to avoid burning the next time. It just keeps making the same mistakes over and over.
  2. The Endless Wait: In real-world coding, running a test isn't instant. It might take hours to see if the code works. If you try to teach the AI by making it run a full 10-hour experiment, fail, and then start over, you'd never finish. It's too slow and expensive.

The Solution: AceGRPO

The authors created AceGRPO, which acts like a genius coach with two superpowers:

1. The "Recycling Bin" (Evolving Data Buffer)

Imagine a chef who, instead of throwing away a burnt cake, cuts off the burnt part and saves the good middle layer to use as a base for a new cake.

  • How it works: When the AI tries to write code and fails (or gets a mediocre score), AceGRPO doesn't throw that attempt away. It takes that "failed" moment, saves it as a new starting point, and turns it into a mini-lesson.
  • The Benefit: Instead of needing brand new, expensive experiments every time, the AI recycles its own past mistakes and partial successes to create a massive library of practice problems. It turns "wasted time" into "training data."
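To make the "Recycling Bin" idea concrete, here is a minimal Python sketch of what an evolving buffer of recycled attempts could look like. The class and threshold below are illustrative assumptions, not the paper's actual implementation: the key idea is simply that an imperfect attempt is saved as a new starting point instead of being discarded.

```python
import random
from dataclasses import dataclass

@dataclass
class Task:
    """A practice problem: a saved state the agent can optimize from."""
    state: str          # e.g. a snapshot of the partially working code
    best_score: float   # best score reached from this state so far

class EvolvingDataBuffer:
    """Illustrative sketch: recycle imperfect attempts as new tasks."""

    def __init__(self, solved_threshold: float = 0.95):
        self.tasks: list[Task] = []
        self.solved_threshold = solved_threshold  # assumed cutoff for "good enough"

    def add_attempt(self, state: str, score: float) -> None:
        # A failed or mediocre attempt becomes a fresh starting point,
        # turning a "wasted" rollout into new training data.
        if score < self.solved_threshold:
            self.tasks.append(Task(state=state, best_score=score))

    def sample(self) -> Task:
        # Hand the agent one recycled starting point to improve on.
        return random.choice(self.tasks)
```

For example, `buf.add_attempt("train.py v1", 0.40)` would save a mediocre attempt as a new practice problem, while an attempt scoring 0.99 would be skipped because there is little left to learn from it.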

2. The "Smart Coach" (Adaptive Sampling)

Imagine a coach who has a huge pile of practice drills.

  • The Bad Coach: Picks drills randomly. Sometimes they pick drills the student has already mastered (boring, no learning) and sometimes drills that are impossible (frustrating, no learning).
  • The AceGRPO Coach: Uses a special radar called Learnability Potential. This radar scans the practice drills and asks: "Which of these is just right for the student right now?"
    • It ignores the easy stuff (the student already knows it).
    • It ignores the impossible stuff (the student isn't ready).
    • It focuses entirely on the "Goldilocks Zone": The tasks that are hard enough to be challenging but solvable enough to teach something new.

By only spending time on these "Goldilocks" tasks, the AI learns much faster because it isn't wasting energy on things it already knows or things it can't do yet.
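The "Goldilocks Zone" intuition can be sketched in a few lines of Python. The formula below is a hypothetical proxy (success rate times failure rate, which peaks at 50%), chosen here only to illustrate the idea of Learnability Potential; the paper's actual scoring function may differ.

```python
import random

def learnability_potential(success_rate: float) -> float:
    """Hypothetical proxy: p * (1 - p) peaks at p = 0.5.
    Tasks always solved (p = 1) or never solved (p = 0) score zero."""
    return success_rate * (1.0 - success_rate)

def sample_task(tasks: dict[str, float]) -> str:
    """Pick a task, weighted by how much the agent can learn from it."""
    names = list(tasks)
    weights = [learnability_potential(tasks[n]) for n in names]
    if sum(weights) == 0:
        return random.choice(names)  # everything mastered or impossible
    return random.choices(names, weights=weights, k=1)[0]
```

With `tasks = {"mastered": 1.0, "impossible": 0.0, "goldilocks": 0.5}`, the mastered and impossible drills both get zero weight, so the sampler always picks the task in the middle, exactly the coaching behavior described above.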

The Results: A Small Model Beating Giants

The team trained a model called Ace-30B (about 30 billion parameters, the internal "dials" a model tunes as it learns).

  • The Competition: They pitted it against massive, expensive, closed-source models (like GPT-5 or Claude-4.5) and other huge open-source models.
  • The Outcome: Even though Ace-30B is smaller than some of its rivals, it performed better.
    • It managed to submit valid solutions 100% of the time (it never gave up or crashed).
    • It won more "medals" (top rankings) than the much larger models.
    • It kept improving over time, whereas the others hit a wall and stopped getting better.

The Big Picture

Think of AceGRPO as the difference between a student who just reads a book and a student who has a personal tutor.

  • The tutor (AceGRPO) takes every mistake the student makes, saves it, and creates a custom practice plan that focuses only on the specific things the student needs to learn next.
  • This allows a smaller, more efficient AI to outperform massive, expensive giants by learning how to learn, rather than just memorizing answers.

In short: AceGRPO teaches AI agents to stop repeating mistakes and start focusing on the exact right challenges to solve, turning a slow, expensive process into a fast, self-improving engine.
