Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch

This paper demonstrates that classical hyperparameter optimization algorithms outperform LLM-based agents in fixed search spaces. However, a hybrid approach called Centaur, which combines a classical optimizer's internal state with an LLM's code-editing capabilities, achieves superior results, showing that even small, cost-effective LLMs can excel when paired with robust classical methods.

Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela

Published 2026-03-27

Imagine you are trying to bake the perfect loaf of bread. You have a recipe (the training code), but the recipe has many knobs you can turn: how much flour, how hot the oven is, how long to knead the dough, etc. These are hyperparameters. Your goal is to find the exact combination that makes the best bread, but you only have a limited amount of time and electricity (a compute budget) to experiment.

This paper is a race to see who can find that "perfect bread" recipe faster and better: old-school math algorithms, or AI agents (LLMs)?

Here is the breakdown of the study, explained with simple analogies.

1. The Contestants

The researchers set up a tournament with three types of bakers:

  • The Math Wizards (Classical HPO): These are algorithms like CMA-ES (Covariance Matrix Adaptation Evolution Strategy) and TPE (Tree-structured Parzen Estimator). Think of them as super-smart statisticians who don't know anything about baking. They just look at the results of previous loaves and use pure math to guess the next set of ingredients. They are incredibly efficient at navigating the "search space" (the list of possible ingredient combinations).
  • The AI Chefs (LLM Agents): These are Large Language Models (like the 27B-parameter model used here). They are like a chef who has read every cookbook in the world. They have "intuition" about baking.
    • The Strict Chef: One version of the AI is only allowed to turn the knobs on the existing recipe (Fixed Search Space).
    • The Creative Chef: Another version (the Karpathy Agent) is allowed to rewrite the recipe itself, changing the instructions or adding new steps (Unconstrained Code Editing).
  • The Hybrid (Centaur): This is the paper's star invention. It's a "Centaur" (half-human, half-horse). It combines the Math Wizard's internal map with the AI Chef's intuition.
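To make the "Math Wizard" concrete, here is a toy sketch of the simplest evolution strategy, a (1+1)-ES, a tiny cousin of CMA-ES. The objective function and hyperparameter names are made up for illustration; this is not the paper's implementation.

```python
import random

def evaluate(lr, batch_exp):
    """Stand-in objective: a fake "validation loss" that is lowest
    at lr=0.01 and batch_exp=5 (hypothetical, not from the paper)."""
    return (lr - 0.01) ** 2 + 0.1 * (batch_exp - 5) ** 2

def one_plus_one_es(budget=50, seed=0):
    """Minimal (1+1) evolution strategy.

    Keep one "best recipe", repeatedly propose a Gaussian
    perturbation of it, and accept the candidate only if it scores
    better. CMA-ES layers covariance and step-size adaptation on
    top of this same propose-evaluate-keep loop.
    """
    rng = random.Random(seed)
    best = {"lr": 0.1, "batch_exp": 7}
    best_score = evaluate(**best)
    for _ in range(budget):
        cand = {
            "lr": max(1e-5, best["lr"] + rng.gauss(0, 0.02)),
            "batch_exp": min(10, max(1, best["batch_exp"] + rng.gauss(0, 1))),
        }
        score = evaluate(**cand)
        if score < best_score:
            best, best_score = cand, score
    return best, best_score
```

Real CMA-ES also adapts a full covariance matrix and step size; this stripped-down version only shows the "guess ingredients, bake, keep the best loaf" loop that the paper's classical baselines are built on.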

2. The Results: Who Won?

Scenario A: The "Strict" Kitchen (Fixed Search Space)

When the AI was forced to just turn knobs on a fixed recipe, the Math Wizards won easily.

  • Why? The AI chefs got confused. Even though they had read all the cookbooks, they struggled to remember the results of the last 20 loaves they baked. They kept making mistakes that caused the oven to explode (Out-of-Memory errors).
  • The Lesson: In a strict, mathematical game, raw intuition isn't enough. You need a system that remembers its state perfectly. The Math Wizards were like a GPS that never loses signal, while the AI was like a tourist with a paper map who keeps getting lost.
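A standard way to get that reliability is to catch crashes and hand the optimizer a penalty score instead of letting the whole run die. A minimal sketch; the function names and the simulated OOM are hypothetical, not taken from the paper:

```python
import math

WORST = math.inf  # penalty score for a crashed trial

def safe_evaluate(train_fn, config):
    """Run one trial; convert crashes (e.g. out-of-memory) into a
    penalty score. `train_fn` is any callable that trains a model and
    returns a validation loss; the names here are illustrative."""
    try:
        return train_fn(**config)
    except MemoryError:
        # The "oven exploded": report the worst possible score so the
        # optimizer learns to steer away from this region.
        return WORST

def fragile_train(batch_size, lr):
    """Toy training function that crashes on big batches."""
    if batch_size > 512:  # simulate an out-of-memory crash
        raise MemoryError("simulated OOM")
    return abs(lr - 0.01)  # fake validation loss
```

With this guard, an exploding oven just becomes another data point ("never use that much flour again") instead of ending the bake-off.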

Scenario B: The "Creative" Kitchen (Rewriting the Code)

When the AI was allowed to rewrite the recipe itself, the game changed.

  • The AI Chef (Karpathy Agent) started doing much better, narrowing the gap with the Math Wizards.
  • Why? Because sometimes the best way to improve isn't just tweaking the temperature; it's realizing the recipe needs a new ingredient entirely. The AI's ability to understand code and "think outside the box" gave it an advantage here.
  • The Catch: This required a very smart AI (the 27B model). A smaller, cheaper AI (0.8B) couldn't handle the complexity of rewriting code and failed miserably.
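The code-rewriting loop can be sketched roughly like this, with the LLM call replaced by a hard-coded stub. Everything here, including the toy "training script", is illustrative and not from the paper:

```python
def llm_propose_edit(source, history):
    """Stub for the LLM call. A real agent would send the script and
    past results to a model; here we hard-code one plausible edit."""
    return source.replace("momentum=0.0", "momentum=0.9")

def run_script(source):
    """Execute the (toy) training script and read back its score."""
    scope = {}
    exec(source, scope)
    return scope["val_loss"]

# Toy "recipe": a script whose loss improves if momentum is raised.
SCRIPT = "momentum=0.0\nval_loss = 1.0 - momentum\n"

def agent_loop(source, steps=3):
    """Propose an edit, run the edited script, keep it if it scored
    better. This is the shape of the unconstrained code-editing loop."""
    history = []
    best_src, best = source, run_script(source)
    for _ in range(steps):
        candidate = llm_propose_edit(best_src, history)
        score = run_script(candidate)
        history.append(score)
        if score < best:
            best_src, best = candidate, score
    return best
```

The hard part, and what the 0.8B model could not handle, is the stubbed-out step: producing an edit that still runs and actually helps.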

Scenario C: The "Centaur" Solution (The Winner)

The researchers combined the two: Centaur.

  • How it works: The Math Wizard (CMA-ES) does the heavy lifting, calculating the best path. But, 30% of the time, it asks the AI Chef, "Hey, based on your intuition and the current map, do you see a better spot?"
  • The Magic: The AI Chef almost always says, "Yes, I see a better spot," and tweaks the suggestion.
  • The Surprise: The small, cheap AI (0.8B) actually worked better in this hybrid team than the big, expensive one!
    • Analogy: You don't need a Nobel Prize-winning chef to give a small nudge to a GPS. You just need someone who knows the local shortcuts. The big AI was overkill and sometimes got distracted; the small AI was just enough to make the Math Wizard even sharper.
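The hybrid described above can be sketched as a plain optimization loop where roughly 30% of proposals are routed through an "AI Chef" that sees the optimizer's state. The LLM is again a stub, and all names are illustrative rather than the paper's actual implementation:

```python
import random

def objective(x):
    """Toy 1-D objective with its best value at x = 3."""
    return (x - 3.0) ** 2

def llm_refine(candidate, optimizer_state):
    """Stub for the LLM step: given the optimizer's current state
    (its mean), nudge the candidate toward it. Illustrative only."""
    mean = optimizer_state["mean"]
    return candidate + 0.5 * (mean - candidate)

def centaur(budget=60, ask_llm_prob=0.3, seed=1):
    """Optimizer proposes; ~30% of proposals pass through the LLM."""
    rng = random.Random(seed)
    state = {"mean": 0.0, "sigma": 1.0}
    best_x, best = state["mean"], objective(state["mean"])
    for _ in range(budget):
        cand = state["mean"] + rng.gauss(0, state["sigma"])
        if rng.random() < ask_llm_prob:  # hand this one to the "AI Chef"
            cand = llm_refine(cand, state)
        score = objective(cand)
        if score < best:
            best_x, best = cand, score
            state["mean"] = cand     # crude stand-in for a CMA-ES update
            state["sigma"] *= 0.9
    return best_x, best
```

The key design point is that `llm_refine` receives the optimizer's state, so the "nudge" is grounded in the Math Wizard's map instead of the LLM's unaided memory of past trials.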

3. Key Takeaways (The "Moral of the Story")

  1. Reliability > Exploration: In the strict search space, avoiding disasters (like the oven exploding/OOM errors) was more important than trying wild, new ideas. The Math Wizards were safer and more consistent.
  2. Context Matters: If you just want to tweak numbers, use a Math Algorithm. If you want to rewrite the rules of the game, you need an AI that can edit code.
  3. The Hybrid is King: The best approach wasn't "AI vs. Math." It was "AI + Math." By giving the AI the Math Wizard's internal map (the "state"), the AI could make smarter suggestions without getting lost.
  4. Bigger isn't Always Better: For this specific hybrid task, a tiny, cheap AI was actually more effective than a massive, expensive one.

Summary

The paper asks: "Can AI replace the old math tools for tuning machine learning?"

The answer is: Not yet, on its own. If you just let an AI guess numbers, it's often worse than a math algorithm. But if you let the AI rewrite the code, it gets competitive. And the absolute best result comes from a team-up: a math algorithm that does the heavy lifting, guided occasionally by a small, smart AI that knows the "tricks of the trade."

It's like having a super-accurate GPS (the Math) that occasionally asks a local taxi driver (the AI) for a shortcut. The result is a faster, smoother ride than either could achieve alone.