Can LLMs Beat Classical Hyperparameter Optimization Algorithms? A Study on autoresearch

This paper demonstrates that classical hyperparameter optimization algorithms outperform LLM-based agents in fixed search spaces. However, a hybrid approach called Centaur, which combines a classical optimizer's internal state with an LLM's code-editing capabilities, achieves superior results, showing that even small, cost-effective LLMs can excel when paired with robust classical methods.

Fabio Ferreira, Lucca Wobbe, Arjun Krishnakumar, Frank Hutter, Arber Zela

Published 2026-03-27

Imagine you are trying to bake the perfect loaf of bread. You have a recipe (the training code), but the recipe has many knobs you can turn: how much flour, how hot the oven is, how long to knead the dough, etc. These are hyperparameters. Your goal is to find the exact combination that makes the best bread, but you only have a limited amount of time and electricity (a compute budget) to experiment.

This paper is a race to see who can find that "perfect bread" recipe faster and better: old-school math algorithms, or AI agents (LLMs)?

Here is the breakdown of the study, explained with simple analogies.

1. The Contestants

The researchers set up a tournament with three types of bakers:

  • The Math Wizards (Classical HPO): These are algorithms like CMA-ES (Covariance Matrix Adaptation Evolution Strategy) and TPE (Tree-structured Parzen Estimator). Think of them as super-smart statisticians who don't know anything about baking. They just look at the results of previous loaves and use pure math to guess the next set of ingredients. They are incredibly efficient at navigating the "search space" (the list of possible ingredient combinations).
  • The AI Chefs (LLM Agents): These are Large Language Models (like the 27B-parameter model used here). They are like a chef who has read every cookbook in the world. They have "intuition" about baking.
    • The Strict Chef: One version of the AI is only allowed to turn the knobs on the existing recipe (Fixed Search Space).
    • The Creative Chef: Another version (the Karpathy Agent) is allowed to rewrite the recipe itself, changing the instructions or adding new steps (Unconstrained Code Editing).
  • The Hybrid (Centaur): This is the paper's star invention. It's a "Centaur" (half-human, half-horse). It combines the Math Wizard's internal map with the AI Chef's intuition.
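To make the "Math Wizard" concrete, here is a toy sketch of the simplest evolution strategy, a (1+1)-ES, a tiny cousin of CMA-ES. The objective function and hyperparameter names are made up for illustration; this is not the paper's implementation.

```python
import random

def evaluate(lr, batch_exp):
    """Stand-in objective: a fake "validation loss" that is lowest
    at lr=0.01 and batch_exp=5 (hypothetical, not from the paper)."""
    return (lr - 0.01) ** 2 + 0.1 * (batch_exp - 5) ** 2

def one_plus_one_es(budget=50, seed=0):
    """Minimal (1+1) evolution strategy.

    Keep one "best recipe", repeatedly propose a Gaussian
    perturbation of it, and accept the candidate only if it scores
    better. CMA-ES layers covariance and step-size adaptation on
    top of this same propose-evaluate-keep loop.
    """
    rng = random.Random(seed)
    best = {"lr": 0.1, "batch_exp": 7}
    best_score = evaluate(**best)
    for _ in range(budget):
        cand = {
            "lr": max(1e-5, best["lr"] + rng.gauss(0, 0.02)),
            "batch_exp": min(10, max(1, best["batch_exp"] + rng.gauss(0, 1))),
        }
        score = evaluate(**cand)
        if score < best_score:
            best, best_score = cand, score
    return best, best_score
```

Real CMA-ES also adapts a full covariance matrix and step size; this stripped-down version only shows the "guess ingredients, bake, keep the best loaf" loop that the paper's classical baselines are built on.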

2. The Results: Who Won?

Scenario A: The "Strict" Kitchen (Fixed Search Space)

When the AI was forced to just turn knobs on a fixed recipe, the Math Wizards won easily.

  • Why? The AI chefs got confused. Even though they had read all the cookbooks, they struggled to remember the results of the last 20 loaves they baked. They kept making mistakes that caused the oven to explode (Out-of-Memory errors).
  • The Lesson: In a strict, mathematical game, raw intuition isn't enough. You need a system that remembers its state perfectly. The Math Wizards were like a GPS that never loses signal, while the AI was like a tourist with a paper map who keeps getting lost.
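A standard way to get that reliability is to catch crashes and hand the optimizer a penalty score instead of letting the whole run die. A minimal sketch; the function names and the simulated OOM are hypothetical, not taken from the paper:

```python
import math

WORST = math.inf  # penalty score for a crashed trial

def safe_evaluate(train_fn, config):
    """Run one trial; convert crashes (e.g. out-of-memory) into a
    penalty score. `train_fn` is any callable that trains a model and
    returns a validation loss; the names here are illustrative."""
    try:
        return train_fn(**config)
    except MemoryError:
        # The "oven exploded": report the worst possible score so the
        # optimizer learns to steer away from this region.
        return WORST

def fragile_train(batch_size, lr):
    """Toy training function that crashes on big batches."""
    if batch_size > 512:  # simulate an out-of-memory crash
        raise MemoryError("simulated OOM")
    return abs(lr - 0.01)  # fake validation loss
```

With this guard, an exploding oven just becomes another data point ("never use that much flour again") instead of ending the bake-off.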

Scenario B: The "Creative" Kitchen (Rewriting the Code)

When the AI was allowed to rewrite the recipe itself, the game changed.

  • The AI Chef (Karpathy Agent) started doing much better, narrowing the gap with the Math Wizards.
  • Why? Because sometimes the best way to improve isn't just tweaking the temperature; it's realizing the recipe needs a new ingredient entirely. The AI's ability to understand code and "think outside the box" gave it an advantage here.
  • The Catch: This required a very smart AI (the 27B model). A smaller, cheaper AI (0.8B) couldn't handle the complexity of rewriting code and failed miserably.
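The code-rewriting loop can be sketched roughly like this, with the LLM call replaced by a hard-coded stub. Everything here, including the toy "training script", is illustrative and not from the paper:

```python
def llm_propose_edit(source, history):
    """Stub for the LLM call. A real agent would send the script and
    past results to a model; here we hard-code one plausible edit."""
    return source.replace("momentum=0.0", "momentum=0.9")

def run_script(source):
    """Execute the (toy) training script and read back its score."""
    scope = {}
    exec(source, scope)
    return scope["val_loss"]

# Toy "recipe": a script whose loss improves if momentum is raised.
SCRIPT = "momentum=0.0\nval_loss = 1.0 - momentum\n"

def agent_loop(source, steps=3):
    """Propose an edit, run the edited script, keep it if it scored
    better. This is the shape of the unconstrained code-editing loop."""
    history = []
    best_src, best = source, run_script(source)
    for _ in range(steps):
        candidate = llm_propose_edit(best_src, history)
        score = run_script(candidate)
        history.append(score)
        if score < best:
            best_src, best = candidate, score
    return best
```

The hard part, and what the 0.8B model could not handle, is the stubbed-out step: producing an edit that still runs and actually helps.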

Scenario C: The "Centaur" Solution (The Winner)

The researchers combined the two: Centaur.

  • How it works: The Math Wizard (CMA-ES) does the heavy lifting, calculating the best path. But, 30% of the time, it asks the AI Chef, "Hey, based on your intuition and the current map, do you see a better spot?"
  • The Magic: The AI Chef almost always says, "Yes, I see a better spot," and tweaks the suggestion.
  • The Surprise: The small, cheap AI (0.8B) actually worked better in this hybrid team than the big, expensive one!
    • Analogy: You don't need a Nobel Prize-winning chef to give a small nudge to a GPS. You just need someone who knows the local shortcuts. The big AI was overkill and sometimes got distracted; the small AI was just enough to make the Math Wizard even sharper.
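The hybrid described above can be sketched as a plain optimization loop where roughly 30% of proposals are routed through an "AI Chef" that sees the optimizer's state. The LLM is again a stub, and all names are illustrative rather than the paper's actual implementation:

```python
import random

def objective(x):
    """Toy 1-D objective with its best value at x = 3."""
    return (x - 3.0) ** 2

def llm_refine(candidate, optimizer_state):
    """Stub for the LLM step: given the optimizer's current state
    (its mean), nudge the candidate toward it. Illustrative only."""
    mean = optimizer_state["mean"]
    return candidate + 0.5 * (mean - candidate)

def centaur(budget=60, ask_llm_prob=0.3, seed=1):
    """Optimizer proposes; ~30% of proposals pass through the LLM."""
    rng = random.Random(seed)
    state = {"mean": 0.0, "sigma": 1.0}
    best_x, best = state["mean"], objective(state["mean"])
    for _ in range(budget):
        cand = state["mean"] + rng.gauss(0, state["sigma"])
        if rng.random() < ask_llm_prob:  # hand this one to the "AI Chef"
            cand = llm_refine(cand, state)
        score = objective(cand)
        if score < best:
            best_x, best = cand, score
            state["mean"] = cand     # crude stand-in for a CMA-ES update
            state["sigma"] *= 0.9
    return best_x, best
```

The key design point is that `llm_refine` receives the optimizer's state, so the "nudge" is grounded in the Math Wizard's map instead of the LLM's unaided memory of past trials.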

3. Key Takeaways (The "Moral of the Story")

  1. Reliability > Exploration: In the strict search space, avoiding disasters (like the oven exploding/OOM errors) was more important than trying wild, new ideas. The Math Wizards were safer and more consistent.
  2. Context Matters: If you just want to tweak numbers, use a Math Algorithm. If you want to rewrite the rules of the game, you need an AI that can edit code.
  3. The Hybrid is King: The best approach wasn't "AI vs. Math." It was "AI + Math." By giving the AI the Math Wizard's internal map (the "state"), the AI could make smarter suggestions without getting lost.
  4. Bigger isn't Always Better: For this specific hybrid task, a tiny, cheap AI was actually more effective than a massive, expensive one.

Summary

The paper asks: "Can AI replace the old math tools for tuning machine learning?"

The answer is: Not yet, on its own. If you just let an AI guess numbers, it's often worse than a math algorithm. But if you let the AI rewrite the code, it gets competitive. And the absolute best result comes from a team-up: a math algorithm that does the heavy lifting, guided occasionally by a small, smart AI that knows the "tricks of the trade."

It's like having a super-accurate GPS (the Math) that occasionally asks a local taxi driver (the AI) for a shortcut. The result is a faster, smoother ride than either could achieve alone.