An Open-Source Training Dataset for Foundation Models… — Plain-Language Explanation

Original authors: Aaron Klein, Herilalaina Rakotoarison, Luca Thale-Bombien, David Salinas

Published 2026-05-25. Author reviewed.

Original paper licensed under CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). This is an AI-generated explanation of the paper below. It is not written by the authors. For technical accuracy, refer to the original paper.

The Big Problem: The "Black Box" Mystery

Imagine you are trying to bake the perfect cake, but you have a magical oven that is completely sealed. You can't see inside, you don't know the recipe, and you can't measure the temperature. The only way to learn is to put a cake in, wait for it to bake, take it out, and taste it.

The Cake: This is the "objective function" (the problem you want to solve).
The Ingredients: These are the "hyperparameters" (settings like learning rate, number of layers, etc.).
The Taste: This is the "score" (how good the result is).

This is called Black-Box Optimization. It happens everywhere: tuning AI models, designing new drugs, or configuring robots. The problem is that finding the perfect "cake" usually requires a human expert to guess, tweak, and taste thousands of times. It's slow, expensive, and the expert's tricks often don't work if you switch from baking a cake to baking bread.

The Old Way vs. The New Idea

The Old Way: Scientists have built many different "tasting experts" (algorithms) over the years. One expert is great at finding cake recipes, but terrible at finding bread recipes. They are specialized tools.

The New Idea (Foundation Models): What if we could train a single, super-smart AI to learn the general principles of baking? Instead of being a cake expert or a bread expert, it would be a "Master Baker" that understands how to optimize any recipe just by looking at thousands of past baking attempts.

The Missing Ingredient: A Giant Cookbook

To train this "Master Baker," you need a massive library of past baking attempts (data).

The Problem: Previous attempts to do this relied on secret data (which no one else could see) or made-up data (which didn't reflect real life). It was like trying to teach a chef using a cookbook written in a language no one speaks, or using fake ingredients.
The Solution (BBO-Pile): The authors created BBO-Pile, the first open-source "Cookbook" for this task.
- It contains 557,100 different baking attempts (trajectories).
- These attempts cover 3,095 different types of problems (from tuning AI models to chemical design).
- It includes data from 6 different "tasting experts" (algorithms) so the AI can learn different strategies.
- It is massive: about 2.5 billion words (tokens) of data.

How They Trained the "Master Baker"

The authors didn't just give the AI the cookbook; they trained a family of AI models (like different-sized chefs) to read it.

The Models: They built models ranging from small (2 million parameters) to large (80 million parameters).
The Training: They fed the models the data and asked them to predict the next step in a baking process.
- Input: "Here is the recipe so far, and here is how the last cake tasted."
- Output: "Here is the next ingredient mix you should try."
The Result: The AI learned to mimic the behavior of the original "tasting experts" (the algorithms that produced the data). If you told the AI to act like "Expert A," it acted like Expert A. If you told it to act like "Expert B," it switched strategies.

What They Discovered

Bigger is Better (but with limits): As they made the AI models bigger and fed them more data, the models got better at mimicking the experts. However, the improvement wasn't as explosive as it is with chatbots (LLMs); it was a steady, predictable climb.
Generalization: The AI didn't just memorize the recipes in the book. When they tested it on a new type of problem it had never seen before (like a completely new type of bread), it still performed surprisingly well. It had learned the logic of optimization, not just the specific answers.
Speed: Once trained, the AI can suggest the next step almost instantly, much faster than running complex mathematical simulations from scratch.

The Bottom Line

This paper is like building the first public library of "optimization stories." By sharing this massive dataset (BBO-Pile), the authors have allowed other researchers to train their own "Master Baker" AI.

They proved that you can train a general-purpose AI to understand how to solve complex, unknown problems by simply showing it how other methods solved similar problems in the past. It's a step toward an AI that doesn't just solve one puzzle, but knows how to figure out any puzzle.

Important Note: The paper focuses entirely on creating this dataset and training these models to mimic existing optimization methods. It does not claim to have solved specific real-world problems (like curing a disease or designing a specific rocket) yet, nor does it discuss future clinical applications. The goal was simply to prove that this "Foundation Model" approach works and to provide the data so others can try it.

Technical Summary: BBO-Pile and Foundation Models for Black-Box Optimization

Problem Statement
Black-box optimization (BBO) is a fundamental challenge across scientific and engineering domains, including robotics, chemical design, and machine learning hyperparameter tuning. The core difficulty lies in optimizing an objective function $f(x)$ without access to its structural information or gradients, relying solely on query outputs. Existing BBO methods, such as Bayesian Optimization (BO) and evolutionary algorithms, are often specialized, performing well only within narrow problem classes. They typically require extensive manual tuning and fail to generalize across diverse domains. While foundation models have succeeded in vision and natural language processing, their application to BBO has been hindered by a lack of large-scale, public, real-world pre-training data. Prior attempts, such as OptFormer, relied on non-public datasets or purely synthetic data, limiting reproducibility and the ability to learn generalizable optimization principles.

Methodology
The authors introduce BBO-Pile, the first open-source dataset designed to train foundation models for black-box optimization. The methodology encompasses dataset construction, tokenization, and model training:

Dataset Construction (BBO-Pile): The dataset aggregates 557,100 optimization trajectories across 3,095 distinct black-box tasks spanning 102 search spaces. These tasks are drawn from seven benchmark families covering hyperparameter optimization (HPO-B, LC-Bench, PD1, TabRepo), neural architecture search (FC-Net, NAS-Bench-201), and synthetic global optimization problems. The data was generated by running six optimizers (BORE, CQR, HEBO, TPE, Regularized Evolution, and Random Search) on every task with a budget of 100 evaluations per run, each repeated 30 times with different seeds.
Data Augmentation: To expand the token count and mitigate overfitting, the authors permute the hyperparameter order (preserving the numerical-before-categorical convention) and sample trajectories of varying lengths ($T \in \{5, 10, 20, 50, 100\}$) prior to quantization. This results in a final dataset of approximately 2.5 billion tokens.
Encoding and Tokenization: Optimization trajectories are encoded as sequences of tokens. Metadata (optimizer name, search space) is encoded first. Numerical configurations and objective values are min-max scaled to $[0, 1]$, discretized into $Q=1000$ bins, and converted to strings. Categorical parameters are encoded by index. Special characters denote the end of configurations and observed metrics. A Byte-Pair Encoding (BPE) tokenizer is trained on these strings.
Model Architecture and Training: The authors train decoder-only transformer models based on the Qwen3 architecture, utilizing Rotary Position Embeddings, Grouped Query Attention, and Root Mean Square Normalization. The models are trained using a standard causal language modeling objective ($L(\theta) = -\sum_i \log p_\theta(s_i \mid s_{<i})$).
Inference: During inference, the model samples a completion string based on the encoded search space and historical observations. Constrained decoding ensures all generated values are valid and decodable.
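The encoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the binning rule, the metadata header format, and the separator characters `|` and `;` are assumptions chosen for illustration, and categorical parameters (encoded by index in the paper) are left out for brevity.

```python
from typing import List, Sequence, Tuple

Q = 1000            # number of quantization bins, as stated in the text
END_CONFIG = "|"    # hypothetical marker: end of a configuration
END_METRIC = ";"    # hypothetical marker: end of an observed metric

Bounds = Tuple[float, float]


def quantize(value: float, low: float, high: float, q: int = Q) -> int:
    """Min-max scale value to [0, 1], then map it to a bin index in [0, q - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    unit = (value - low) / (high - low)
    unit = min(max(unit, 0.0), 1.0)      # clip values outside the bounds
    return min(int(unit * q), q - 1)     # the top edge falls into the last bin


def dequantize(bin_index: int, low: float, high: float, q: int = Q) -> float:
    """Map a bin index back to the centre of its bin in original units."""
    return low + (bin_index + 0.5) / q * (high - low)


def encode_trajectory(
    optimizer: str,
    configs: Sequence[Sequence[float]],
    scores: Sequence[float],
    bounds: Sequence[Bounds],
    score_bounds: Bounds,
) -> str:
    """Serialize one trajectory: metadata first, then (config, score) pairs."""
    parts: List[str] = [f"<{optimizer}>"]  # hypothetical metadata header
    for config, score in zip(configs, scores):
        bins = [quantize(x, lo, hi) for x, (lo, hi) in zip(config, bounds)]
        parts.append(" ".join(str(b) for b in bins) + END_CONFIG)
        parts.append(str(quantize(score, *score_bounds)) + END_METRIC)
    return "".join(parts)


if __name__ == "__main__":
    text = encode_trajectory(
        optimizer="CQR",
        configs=[(0.25, 5.0), (1.0, 1.0)],
        scores=[0.5, 0.75],
        bounds=[(0.0, 1.0), (1.0, 9.0)],
        score_bounds=(0.0, 1.0),
    )
    print(text)
```

A string of this kind is what a BPE tokenizer would then be trained on. Decoding a generated bin with `dequantize` recovers the original value only up to the bin width, which is the precision cost of discretizing into $Q$ bins.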

Key Contributions

BBO-Pile Dataset: The release of the largest public dataset for black-box optimization, comprising over 500K trajectories from 3,095 tasks and 6 optimizers, totaling ~2.5B tokens.
Foundation Model Training: The training of a family of foundation models ranging from 2M to 80M parameters and 200M to 2B training tokens.
Scaling Analysis: A systematic analysis of how decoder-based transformers imitate state-of-the-art BBO methods as parameter count and token budget scale.
Open-Source Release: Full availability of the dataset, model checkpoints, and code for training, generation, and evaluation on GitHub and HuggingFace.
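As a quick consistency check, the trajectory count follows directly from the figures given under Dataset Construction. The total number of function evaluations in the snippet is an extrapolation that assumes every run used its full budget of 100 evaluations, which the text does not state explicitly.

```python
# Figures quoted in the Dataset Construction description
NUM_TASKS = 3_095
NUM_OPTIMIZERS = 6
NUM_SEEDS = 30
EVALS_PER_RUN = 100  # budget per run; assumed to be fully used

# One trajectory per (task, optimizer, seed) combination
num_trajectories = NUM_TASKS * NUM_OPTIMIZERS * NUM_SEEDS

# Upper bound on evaluations under the full-budget assumption
num_evaluations = num_trajectories * EVALS_PER_RUN

if __name__ == "__main__":
    print(f"trajectories: {num_trajectories:,}")
    print(f"evaluations (assuming full budgets): {num_evaluations:,}")
```

The product matches the 557,100 trajectories reported for BBO-Pile.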

Results

Scaling Behavior: The models exhibit predictable scaling behavior similar to Large Language Models (LLMs). Validation loss follows a power law with respect to compute ($L \propto C^{-0.0157}$), though the exponent is smaller in magnitude than is typical for LLM pre-training, so additional compute yields only modest improvements.
Imitation of Optimizers: The trained models successfully imitate the optimization trajectories of the original optimizers (e.g., CQR and Random Search).
- Parameter Scaling: Larger models (e.g., 80M parameters) match the performance and sampling distribution of the original optimizers more closely than smaller models (e.g., 2M parameters) do, particularly in early iterations.
- Token Scaling: Models trained on token budgets exceeding 1B tokens closely match the original performance, whereas budgets below 800M tokens are insufficient to fully capture complex sampling distributions.
Generalization: The models demonstrate generalization capabilities:
- They perform well on unseen tasks within seen search spaces.
- They show competitive performance on tasks from unseen search spaces (e.g., TabRepo CatBoost tasks), though performance gaps widen on global optimization problems with highly variable loss landscapes.
- The models can distinguish between different optimization strategies (e.g., CQR vs. Random Search) and reproduce their specific behaviors, including marginal hyperparameter densities.
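To make the reported exponent concrete, the sketch below computes how much the validation loss would shrink when compute is multiplied by a given factor. It assumes the pure proportionality $L \propto C^{-0.0157}$ with no irreducible loss term; the function name `loss_ratio` is illustrative, not from the paper.

```python
EXPONENT = -0.0157  # compute exponent reported in the text


def loss_ratio(compute_multiplier: float, exponent: float = EXPONENT) -> float:
    """Return L(k * C) / L(C) under the assumed pure power law L ∝ C^exponent."""
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    return compute_multiplier ** exponent


if __name__ == "__main__":
    for k in (2, 10, 100):
        ratio = loss_ratio(k)
        reduction = (1.0 - ratio) * 100.0
        print(f"{k:>4}x compute -> loss ratio {ratio:.4f} ({reduction:.2f}% lower)")
```

Under this assumption, ten times more compute lowers the loss by roughly 3.5%, which illustrates why the gains from scaling are described as modest.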

Significance and Claims
The paper claims that large-scale pre-training on BBO-Pile is a viable and effective approach to imitating black-box optimization methods. The work establishes that foundation models can learn optimization principles from data, potentially overcoming the specialization and lack of generalization inherent in manually designed methods. By providing the first large-scale, open-source dataset and demonstrating scaling laws, the authors lay groundwork for future research into more powerful, generalizable optimization agents. The authors note that the models currently imitate existing strategies rather than inventing new ones, and that future work is needed to address limitations in generalizing to domains with different characteristics (e.g., chemical design) and to explore reasoning-based or test-time scaling approaches.

An Open-Source Training Dataset for Foundation Models for Black-box Optimization