OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

OptiMer is a post-hoc optimization framework that decouples data-mixture selection from continual pre-training. It extracts distribution vectors from models trained on individual datasets and uses Bayesian optimization to find optimal composition weights, achieving superior performance at significantly lower search cost than traditional data mixing.

Haiyue Song, Masao Utiyama

Published 2026-04-01
📖 4 min read · ☕ Coffee break read

Imagine you are a master chef trying to create the perfect "Global Fusion" dish. You have four distinct, high-quality ingredients in your pantry:

  1. Japanese Cuisine (for flavor and nuance)
  2. Chinese Cuisine (for depth and history)
  3. Math (for precision and logic)
  4. Coding (for structure and function)

Your goal is to mix these ingredients into one giant pot of soup (a Large Language Model) that tastes amazing in all four categories.

The Old Way: The "Guess-and-Check" Soup

Traditionally, chefs (AI researchers) had to decide the recipe before they even turned on the stove. They had to guess: "Maybe I'll use 50% Japanese, 20% Chinese, 20% Math, and 10% Code."

They would then spend weeks cooking this massive pot of soup.

  • The Problem: If they guessed wrong (e.g., too much Math ruined the Japanese flavor), the whole pot was ruined.
  • The Cost: They couldn't just taste it halfway through. They had to wait until the end to realize, "Oh no, this tastes like burnt code!" Then, they'd have to throw it away, buy new ingredients, and start cooking for another few weeks.
  • The Result: A lot of wasted time, money, and electricity.

The New Way: OPTIMER (The "Taste-Test" Lab)

The authors of this paper, Haiyue Song and Masao Utiyama, came up with a brilliant new method called OPTIMER. Instead of mixing the ingredients in one giant pot, they changed the strategy completely.

Step 1: The "Single-Flavor" Trials

Instead of mixing everything at once, they cook four tiny, separate pots:

  • One pot with only Japanese.
  • One pot with only Chinese.
  • One pot with only Math.
  • One pot with only Code.

They also have a "Master Sauce" (Instruction Tuning) that makes the soup edible and polite.

Step 2: Extracting the "Flavor Essence"

Once these tiny pots are done, they don't serve the soup. Instead, they use a special machine to extract the Flavor Essence (called a Distribution Vector) from each pot.

  • Think of this as taking a tiny vial of pure "Japanese-ness" and a vial of pure "Math-ness."
  • Crucially, these essences behave like orthogonal colors (red and blue). They don't clash; they sit side by side without ruining each other.
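In practice, a distribution vector is the difference between a dataset-trained model's parameters and the shared base model's parameters. Here is a minimal sketch with toy three-parameter "models" (the names and numbers are illustrative, not from the paper):

```python
# Toy sketch: a "distribution vector" as the parameter delta between a
# dataset-trained model and the shared base checkpoint it started from.
base = {"w1": 0.10, "w2": -0.20, "w3": 0.50}          # shared base model
trained_math = {"w1": 0.30, "w2": -0.10, "w3": 0.45}  # after training on Math only

def extract_vector(trained, base):
    """Subtract the base weights to isolate what this dataset taught the model."""
    return {k: trained[k] - base[k] for k in base}

math_vector = extract_vector(trained_math, base)
print(math_vector)  # the "Math-ness" essence, one entry per parameter
```

Repeating this for each of the four tiny pots yields four such vectors, all measured against the same base.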

Step 3: The "Magic Blender" (Bayesian Optimization)

Now, they have a high-tech blender (the OPTIMER algorithm).

  • They don't guess the recipe. Instead, the blender runs a super-fast simulation.
  • It tries mixing the essences in thousands of different ways in minutes.
  • It asks: "What if I use 60% Japanese essence, 10% Math, and 30% Code? Does that score high on the taste test?"
  • It quickly finds the perfect ratio that makes the final soup taste great in all categories.
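Because scoring a candidate mix needs no retraining, the search itself is cheap. The paper uses Bayesian optimization; in this toy sketch a brute-force grid search stands in for it, and the per-domain "skill" numbers are made up as a proxy for benchmark scores:

```python
import itertools

# Made-up proxy: how much each essence contributes to each domain's "taste".
essences = {
    "japanese": {"ja": 0.9, "zh": 0.1, "math": 0.0, "code": 0.0},
    "chinese":  {"ja": 0.1, "zh": 0.9, "math": 0.0, "code": 0.0},
    "math":     {"ja": 0.0, "zh": 0.0, "math": 0.9, "code": 0.2},
    "code":     {"ja": 0.0, "zh": 0.0, "math": 0.2, "code": 0.9},
}

def score(weights):
    """Score a candidate mix by its weakest domain: all four must taste good."""
    mixed = {d: sum(weights[n] * essences[n][d] for n in essences)
             for d in ["ja", "zh", "math", "code"]}
    return min(mixed.values())

# Try every weight combination on a coarse grid whose entries sum to 1.0.
grid = [i / 10 for i in range(11)]
best = max(
    (dict(zip(essences, ws)) for ws in itertools.product(grid, repeat=4)
     if abs(sum(ws) - 1.0) < 1e-9),
    key=score,
)
print(best, score(best))
```

A real Bayesian optimizer would explore these candidates far more efficiently than a grid, but the point is the same: each evaluation is a cheap arithmetic mix, not weeks of cooking.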

Step 4: The Final Dish

Once the blender finds the perfect mix, they simply pour the essences together. No new cooking is required. They instantly have a perfect "Global Fusion" soup.
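The "pour" is just weight arithmetic: add the weighted distribution vectors back onto the base model. A minimal sketch, again with toy parameters and illustrative weights (not the paper's actual ratios):

```python
# Toy sketch of the final merge: base model plus the weighted sum of essences.
base = {"w1": 0.10, "w2": -0.20, "w3": 0.50}
vectors = {
    "math": {"w1": 0.20, "w2": 0.10, "w3": -0.05},
    "code": {"w1": -0.10, "w2": 0.05, "w3": 0.15},
}
weights = {"math": 0.6, "code": 0.4}  # the ratio the optimizer picked

merged = {
    k: base[k] + sum(weights[name] * vec[k] for name, vec in vectors.items())
    for k in base
}
print(merged)  # a new model, assembled with no additional training
```

Swapping in a different weight dictionary produces a differently flavored model instantly, which is exactly the flexibility described below.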

Why is this a Game-Changer?

  1. Speed: The old way took weeks to test one recipe. OPTIMER finds the best recipe in minutes. It's 15 to 35 times faster.
  2. Flexibility: If you suddenly decide you want a "Math-Heavy" soup instead of a "Japanese-Heavy" one, you don't need to cook again! You just take the same four vials of essence and ask the blender to find a new mix. You get a custom soup instantly.
  3. Better Taste: The paper shows that this method actually tastes better than the old "guess-and-check" method. The old method often ruined the delicate flavors (like making the code output look like gibberish), but OPTIMER keeps everything balanced.

The Big Takeaway

This paper proves that you don't need to be a fortune teller to mix data for AI. You don't have to guess the recipe before you start cooking.

Instead, you can cook small, separate experiments, extract the "lessons" (essences) from them, and then use a smart computer to mix those lessons together after the fact. It turns a slow, expensive, high-stakes gamble into a fast, flexible, and precise science.

In short: Stop guessing the recipe before you cook. Cook the parts separately, extract the magic, and let the computer mix the perfect potion for you.