OptiMer: Optimal Distribution Vector Merging Is Better than Data Mixing for Continual Pre-Training

OptiMer is a post-hoc optimization framework that decouples data-mixture selection from continual pre-training. It extracts distribution vectors from models trained on individual datasets and uses Bayesian optimization to find optimal composition weights, achieving superior performance at significantly lower search cost than traditional data mixing.

Haiyue Song, Masao Utiyama

Published 2026-04-01
📖 4 min read · ☕ Coffee break read

Imagine you are a master chef trying to create the perfect "Global Fusion" dish. You have four distinct, high-quality ingredients in your pantry:

  1. Japanese Cuisine (for flavor and nuance)
  2. Chinese Cuisine (for depth and history)
  3. Math (for precision and logic)
  4. Coding (for structure and function)

Your goal is to mix these ingredients into one giant pot of soup (a Large Language Model) that tastes amazing in all four categories.

The Old Way: The "Guess-and-Check" Soup

Traditionally, chefs (AI researchers) had to decide the recipe before they even turned on the stove. They had to guess: "Maybe I'll use 50% Japanese, 20% Chinese, 20% Math, and 10% Code."

They would then spend weeks cooking this massive pot of soup.

  • The Problem: If they guessed wrong (e.g., too much Math ruined the Japanese flavor), the whole pot was ruined.
  • The Cost: They couldn't just taste it halfway through. They had to wait until the end to realize, "Oh no, this tastes like burnt code!" Then, they'd have to throw it away, buy new ingredients, and start cooking for another few weeks.
  • The Result: A lot of wasted time, money, and electricity.

The New Way: OPTIMER (The "Taste-Test" Lab)

The authors of this paper, Haiyue Song and Masao Utiyama, came up with a brilliant new method called OPTIMER. Instead of mixing the ingredients in one giant pot, they changed the strategy completely.

Step 1: The "Single-Flavor" Trials

Instead of mixing everything at once, they cook four tiny, separate pots:

  • One pot with only Japanese.
  • One pot with only Chinese.
  • One pot with only Math.
  • One pot with only Code.

They also have a "Master Sauce" (Instruction Tuning) that makes the soup edible and polite.

Step 2: Extracting the "Flavor Essence"

Once these tiny pots are done, they don't serve the soup. Instead, they use a special machine to extract the Flavor Essence (called a Distribution Vector) from each pot.

  • Think of this as taking a tiny vial of pure "Japanese-ness" and a vial of pure "Math-ness."
  • Crucially, these essences behave like orthogonal colors (red and blue). They don't clash; they sit side by side without ruining each other.
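In practice, a distribution vector is the difference between a dataset-trained model's parameters and the shared base model's parameters. Here is a minimal sketch with toy three-parameter "models" (the names and numbers are illustrative, not from the paper):

```python
# Toy sketch: a "distribution vector" as the parameter delta between a
# dataset-trained model and the shared base checkpoint it started from.
base = {"w1": 0.10, "w2": -0.20, "w3": 0.50}          # shared base model
trained_math = {"w1": 0.30, "w2": -0.10, "w3": 0.45}  # after training on Math only

def extract_vector(trained, base):
    """Subtract the base weights to isolate what this dataset taught the model."""
    return {k: trained[k] - base[k] for k in base}

math_vector = extract_vector(trained_math, base)
print(math_vector)  # the "Math-ness" essence, one entry per parameter
```

Repeating this for each of the four tiny pots yields four such vectors, all measured against the same base.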

Step 3: The "Magic Blender" (Bayesian Optimization)

Now, they have a high-tech blender (the OPTIMER algorithm).

  • They don't guess the recipe. Instead, the blender runs a super-fast simulation.
  • It tries mixing the essences in thousands of different ways in minutes.
  • It asks: "What if I use 60% Japanese essence, 10% Math, and 30% Code? Does that score high on the taste test?"
  • It quickly finds the perfect ratio that makes the final soup taste great in all categories.
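Because scoring a candidate mix needs no retraining, the search itself is cheap. The paper uses Bayesian optimization; in this toy sketch a brute-force grid search stands in for it, and the per-domain "skill" numbers are made up as a proxy for benchmark scores:

```python
import itertools

# Made-up proxy: how much each essence contributes to each domain's "taste".
essences = {
    "japanese": {"ja": 0.9, "zh": 0.1, "math": 0.0, "code": 0.0},
    "chinese":  {"ja": 0.1, "zh": 0.9, "math": 0.0, "code": 0.0},
    "math":     {"ja": 0.0, "zh": 0.0, "math": 0.9, "code": 0.2},
    "code":     {"ja": 0.0, "zh": 0.0, "math": 0.2, "code": 0.9},
}

def score(weights):
    """Score a candidate mix by its weakest domain: all four must taste good."""
    mixed = {d: sum(weights[n] * essences[n][d] for n in essences)
             for d in ["ja", "zh", "math", "code"]}
    return min(mixed.values())

# Try every weight combination on a coarse grid whose entries sum to 1.0.
grid = [i / 10 for i in range(11)]
best = max(
    (dict(zip(essences, ws)) for ws in itertools.product(grid, repeat=4)
     if abs(sum(ws) - 1.0) < 1e-9),
    key=score,
)
print(best, score(best))
```

A real Bayesian optimizer would explore these candidates far more efficiently than a grid, but the point is the same: each evaluation is a cheap arithmetic mix, not weeks of cooking.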

Step 4: The Final Dish

Once the blender finds the perfect mix, they simply pour the essences together. No new cooking is required. They instantly have a perfect "Global Fusion" soup.
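The "pour" is just weight arithmetic: add the weighted distribution vectors back onto the base model. A minimal sketch, again with toy parameters and illustrative weights (not the paper's actual ratios):

```python
# Toy sketch of the final merge: base model plus the weighted sum of essences.
base = {"w1": 0.10, "w2": -0.20, "w3": 0.50}
vectors = {
    "math": {"w1": 0.20, "w2": 0.10, "w3": -0.05},
    "code": {"w1": -0.10, "w2": 0.05, "w3": 0.15},
}
weights = {"math": 0.6, "code": 0.4}  # the ratio the optimizer picked

merged = {
    k: base[k] + sum(weights[name] * vec[k] for name, vec in vectors.items())
    for k in base
}
print(merged)  # a new model, assembled with no additional training
```

Swapping in a different weight dictionary produces a differently flavored model instantly, which is exactly the flexibility described below.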

Why is this a Game-Changer?

  1. Speed: The old way took weeks to test one recipe. OPTIMER finds the best recipe in minutes. It's 15 to 35 times faster.
  2. Flexibility: If you suddenly decide you want a "Math-Heavy" soup instead of a "Japanese-Heavy" one, you don't need to cook again! You just take the same four vials of essence and ask the blender to find a new mix. You get a custom soup instantly.
  3. Better Taste: The paper shows that this method actually tastes better than the old "guess-and-check" method. The old method often ruined the delicate flavors (like making the code output look like gibberish), but OPTIMER keeps everything balanced.

The Big Takeaway

This paper proves that you don't need to be a fortune teller to mix data for AI. You don't have to guess the recipe before you start cooking.

Instead, you can cook small, separate experiments, extract the "lessons" (essences) from them, and then use a smart computer to mix those lessons together after the fact. It turns a slow, expensive, high-stakes gamble into a fast, flexible, and precise science.

In short: Stop guessing the recipe before you cook. Cook the parts separately, extract the magic, and let the computer mix the perfect potion for you.