Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization

Imagine you are a head chef trying to create the world's best soup (a Large Language Model). You have a pantry full of different ingredients: some are spicy (Math), some are savory (Code), some are sweet (General Knowledge), and some are salty (Chinese language).

The big question is: How much of each ingredient should you put in the pot to make the soup taste the best?

If you guess wrong, the soup might be too bland or too spicy. If you try to figure this out by cooking a giant 50-gallon pot for every single recipe variation, you'll run out of money and time before you ever serve a bowl.

This paper introduces a new method called CAMEL (Capacity-Aware Mixture Law) that acts like a super-smart sous-chef. It helps you figure out the perfect recipe without having to cook the giant pot a thousand times.

Here is how it works, broken down into four simple steps:

1. The Problem: Small Pots Don't Predict Big Pots

In the past, chefs tried to find the perfect recipe by cooking tiny tasting spoons (small models) and assuming that what worked for the spoon would work for the giant pot.

The Issue: Sometimes, a recipe that tastes great in a small spoon tastes terrible in a giant pot. A small model might need more "Code" to learn, but a giant model might need more "Knowledge." The needs change as the pot gets bigger.

2. The Solution: The "Capacity-Aware" Recipe Book

The authors realized that the "size" of the pot changes how the ingredients interact. They created a new mathematical law (CAMEL) that understands this relationship.

The Analogy: Imagine you are building a house.
- Small House: You need a lot of bricks (Code/Math) to build the walls because the structure is fragile.
- Mansion: You still need bricks, but now you have so much space that you need a lot of furniture and art (General Knowledge) to fill the rooms and make it livable.
- CAMEL's Job: It doesn't just look at the ingredients; it looks at the size of the house and tells you how the ingredient mix needs to shift as the house grows. It predicts that as your model gets bigger, you should increase the share of general knowledge and decrease the share of math and code.

3. The "Hourglass" Strategy: Cooking Smarter, Not Harder

Even with a smart recipe book, you still need to test some recipes. But you have a limited budget for gas and ingredients. How do you spend that budget?

The Old Way (The Rectangle): You cook 10 small pots, 10 medium pots, and 10 big pots. This is expensive and wastes time on the "medium" pots, which don't teach you as much.
The New Way (The Hourglass): The authors discovered the best strategy is to focus your energy on the extremes.
- Cook a few tiny pots (to see the basics).
- Cook a few giant pots (to see the limits).
- Skip the middle sizes.
- Why? It's like trying to guess the shape of a hill. If you only look at the middle, you might think it's flat. If you look at the very bottom and the very top, you can sketch the whole curve much more reliably. With this "Hourglass" strategy, CAMEL cuts the computing cost (gas and ingredients) of finding the recipe by about 50% while giving a more accurate prediction.

4. The Magic Trick: Predicting the Taste Without Eating

Usually, to know if a soup is good, you have to taste it (run a benchmark test). But CAMEL has a shortcut.

It measures the "flavor profile" (validation loss) while the soup is cooking.
It has a special formula that says: "If the flavor profile looks like X, the final taste score on the 'Math Test' will be Y."
This allows them to predict the final performance of the giant model just by looking at the data from the smaller tests.

The Results

When they tested this on a massive model (55 billion parameters, which is huge!):

Cost: They used about half the computing power of previous methods.
Performance: The resulting model scored up to 3% higher on tasks like math, coding, and reasoning than models trained with "human guesswork" or older methods.
Speed: They found the recipe using less compute than it takes to cook the giant pot just once.

Summary

CAMEL is a smart system that tells AI developers: "Don't just guess the recipe based on small tests. Look at how the model size changes the needs, focus your testing on the very small and very large models, and use a special formula to predict the final taste."

It's the difference between a chef who cooks 100 pots of soup to find the right recipe, and a chef who uses a scientific formula to find the perfect recipe after cooking just a few, saving time, money, and energy.

1. Problem Statement

Large Language Models (LLMs) are typically trained on mixtures of diverse data domains (e.g., code, math, general knowledge). Determining the optimal mixture ratios is critical for downstream performance, particularly during mid-training phases where data quality is prioritized over quantity.

Current Limitations:
- Direct Search: Exhaustively searching for optimal mixtures on large target models is computationally prohibitive.
- Proxy Model Transfer: Optimizing mixtures on small models and transferring them to larger ones often fails because mixture effects are not scale-invariant.
- Existing Scaling Laws: Prior methods (e.g., Data Mixing Laws) often treat model size and data mixture as separable factors or fail to extrapolate effectively to very large models (e.g., >50B parameters). They also typically optimize for validation loss, which does not always correlate perfectly with downstream benchmark accuracy.

2. Methodology: The CAMEL Framework

The authors propose CAMEL (Capacity-Aware Mixture Law), a compute-efficient pipeline that models the interplay between model size, data mixture, and downstream performance. The framework consists of three core components:

A. Capacity-Aware Mixture Scaling Law

Instead of treating model size and mixture ratios independently, CAMEL models them as a joint optimization problem based on capacity allocation.

Theoretical Basis: The authors view pretraining as a process where a model's parameter capacity ($M$) is dynamically distributed across intrinsic data domains based on the data mixture ($r$).
Mathematical Formulation:
- They assume training loss on intrinsic domain $i$ follows a power law dependent on allocated capacity $\tilde{m}_i$.
- They solve a constrained optimization problem to minimize the mixture-weighted sum of intrinsic losses subject to the total capacity budget $M$.
- Resulting Law: The validation loss is modeled as:
  $L_{val}(r, M) = C + \sum_{i=1}^{k} \frac{K_i}{\langle t_i, r \rangle^{\alpha_i} M^{\beta_i}}$
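  where $\langle t_i, r \rangle$ represents the effective weight of intrinsic domain $i$ induced by the mixture $r$, $k$ is the number of intrinsic domains, and $\alpha_i, \beta_i$ are learned exponents. Because the mixture term and the $M$ term multiply inside each summand, this formulation captures the non-linear interaction between mixture and scale.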
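To make the shape of the law concrete, here is a minimal Python sketch that simply evaluates the expression above. Every number in it (the constants `C` and `K`, the rows of `T`, the exponents, and the choice to measure `M` in billions of parameters) is a made-up placeholder for illustration, not a value fitted in the paper.

```python
def camel_val_loss(r, M, C, K, T, alpha, beta):
    """Evaluate L_val(r, M) = C + sum_i K_i / (<t_i, r>**alpha_i * M**beta_i).

    r     : mixture ratios over the observed data domains (sums to 1)
    M     : model size (any consistent unit; billions of parameters here)
    T     : one row t_i per intrinsic domain, same length as r
    K, alpha, beta : one entry per intrinsic domain
    """
    total = C
    for K_i, t_i, a_i, b_i in zip(K, T, alpha, beta):
        weight = sum(t * x for t, x in zip(t_i, r))  # <t_i, r>
        total += K_i / (weight ** a_i * M ** b_i)
    return total


# Hypothetical parameters, chosen only to illustrate the call.
r = [0.5, 0.3, 0.2]                      # e.g. knowledge, code, math
T = [[1.0, 0.0, 0.0],                    # identity rows: each intrinsic
     [0.0, 1.0, 0.0],                    # domain maps to one data domain
     [0.0, 0.0, 1.0]]
C = 1.5
K = [2.0, 1.0, 1.5]
alpha = [0.4, 0.5, 0.3]
beta = [0.10, 0.15, 0.12]

loss_small = camel_val_loss(r, 0.5, C, K, T, alpha, beta)  # 0.5B model
loss_large = camel_val_loss(r, 7.0, C, K, T, alpha, beta)  # 7B model
print(loss_small, loss_large)
```

With positive exponents, every term shrinks as `M` grows, so the predicted loss for the larger model is lower and approaches `C` from above.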

B. Loss-to-Benchmark Prediction Law

To bridge the gap between validation loss and actual task performance, the authors introduce a secondary law.

Approach: They model downstream benchmark accuracy ($Acc_b$) as a logistic function of a vector of validation losses ($L$):
$Acc_b(L) = C_b + \frac{A_b}{1 + \exp(k_b^\top L + B_b)}$
Integration: This allows the system to predict benchmark accuracy directly from the fitted mixture law, enabling end-to-end optimization of the target metric rather than just minimizing loss.
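The logistic form can be sketched in a few lines of Python. The function name `predict_accuracy` and all parameter values below are hypothetical; in practice $k_b$, $A_b$, $B_b$, and $C_b$ would be fitted per benchmark.

```python
import math


def predict_accuracy(losses, k_b, A_b, B_b, C_b):
    """Acc_b(L) = C_b + A_b / (1 + exp(k_b^T L + B_b)).

    losses : vector of validation losses L
    k_b    : weight vector with the same length as losses
    Note: math.exp raises OverflowError for very large arguments;
    this sketch assumes moderate values.
    """
    z = sum(k * l for k, l in zip(k_b, losses)) + B_b
    return C_b + A_b / (1.0 + math.exp(z))


# Hypothetical fit: floor of 0.25 (e.g. chance level on 4-way multiple
# choice) and a ceiling of 0.25 + 0.5 = 0.75.
k_b = [1.0, 1.0]
A_b, B_b, C_b = 0.5, -3.0, 0.25

acc_mid = predict_accuracy([1.0, 2.0], k_b, A_b, B_b, C_b)  # z = 0
acc_low_loss = predict_accuracy([0.5, 1.0], k_b, A_b, B_b, C_b)
print(acc_mid, acc_low_loss)
```

When the entries of `k_b` are positive, lower validation losses push the prediction toward the ceiling `C_b + A_b`, and higher losses push it toward the floor `C_b`.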

C. Compute-Aware Sampling Strategy (Hourglass)

To fit these laws efficiently under a fixed compute budget, the authors analyze sampling strategies across different model scales.

Finding: Uniform sampling (Rectangle) is suboptimal.
Proposed Strategy: The Hourglass strategy allocates more of the sampled mixtures (training runs) to the smallest and largest model scales and fewer to intermediate scales.
Rationale: This distribution minimizes extrapolation error by capturing both the base behavior of small models and the scaling trends of large models, providing the most informative data points for fitting the scaling law.
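One way to picture the Hourglass allocation is the sketch below, which splits a fixed number of runs across model scales using U-shaped weights. The weighting formula, the `edge_boost` parameter, and the function name are assumptions made for illustration; the summary does not give the authors' exact allocation rule, and this sketch counts runs only, ignoring the fact that runs at larger scales cost more compute.

```python
def hourglass_counts(num_scales, total_runs, edge_boost=2.0):
    """Split total_runs across num_scales model sizes (smallest first),
    giving the two extremes the most runs and the middle the fewest."""
    if num_scales < 2:
        raise ValueError("need at least two model scales")
    # U-shaped weights: 1 + edge_boost at both ends, 1 at the center.
    weights = [
        1.0 + edge_boost * abs(2.0 * j / (num_scales - 1) - 1.0)
        for j in range(num_scales)
    ]
    unit = total_runs / sum(weights)
    shares = [w * unit for w in weights]
    counts = [int(s) for s in shares]
    # Hand out runs lost to rounding, largest fractional part first.
    leftover = total_runs - sum(counts)
    order = sorted(range(num_scales),
                   key=lambda j: shares[j] - counts[j], reverse=True)
    for j in order[:leftover]:
        counts[j] += 1
    return counts


def rectangle_counts(num_scales, total_runs):
    """Uniform baseline: the same number of runs at every scale."""
    base, extra = divmod(total_runs, num_scales)
    return [base + (1 if j < extra else 0) for j in range(num_scales)]


print(hourglass_counts(5, 33))   # [9, 6, 3, 6, 9]
print(rectangle_counts(5, 33))   # [7, 7, 7, 6, 6]
```

Both functions spend the same total number of runs; only the distribution across scales differs.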

3. Key Contributions

Capacity-Aware Mixture Law (CAMEL): A novel scaling law that unifies data mixture and model size into a single expression, derived from a capacity-allocation perspective. It outperforms prior laws (DML, SODM) in prediction accuracy.
End-to-End Performance Prediction: A two-stage mapping (Mixture $\to$ Loss $\to$ Benchmark Accuracy) that allows for direct optimization of downstream benchmarks without training the target model.
Optimal Sampling Design: The discovery and validation of the "Hourglass" sampling strategy, which significantly reduces prediction error under fixed compute budgets compared to uniform or triangular sampling.
Large-Scale Verification: Successful extrapolation and verification on a 55B-A1.2B model (55B total parameters with 1.2B active, DeepSeek V3 architecture), demonstrating that mixtures derived from small models (up to 7B) can effectively guide large-scale training.

4. Experimental Results

The authors evaluated CAMEL against baselines (Human-designed mixtures, Model-size-agnostic methods, DML, and SODM) on a 55B-parameter target model.

Performance Gains:
- Benchmark Accuracy: CAMEL achieved the highest Weighted Average Score across multiple objectives (Balanced, Math-specialized, Code-specialized, Knowledge-specialized).
- Specific Improvements: On the Balanced objective, CAMEL improved the weighted average score by up to 3% compared to baselines.
- Generalization: The method showed strong generalization on held-out benchmarks not used during optimization, indicating it learns robust mixture principles rather than overfitting to specific tasks.
Efficiency:
- Cost Reduction: CAMEL reduced mixture optimization costs by 50% compared to baseline methods.
- Training Passes: High-quality mixtures were identified using less than one full training pass on the target model.
Scaling Insights: The study revealed that as model size increases, the optimal mixture shifts to favor Knowledge data more heavily, while the relative weight of Math and Code data decreases.
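A toy calculation shows how a scale-dependent optimum can fall out of a law of the form $C + \sum_i K_i / (\langle t_i, r \rangle^{\alpha_i} M^{\beta_i})$. The sketch below uses two domains, identity $t_i$, and invented constants; the only deliberate choice is that the math/code term has the larger $\beta$, so its loss shrinks faster with model size. These numbers are not from the paper, and the constant $C$ is dropped because it does not affect which mixture is best.

```python
def two_domain_loss(r_knowledge, M,
                    K=(1.0, 1.0), alpha=(0.5, 0.5), beta=(0.1, 0.3)):
    """Toy loss with domain 0 = knowledge, domain 1 = math/code.
    All constants are hypothetical; M is in arbitrary units."""
    r = (r_knowledge, 1.0 - r_knowledge)
    return sum(K[i] / (r[i] ** alpha[i] * M ** beta[i]) for i in range(2))


def best_knowledge_weight(M, steps=100):
    """Grid-search the knowledge share that minimizes the toy loss."""
    candidates = [i / steps for i in range(1, steps)]
    return min(candidates, key=lambda r: two_domain_loss(r, M))


for M in (1.0, 10.0, 100.0):
    print(M, best_knowledge_weight(M))
```

In this toy setup the best knowledge share starts at 0.5 for `M = 1.0` and rises as `M` grows, mirroring the direction of the shift reported above. If the two `beta` values were swapped, the shift would go the other way, so the direction depends entirely on the fitted exponents.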

5. Significance

This work addresses a critical bottleneck in LLM development: the high cost of data curation for large-scale models.

Scalability: It provides a principled, mathematically grounded method to predict optimal data mixtures for massive models (50B+) using only small-scale experiments.
Efficiency: By reducing the compute cost of data optimization by half, it makes the iterative refinement of LLM training data feasible for more organizations.
Theoretical Insight: The "Capacity-Aware" perspective offers a new understanding of how model capacity interacts with data composition, suggesting that effective parameter allocation is dynamic and scale-dependent.

In summary, CAMEL transforms data mixture optimization from an expensive, trial-and-error process into a predictable, compute-efficient scaling law, enabling more effective training of next-generation large language models.