Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions

Imagine you are a chef trying to perfect a recipe for a specific dish: Math Problems. You have a basic, untrained kitchen (the "Base Model") that knows how to cook but doesn't know the specific rules of your restaurant.

Over the last two years, dozens of new "cooking techniques" (algorithms like DPO, SimPO, SFT, GRPO) have been invented. Each paper claims their technique makes the food taste better. But here's the problem: every chef tested their technique on a different stove, with different ingredients, and different judges. No one knew which technique was actually the best, or if the results were just luck.

This paper, OXRL, is like a massive, controlled "Cook-Off" where every chef uses the exact same stove, the exact same ingredients, and the exact same judges. They tested 51 different techniques across 4 different sizes of kitchens (from a tiny food truck to a massive banquet hall).

Here are the four big discoveries, explained simply:

1. The "Size Matters" Surprise (Ranking Inversions)

The Analogy: Imagine you are teaching a dog to fetch.

Small Dog (1.5 Billion parameters): The best way to teach it is to run around with it, throw the ball, and praise it immediately when it gets it right. This is Online RL (SGRPO). It works great for small dogs.
Huge Dog (7 Billion parameters): Now, imagine a massive, powerful dog. If you try to run around with it, it gets confused. Instead, the best way to teach it is to show it a picture of a perfect fetch and say, "Do it like this." This is SimPO.

The Finding: The paper found that what works for a small model is the exact opposite of what works for a big model.

At the small scale, the "run-and-praise" method was the winner.
At the big scale, the "show-and-tell" method became the champion, while the "run-and-praise" method actually got worse.
Lesson: You cannot judge a technique based on small tests. A method that is #1 for a small model can be #10 for a big model.

2. The "New Sauce" Myth (Loss Functions)

The Analogy: Imagine you have a perfect steak recipe (Vanilla DPO). Then, 20 other chefs come along and say, "I added a secret spice!" or "I changed the marinating time!" or "I used a different type of salt!" They claim their version is 10% better.

The Finding: The researchers tested all 20 of these "secret spice" versions.

Result: None of them worked. In fact, one of them (SimPO) made the steak taste terrible at the small scale.
The only time a "new sauce" mattered was when it was actually a different cooking method entirely, not just a tweak to the recipe.
Lesson: Stop obsessing over tiny tweaks to the math formulas (loss functions). They don't make a real difference.

3. The "Specialist vs. Generalist" Trap

The Analogy: You train a student to be a Math Whiz.

On Math Tests: The difference between the "best" training method and the "worst" is huge (almost 20 points).
On History Tests: The difference between the "best" and "worst" training method is almost zero (less than 1 point).

The Finding: The algorithm you choose only matters if you are testing the model on the exact same type of problem it was trained on.

If you train a model on Math, the choice of algorithm changes its Math score drastically.
But if you ask that same model to write a poem or answer a history question, it doesn't matter which algorithm you used. They all perform the same.
Lesson: Don't pick a training method because you think it's "smarter" generally. Pick it only if you need to solve a very specific type of problem.

4. The Hierarchy of Success (What Actually Matters)

The authors created a "Power Ranking" of what actually makes a model better. Think of it like building a house:

🏗️ The Foundation (Model Scale): This is the biggest factor. Making the model bigger (from 1.5B to 7B parameters) improves performance by 50 points. This is like building a skyscraper instead of a shed.
🧱 The Blueprint (Training Paradigm): Whether you use "Online RL" (learning by doing) or "Offline" (learning from a book) matters about 10 points.
🔨 The Tools (Online vs. Offline): The specific way you run the training matters about 9 points.
🎨 The Paint Color (Loss Function): Tweaking the math formula matters only 1 point.

The Takeaway for Practitioners

If you are an AI developer trying to build a better model, here is your cheat sheet:

Don't waste time trying to invent a new "math formula" (loss function). It won't help.
Do focus on making your model bigger or choosing the right training style for your specific task.
Be careful with small tests: If a method looks great on a small model, it might fail miserably on a big one. Always test at the size you plan to deploy.
Use the "Vanilla" recipe: Unless you have a very specific reason, the standard DPO method is just as good as any of the 20 fancy variations.

In short: The paper tells us that in the world of AI, bigger is better, and the specific "secret sauce" you choose matters much less than you think. The biggest gains come from scale and the right training strategy, not from tweaking the fine print.

1. Problem Statement

Post-training alignment (RLHF, DPO, etc.) has produced dozens of competing algorithms (e.g., DPO, SimPO, KTO, GRPO). However, practitioners lack reliable, controlled comparisons to guide algorithm selection because:

Existing studies use different base models, datasets, and evaluation suites, making cross-method comparison unreliable.
It is unclear how algorithm rankings change across different model scales.
The impact of specific loss function modifications (over 20 DPO variants exist) is unknown.
The trade-offs between online (RL) and offline methods are not well quantified.

2. Methodology: The OXRL Framework

The authors introduce OXRL, a unified framework designed to eliminate implementation confounds and enable "apples-to-apples" comparisons.

Unified Infrastructure: All 51 implemented algorithms share identical model loading (HuggingFace), data pipelines, distributed training (DeepSpeed ZeRO-3), and evaluation harnesses. The only variable is the loss function computation.
Experimental Design:
- Model Scales: 4 scales (0.5B, 1.5B, 3B, 7B) using the Qwen 2.5 Instruct family.
- Algorithms: 8 core algorithms (SFT, PPO, GRPO variants, DPO, IPO, SimPO, KTO, ORPO) and a taxonomy of 20 DPO variants.
- Data: Preference pairs generated via self-play on GSM8K training data; SFT uses gold responses.
- Training: AdamW optimizer, cosine LR schedule, 3 epochs. LoRA (Rank 16) used for 7B and specific 3B factorial experiments; Full Fine-Tuning (FT) for smaller scales.
- Statistical Rigor: 100 runs for the DPO variant study (20 variants × 5 seeds). Pairwise Welch's t-tests with Bonferroni correction ( $\alpha \approx 0.0026$ ) were applied to control for multiple comparisons.
- Bug Discovery: The authors identified and fixed a hidden determinism bug in PyTorch's DistributedSampler that previously eliminated seed-dependent variance, ensuring the validity of their multi-seed results.

3. Key Contributions

First Large-Scale Controlled Comparison: The study spans ~240 training runs on H100 GPUs, covering 8 algorithms across 4 scales and 20 DPO variants.
Discovery of Scale-Dependent Ranking Inversions: The paper demonstrates that algorithm performance rankings are not static; they invert completely as model scale increases.
Quantification of Leverage: The authors establish a hierarchy of impact factors for post-training, showing that loss function engineering is the least impactful factor compared to model scale and training paradigm.
Living Benchmark: Release of OXRL as a community benchmark (code, configs, raw data) analogous to GLUE or HELM but focused on training algorithms.

4. Key Results

A. Scale-Dependent Ranking Inversions

The most striking finding is that the "best" algorithm changes drastically based on model size:

At 1.5B: Online RL (SGRPO) is the clear winner (58.0% accuracy on GSM8K), outperforming SFT by 3.6 pp and DPO by 8.9 pp. SimPO performs poorly (38.7%).
At 7B: The ranking inverts completely. SimPO becomes the best performer (85.8%), while SFT collapses to near-baseline performance (+0.6 pp).
Cause: A 2×2 factorial experiment (3B/7B × Full FT/LoRA) confirmed that this inversion is driven by model scale, not LoRA regularization. SimPO's success at 7B is linked to superior format compliance (strict-match accuracy), whereas SFT produces more format-noncompliant outputs at large scales.

B. DPO Variants: "Zero Significant Winners"

Result: Out of 20 DPO variants tested at 1.5B, none significantly outperformed vanilla DPO after Bonferroni correction.
Outlier: SimPO was the only statistically significant outlier, but it performed worse than vanilla DPO by 11.5 pp ( $p < 10^{-4}$ ).
Implication: Most loss function modifications are "cosmetic." The community has over-invested in loss function engineering (low leverage) while ignoring higher-leverage factors like data and scale.

C. Task-Specific Algorithm Leverage

Algorithm choice matters primarily within the training distribution and for format-sensitive tasks:

GSM8K (Training Task): Large performance spread (19.3 pp).
MATH (Harder Reasoning): Spread collapses to 0.54 pp (36× compression). Rankings invert again (SGRPO drops to 4th, SimPO rises to 2nd).
General Domain (ARC, HellaSwag): Spread collapses to 0.47 pp (41× compression). No method meaningfully outperforms the base model.
Conclusion: Post-training algorithms do not transfer advantages to out-of-distribution tasks; they are highly task-specific.

D. Hierarchy of Leverage

The authors quantify the impact of different design decisions on performance (measured in percentage points):

Model Scale: $\sim$ 50 pp (Dominant factor)
Training Paradigm: $\sim$ 10 pp
Online vs. Offline: $\sim$ 9 pp (Highly task-dependent)
Loss Function: $\sim$ 1 pp (Negligible)

5. Significance and Recommendations

For Practitioners:
- Do not trust small-scale benchmarks: Rankings at 1.5B or 3B do not predict 7B behavior.
- Prioritize Scale and Data: Investing in model scale or better data yields far greater returns than tuning loss functions.
- Algorithm Selection:
  - At $\le$ 1.5B: Prefer SFT (cheapest, strongest offline).
  - At $\ge$ 7B: Prefer SimPO (best accuracy and efficiency).
  - For Online RL: Use SGRPO (token-level) on format-sensitive tasks.
  - Use Vanilla DPO; variants offer no significant gain.
For the Community:
- The field should shift focus from inventing new loss functions to understanding the interaction between algorithms, scale, and task structure.
- Future studies must verify seed propagation in distributed training to avoid false determinism.

This paper fundamentally challenges the assumption that newer post-training algorithms are universally superior, demonstrating instead that performance is a complex function of model capacity and task structure rather than just the optimization objective.