Here is an explanation of the paper "To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models" using simple language and creative analogies.
The Big Question: How Do We Build a "Super-Expert" AI?
Imagine you want to build the ultimate AI assistant. You don't just want it to be good at math; you want it to be a genius at coding, a scientist, a creative writer, and a helpful agent that can use tools.
The paper asks a simple question: What is the best way to train this AI?
There are two main ways to do it, like two different ways to train a group of students:
- The "Mix" Method (Multi-Task Training): You put the math student, the coder, and the scientist in the same classroom at the same time. They all learn together, swapping notes and helping each other.
- The "Merge" Method (Separate Training + Merging): You train the math student alone until they are a genius. You train the coder alone. You train the scientist alone. Then, you take their brains (weights) and physically combine them into one super-brain.
The researchers wanted to know: Which method creates a better, more reliable AI?
The Experiment: The "Gym" of AI Training
The researchers took a base AI (Qwen3-4B) and put it through a rigorous training camp called RLVR (Reinforcement Learning with Verifiable Rewards). Think of this as a gym where the AI gets a "gold star" (reward) only if it solves a problem correctly. There is no guessing; the answer is either right or wrong.
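The "gold star" idea can be sketched in a few lines. This is a hypothetical toy, not the paper's actual reward code: the function name and the exact-match check are illustrative assumptions, but they capture the key property of RLVR, that the reward is binary and checkable, with no learned judge and no partial credit.

```python
# Toy sketch of a "verifiable reward" (hypothetical, not the paper's code):
# the model earns 1.0 only when its answer matches a checkable ground truth.
# There is no guessing rewarded -- the answer is either right or wrong.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return a binary reward: 1.0 for a correct answer, 0.0 otherwise."""
    # Normalize whitespace and case so trivial formatting doesn't matter.
    if model_answer.strip().lower() == ground_truth.strip().lower():
        return 1.0  # gold star
    return 0.0      # no partial credit


# A math problem with one checkable answer:
print(verifiable_reward("42", " 42 "))  # 1.0
print(verifiable_reward("41", "42"))    # 0.0
```

Real verifiers are task-specific (a math checker, a code test runner), but they all reduce to this same right-or-wrong signal.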
They trained the AI in five different "sports":
- Math (Solving complex equations)
- Coding (Writing software)
- Science (Answering tricky physics/biology questions)
- Instruction Following (Doing exactly what you say, like "write a poem in the shape of a cat")
- Agent (Using tools, like searching the web or running code)
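For the "Mix" method, the five sports above all happen in the same gym session. Here is a minimal, hypothetical sketch of what that looks like as data: every training batch draws prompts from all five domains at once. The domain names match the paper's setup, but the prompts and the uniform sampling scheme are illustrative assumptions.

```python
import random

# Hypothetical "Mix" setup: each batch samples prompts across all five
# domains, so the AI practices every sport in the same session.
DOMAINS = {
    "math":        ["Solve: 12 * 7 = ?"],
    "coding":      ["Write a function that reverses a string."],
    "science":     ["Why does ice float on water?"],
    "instruction": ["Write a poem in the shape of a cat."],
    "agent":       ["Search the web, then summarize what you found."],
}

def sample_mixed_batch(batch_size: int, seed: int = 0) -> list[tuple[str, str]]:
    """Return (domain, prompt) pairs drawn uniformly across domains."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        domain = rng.choice(sorted(DOMAINS))
        batch.append((domain, rng.choice(DOMAINS[domain])))
    return batch

for domain, prompt in sample_mixed_batch(5):
    print(f"{domain:12s} | {prompt}")
```

The "Merge" method, by contrast, would run five separate training loops, each seeing only one of these domains.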
The Results: Surprising Discoveries
1. The "Mix" Method is a Time-Saver
The researchers found that training the AI on all subjects at the same time (Mix) matched the performance of training them separately and then combining them (Merge).
- The Analogy: Imagine a chef learning to cook Italian, Chinese, and Mexican food. You might think they need to master one cuisine before moving to the next. But the study shows that if you practice all three in the same kitchen, they actually help each other! The chef learns faster and uses less energy (GPU hours).
- The Finding: The "Mix" method achieved the same expert-level performance but used 36% less computing power (time and electricity).
2. No "Traffic Jams" in the Brain
A common fear in AI is "gradient interference"—where learning math confuses the AI about how to code.
- The Analogy: It's like worrying that learning to play the piano will make you forget how to play the guitar.
- The Finding: The researchers looked inside the AI's "brain" (its weights) and found that learning math, coding, and science did not confuse each other. In fact, they helped each other! The "reasoning" sports (Math, Coding, Science) acted like a support group, making each other stronger.
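A standard way to check for this kind of interference is to compare the direction each task pulls the shared weights: if the two gradients point the same way (positive cosine similarity), the tasks help each other; if they point in opposite directions, they fight. The sketch below is a toy illustration of that measurement, with made-up quadratic losses standing in for the real math and coding objectives.

```python
import math

# Toy sketch of measuring "gradient interference" (illustrative losses,
# not the paper's). Positive cosine similarity between two tasks'
# gradients means they pull the shared weights in compatible directions.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def grad(weights: list[float], target: float) -> list[float]:
    """Gradient of the toy loss sum((w - target)^2), which is 2*(w - target)."""
    return [2 * (w - target) for w in weights]

weights = [0.2, -0.3, 0.5, 0.1]     # toy shared "brain"
g_math = grad(weights, target=1.0)  # math task's pull on the weights
g_code = grad(weights, target=0.9)  # coding task's pull on the weights

# Close to +1.0 here: the tasks agree, no "traffic jam".
print(f"cosine similarity: {cosine_similarity(g_math, g_code):.3f}")
```

A value near -1 would be the piano-makes-you-forget-guitar scenario; the paper's finding is that the reasoning domains land on the positive side.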
3. The "Merge" Method is Like a Perfect Blend
When they did train the experts separately and then merged them, they found that simply averaging their brains worked surprisingly well.
- The Analogy: It's like taking the best recipes from three different master chefs and blending them into one "Master Cookbook." The result isn't a messy mix; it's a cohesive, high-quality guide.
- The Finding: The "Merge" method preserved the specific skills of each expert almost perfectly, proving that these skills live in different parts of the brain that don't clash.
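The "simply averaging" step really is simple. Below is a minimal sketch of uniform weight averaging, using tiny made-up "brains" (a real merge averages millions of parameters per layer, and the paper may use a more refined merging scheme; the element-wise mean shown here is the plain-average baseline).

```python
# Toy sketch of the "Merge" step: uniform weight averaging. Each expert
# is a dict mapping layer names to lists of weights; merging is just an
# element-wise mean across experts. The expert values are made up.

def merge_experts(*experts: dict[str, list[float]]) -> dict[str, list[float]]:
    """Average several experts' weights into one merged model."""
    merged = {}
    for name in experts[0]:
        params = [expert[name] for expert in experts]
        merged[name] = [sum(vals) / len(vals) for vals in zip(*params)]
    return merged

# Three separately trained "brains":
math_expert    = {"layer1": [1.0, 2.0], "layer2": [0.5, 0.5]}
code_expert    = {"layer1": [3.0, 0.0], "layer2": [0.5, 1.5]}
science_expert = {"layer1": [2.0, 1.0], "layer2": [2.0, 1.0]}

merged = merge_experts(math_expert, code_expert, science_expert)
print(merged)  # {'layer1': [2.0, 1.0], 'layer2': [1.0, 1.0]}
```

That this blunt averaging barely hurts any expert's skill is the evidence that the skills "live in different parts of the brain."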
The Twist: The "Self-Verification" Trap
This is the most interesting part of the paper. The researchers tested if the AI could judge its own work (Self-Verification). They asked: Can the AI tell if its own answer is right before it shows it to you?
They found a strange trade-off:
- The "Outcome" Judge (Intuition): If you ask the AI, "Is the final answer right?", it gets better at this as it trains. It develops a good gut feeling.
- The "Process" Judge (Reasoning): If you ask the AI, "Did you think through the steps correctly?", something weird happens.
- The Analogy: Imagine a student who studies so hard for a final exam that they memorize the answers perfectly but forget how to solve the problems. They can tell you the right answer (Outcome), but if you ask them to explain their logic step-by-step, they stumble.
The Finding:
- Reasoning Tasks (Math/Coding): The AI gets great at checking the steps.
- Instruction Tasks (Formatting): The AI gets worse at checking the steps. It starts hallucinating that its plan was perfect, even when the final output is messy.
- The "Agent" Advantage: The AI trained specifically as an "Agent" (using tools) was the best at checking its own steps. Why? Because using tools forces you to check your work constantly. If you type the wrong command, the tool fails immediately. This "real-time feedback" taught the AI to be a strict self-critic.
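The "real-time feedback" loop behind the Agent advantage can be sketched as a toy control flow (the tool, the command names, and the revise message are all hypothetical): each tool call either succeeds or fails immediately, so a wrong step is caught at that step, not at the end of the answer.

```python
# Toy sketch of why tool use forces self-checking: a wrong command fails
# immediately, so the agent gets a correction signal at every step.
# The tool and its one known command are made up for illustration.

def run_tool(command: str) -> tuple[bool, str]:
    """Toy tool: only understands one command; anything else fails loudly."""
    if command == "calc 2+2":
        return True, "4"
    return False, f"error: unknown command {command!r}"

def agent_step(plan: list[str]) -> str:
    for command in plan:
        ok, observation = run_tool(command)
        if not ok:
            # Immediate failure -- the agent must revise this step NOW,
            # instead of discovering the problem after the final answer.
            return f"revise step: {observation}"
    return "plan succeeded"

print(agent_step(["calc 2+2"]))  # plan succeeded
print(agent_step(["calc 2+3"]))  # revise step: error: unknown command 'calc 2+3'
```

A math or writing task has no such loop: nothing interrupts a flawed chain of thought until the verdict on the final answer, which is why step-checking has to be learned the hard way there.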
The Conclusion: What Should We Do?
The paper concludes that both methods work, but they have different strengths:
- If you want efficiency: Go with the "Mix" method. It's faster, cheaper, and the different skills (Math, Code, Science) naturally boost each other without fighting.
- If you want stability: Go with the "Merge" method. It keeps the specific skills of each expert very sharp and prevents the AI from getting confused or "over-optimizing" for just one type of answer.
The Takeaway:
Building a general AI doesn't require choosing between "specialist" and "generalist." The paper shows that with the right training (Verifiable Rewards), an AI can learn to be a math genius, a coding wizard, and a helpful assistant all at once, without losing its mind. The key is understanding that while learning many things together is efficient, sometimes keeping the "specialist" brains separate and then blending them creates the most robust, reliable thinker.