To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
This paper introduces M2RL, a comprehensive study that compares mixed multi-task training with separate training followed by model merging for multi-domain Reinforcement Learning with Verifiable Rewards (RLVR). Through extensive experiments, it shows that reasoning-intensive domains exhibit synergistic effects with minimal interference, and it offers mechanistic insights into these findings.