One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning

Imagine you are trying to teach a single robot to play 26 different video games at the same time. Some games are simple, like Pong (just moving a paddle left and right). Others are incredibly complex, like Seaquest (navigating a submarine, avoiding enemies, and managing oxygen).

If you try to teach this robot using a standard "one-size-fits-all" brain, something weird happens. The robot gets really good at Pong very quickly. But the easy games end up dominating what it learns, and the conflicting instructions from so many different games confuse it until it stops making progress on the harder ones. It's like a student trying to memorize the entire dictionary while simultaneously learning to play the piano; the brain gets overwhelmed, and the learning process "collapses."

This paper, titled "One Model for All Tasks," introduces a new robot brain called ScaleZero that solves this problem. Here is how it works, explained through simple analogies.

The Problem: The "Crowded Classroom"

Think of a standard AI model as a single classroom where all students (tasks) sit at the same desk.

The Issue: When the teacher (the learning algorithm) tries to give instructions, the loud, simple students (easy tasks like Pong) shout their answers first. The quiet, complex students (hard tasks like Seaquest) can't get a word in.
The Result: The teacher gets pulled in conflicting directions by the shouting (this is called gradient conflict). Eventually the shared brain goes "stiff" and can no longer adapt to new material (this is called plasticity collapse), so the robot stops improving on the hard games.

The Solution: ScaleZero (The "Specialized Workshop")

The authors realized that instead of forcing everyone to sit at one desk, you need a Mixture of Experts (MoE).

Imagine the robot's brain is no longer one classroom, but a giant workshop with many specialized stations.

The Router (The Foreman): There is a smart foreman who looks at the task. If the robot needs to play Pong, the foreman sends the data to the "Paddle Station." If it needs to play Seaquest, it sends the data to the "Submarine Station."
The Experts: Each station has its own specialist who only works on that specific type of problem.
Why it works: The "Paddle Station" doesn't get distracted by the "Submarine Station." They don't shout over each other. This keeps the robot's brain flexible and able to learn new, difficult things without forgetting the old, easy things.

The Second Innovation: DPS (The "Smart Budget")

Even with specialized stations, there's a second problem: wasted resources.
Imagine you have a budget of 100 hours to train the robot. If you spend 50 hours training the robot on Pong (which it already mastered in 10 hours), you are wasting 40 hours.

The authors introduced a strategy called Dynamic Parameter Scaling (DPS). Think of this as a smart project manager:

Watch and Wait: The manager watches the robot. As soon as the robot masters Pong, the manager says, "Great! Stop training on Pong."
Expand Only When Needed: The manager takes the saved time and money and immediately builds a new specialized station for the next hard game the robot is struggling with.
Freeze the Past: The manager "freezes" the Pong station so it doesn't accidentally get messed up while the robot learns the new game.

This means the robot learns faster and needs less practice (fewer interactions with the game world) because it only spends time on the things it hasn't figured out yet.

The Results: A True "Generalist"

The team tested this new robot (ScaleZero) on three very different worlds:

Atari Games: 26 classic video games (visual, fast-paced).
DMC (DeepMind Control Suite): 18 robot control simulations (physics-based, continuous movement).
Jericho: 4 text-based adventure games (reading, logic, long planning).

The Outcome:

Performance: On Atari, a single ScaleZero robot scored higher on average than 26 separate robots that were each trained on only one game. It didn't just "get by"; it also made progress on hard exploration games where the baseline failed.
Efficiency: When using the "Smart Budget" (DPS) strategy on the DMC tasks, the robot reached comparable results with 28.5% fewer interactions. It was like a student who stops drilling the chapters they already know and spends that time on the ones they don't.

The Big Picture

This paper is a step toward Generalist AI: agents that aren't just good at one thing, but can learn many different kinds of tasks with a single model.

Old Way: Build a new brain for every new job.
ScaleZero Way: Build one flexible brain with specialized tools that turn on only when needed, managed by a smart system that knows exactly when to stop and when to start.

It's the difference between hiring a different specialist for every single problem versus hiring one brilliant project manager who can instantly assemble the perfect team for whatever challenge arises.

1. Problem Statement

The paper addresses the challenges of Heterogeneous Multi-Task Reinforcement Learning (MTRL) within the context of Unified World Models. While unified models (like UniZero) excel in single-task settings by learning a shared latent space for planning (via Monte Carlo Tree Search, MCTS), they struggle when applied to diverse task suites.

The authors identify two critical, intertwined obstacles that prevent a single model from mastering diverse tasks:

Representational Bottlenecks and Plasticity Collapse: In diverse MTRL settings, simpler tasks often dominate the gradient updates, suppressing learning signals for complex tasks. This leads to "plasticity collapse," where the model loses the ability to adapt to new data. Empirically, this manifests as a high ratio of "dormant neurons" (inactive neurons) and uncontrolled inflation of the latent state norm, causing catastrophic performance drops on complex tasks (e.g., Seaquest in Atari) despite success on simpler ones (e.g., Pong).
Static Resource Allocation: Conventional architectures apply a uniform "one-size-fits-all" forward pass and update strategy. This wastes computational resources on tasks that have already converged while under-allocating resources to difficult, unsolved tasks, leading to poor sample efficiency.
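
The two symptoms of plasticity collapse named above can be tracked with simple statistics. The summary does not say how they are computed, so the following is only a minimal sketch under an assumption: a neuron counts as dormant when its mean absolute activation, divided by the layer average, is at or below a threshold `tau`. The function names and the value of `tau` are illustrative and are not taken from the paper.

```python
import math


def dormant_ratio(activations, tau=0.025):
    """Fraction of neurons in one layer that are (nearly) silent.

    activations: list of rows, one row per input sample; each row holds the
    layer's post-activation values, one per neuron.
    tau: illustrative threshold (an assumption, not a value from the paper).
    """
    n_inputs = len(activations)
    n_neurons = len(activations[0])
    # Mean absolute activation of each neuron over the batch.
    mean_abs = [
        sum(abs(row[j]) for row in activations) / n_inputs
        for j in range(n_neurons)
    ]
    layer_mean = sum(mean_abs) / n_neurons
    if layer_mean == 0.0:
        return 1.0  # every neuron is silent
    # Normalise by the layer average so the score is scale-free.
    scores = [m / layer_mean for m in mean_abs]
    return sum(s <= tau for s in scores) / n_neurons


def mean_latent_norm(latents):
    """Average L2 norm of a batch of latent state vectors."""
    return sum(math.sqrt(sum(x * x for x in z)) for z in latents) / len(latents)


# Toy batch: neurons 1 and 3 barely fire, so half the layer is dormant.
acts = [[0.9, 0.0, 1.1, 0.0],
        [1.0, 0.0, 0.8, 0.01]]
print(dormant_ratio(acts))
print(mean_latent_norm([[3.0, 4.0], [6.0, 8.0]]))
```

Logged over training, a rising dormant ratio together with a growing latent norm would be the warning pattern described above.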

2. Methodology

The authors propose ScaleZero, a unified world model, and a training strategy called Dynamic Parameter Scaling (DPS) to address these issues from architectural and procedural perspectives.

A. ScaleZero Architecture

ScaleZero is built upon the UniZero framework but introduces specific architectural modifications to mitigate gradient conflicts and plasticity collapse:

Mixture-of-Experts (MoE) Backbone: The core innovation is replacing the dense feed-forward networks in the Transformer backbone with sparse MoE layers.
- Instead of a single dense network processing all inputs, the model uses a gating mechanism to route task-specific tokens to a subset of specialized "expert" sub-networks (e.g., 1 shared expert + 8 specialized experts).
- Mechanism: This conditional computation allows different tasks to use distinct parameter subsets, which isolates their gradient updates and reduces interference. The authors' theoretical analysis (Theorem 5.1) shows that this strictly lowers the upper bound on gradient conflict compared to dense architectures, provided the router specializes effectively.
Vision Transformer (ViT) Encoder: Replaces the ResNet-like encoder to provide scalable and powerful feature extraction, particularly for visual domains.
Standard Layer Normalization: Replaces SimNorm (used in UniZero) to balance stability with representational expressiveness, avoiding the over-constraint that limits capacity in highly heterogeneous settings.
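
The routing idea behind the MoE backbone can be sketched in a few lines. This is a toy illustration, not the authors' implementation: each expert is reduced to a single linear map instead of a full feed-forward network, the top-2 routing is an assumed setting, and the class and helper names are hypothetical. Only the "1 shared expert + 8 specialized experts" layout comes from the summary above.

```python
import math
import random


def make_linear(rng, d_in, d_out):
    """Random weight matrix with d_out rows and d_in columns."""
    return [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    total = sum(es)
    return [e / total for e in es]


class SparseMoE:
    """One shared expert plus n_experts routed experts (toy version)."""

    def __init__(self, dim, n_experts=8, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k  # assumed value; the summary does not state it
        self.router = make_linear(rng, dim, n_experts)
        self.shared = make_linear(rng, dim, dim)
        self.experts = [make_linear(rng, dim, dim) for _ in range(n_experts)]

    def forward(self, token):
        # The router scores every expert, but only the top_k are evaluated.
        logits = matvec(self.router, token)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[:self.top_k]
        gates = softmax([logits[i] for i in chosen])
        out = matvec(self.shared, token)  # the shared expert sees every token
        for gate, i in zip(gates, chosen):
            expert_out = matvec(self.experts[i], token)
            out = [o + gate * v for o, v in zip(out, expert_out)]
        return out, chosen


layer = SparseMoE(dim=4)
output, chosen = layer.forward([1.0, 0.0, -1.0, 0.5])
print(chosen)  # indices of the experts this token was routed to
```

Because only the chosen experts contribute to the output, only they would receive gradient from this token, which is the isolation effect the mechanism bullet describes.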

B. Dynamic Parameter Scaling (DPS)

To address static resource allocation, the authors introduce DPS, an online, adaptive training strategy that couples model capacity with learning progress:

Adaptive Task Curation: The system maintains an "active set" of unsolved tasks. Once a task's performance exceeds a predefined threshold, it is marked as "solved," and computational resources (data collection and gradient updates) are no longer spent on it.
Progressive Capacity Expansion: The training proceeds in stages.
- Stage 0 (Warm-up): Trains a shared base model on all tasks.
- Expansion Stages ($s \ge 1$): As tasks are solved or new complexity is encountered, new LoRA (Low-Rank Adaptation) adapters are injected into the model.
- Parameter Isolation: When a new stage begins, previous parameters (base model and prior adapters) are frozen. Only the newly added LoRA module and learnable scaling factors are optimized. This prevents catastrophic forgetting and negative transfer while allowing the model to grow its capacity only when necessary.
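
The bookkeeping behind DPS, as described above, can be sketched as follows. This models only which tasks stay active and which parameter groups stay trainable; it does not train anything. The class names, the threshold values, and the rule for when to open a new stage are assumptions made for illustration, since the summary does not give the exact trigger.

```python
from dataclasses import dataclass, field


@dataclass
class Adapter:
    """Stand-in for one LoRA module; only its trainable flag is modelled."""
    stage: int
    trainable: bool = True


@dataclass
class DPSState:
    thresholds: dict                      # task name -> score that counts as solved
    active: set = field(default_factory=set)
    adapters: list = field(default_factory=list)
    base_trainable: bool = True           # stage 0 trains the shared base model

    def __post_init__(self):
        self.active = set(self.thresholds)

    def record_scores(self, scores):
        """Drop tasks whose score exceeds their threshold; return them."""
        solved = {
            task for task in self.active
            if scores.get(task, float("-inf")) > self.thresholds[task]
        }
        self.active -= solved
        return solved

    def expand(self):
        """Open a new stage: freeze everything so far, add one new adapter."""
        self.base_trainable = False
        for adapter in self.adapters:
            adapter.trainable = False
        self.adapters.append(Adapter(stage=len(self.adapters) + 1))
        return self.adapters[-1]


# Hypothetical thresholds and scores, for illustration only.
state = DPSState({"pong": 20.0, "seaquest": 1000.0})
solved = state.record_scores({"pong": 20.5, "seaquest": 310.0})
if solved and state.active:   # assumed trigger for opening a new stage
    state.expand()
print(sorted(state.active), [a.trainable for a in state.adapters])
```

After the call, Pong no longer receives data collection or updates, the base model is frozen, and the single new adapter is the only trainable component.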

3. Key Contributions

Quantitative Diagnosis of Plasticity Collapse: The paper provides empirical evidence linking performance degradation in unified world models to internal dynamics: specifically, the correlation between performance collapse, a spike in the dormant neuron ratio, and latent state norm inflation.
ScaleZero Model: A novel unified world model architecture that integrates a sparse MoE backbone. The ablations indicate that conditional computation was the most effective of the architectural changes tested for overcoming representational interference in MTRL.
Dynamic Parameter Scaling (DPS): A novel online training curriculum that dynamically allocates model capacity using LoRA adapters. It achieves competitive performance while significantly reducing the total number of environment interactions required.
Theoretical Analysis: The authors provide a theoretical proof (Theorem 5.1) demonstrating that the upper bound of gradient conflict in an MoE layer is strictly lower than that of a dense layer, provided the router effectively specializes tasks.
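
Gradient conflict is commonly measured by the cosine similarity between two tasks' gradients, with negative values meaning the tasks pull the shared parameters in opposing directions. The toy example below uses made-up gradient vectors and shows only the limiting case of perfect routing, where the two tasks touch disjoint parameters. It illustrates the intuition and does not reproduce the paper's proof.

```python
import math


def cosine(g1, g2):
    """Cosine similarity between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2) if n1 and n2 else 0.0


# Dense layer: both tasks update the same four parameters.
dense_pong = [1.0, -2.0, 0.5, 1.0]
dense_seaquest = [-1.0, 2.0, 0.5, -1.0]

# Routed layer: parameters are laid out as [expert A | expert B], and each
# task's gradient is non-zero only on the expert it was routed to.
moe_pong = [1.0, -2.0, 0.0, 0.0]
moe_seaquest = [0.0, 0.0, 0.5, -1.0]

print(cosine(dense_pong, dense_seaquest))  # about -0.92: strong conflict
print(cosine(moe_pong, moe_seaquest))      # 0.0: no shared parameters
```

In practice routing is imperfect and a shared expert is still updated by every task, which is why the result is stated as a bound that depends on how well the router specializes.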

4. Experimental Results

The method was evaluated on three diverse benchmarks: Atari (26 games), DeepMind Control Suite (18 continuous control tasks), and Jericho (4 text-based adventure games).

Performance vs. Specialists:
- Atari: ScaleZero (single model) achieved a higher mean Human-Normalized Score (HNS) than the average of 26 individually trained single-task UniZero agents. It notably solved difficult exploration tasks where the baseline failed.
- DMC: ScaleZero achieved a superior median score across 18 tasks, indicating robust generalization rather than just excelling on easy tasks.
- Jericho: ScaleZero performed on par with specialized single-task agents and was competitive with strong language-model-based baselines (CALM+OC), suggesting the approach is not tied to a single modality.
Sample Efficiency (DPS):
- When augmented with DPS on the DMC benchmark, the model achieved performance comparable to single-task agents while using only 71.5% of the environment interactions (a 28.5% reduction in data sampling and training cost).
Ablation Studies:
- Replacing the dense backbone with MoE yielded the most significant performance gains.
- Explicit task conditioning (concatenating task embeddings) provided marginal benefits compared to the architectural changes.
- Gradient correction methods (like MoCo) introduced high computational overhead with inconsistent gains.
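
For readers unfamiliar with the Atari metric, the Human-Normalized Score rescales each game's raw score so that 0 corresponds to a random policy and 1 to the human reference. The sketch below uses the standard definition; the game names and numbers are placeholders and are not results from the paper.

```python
from statistics import mean, median


def human_normalized(agent, random_score, human_score):
    """Standard HNS: 0.0 matches a random policy, 1.0 matches the human reference."""
    return (agent - random_score) / (human_score - random_score)


# Placeholder (agent, random, human) scores, not taken from the paper.
games = {
    "game_a": (15.0, -20.0, 15.0),
    "game_b": (500.0, 100.0, 2100.0),
    "game_c": (9000.0, 0.0, 3000.0),
}
hns = {name: human_normalized(*scores) for name, scores in games.items()}

print(mean(hns.values()))    # pulled upward by the single outlier game_c
print(median(hns.values()))  # less sensitive to outliers
```

The gap between the two aggregates is why a median, as reported for DMC above, is often read as the stronger evidence of broad competence: a mean can be inflated by a few games with very high scores.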

5. Significance

This work represents a significant step toward generalist agents capable of learning across heterogeneous domains (visual, proprioceptive, and linguistic) without task-specific retraining.

Architectural Insight: It provides evidence that sparsity (MoE) is critical for maintaining plasticity in multi-task world models, mitigating the "plasticity collapse" problem that plagues dense shared architectures.
Efficiency: The DPS strategy offers a practical solution to the "static resource" problem, demonstrating that models can be scaled up dynamically only when needed, which yields substantial savings in computation and environment interactions.
Scalability: By combining MoE for specialization and LoRA for dynamic expansion, the paper provides a blueprint for building scalable, efficient, and robust generalist planning agents.

The code is available at https://github.com/opendilab/LightZero.
