Feed m Birds with One Scone: Accelerating Multi-task Gradient Balancing via Bi-level Optimization

This paper introduces MARIGOLD, a unified bi-level optimization framework that uses zeroth-order methods to solve multi-task learning problems efficiently. By dynamically balancing task weights without computing every task's gradient, it overcomes the computational inefficiency of existing MGDA-type approaches.

Xuxing Chen, Yun He, Jiayi Xu, Minhui Huang, Xiaoyi Liu, Boyang Liu, Fei Tian, Xiaohan Wei, Rong Jin, Sem Park, Bo Long, Xue Feng

Published Tue, 10 Ma

Here is an explanation of the paper "Feed m Birds with One Scone" (MARIGOLD) using simple language and creative analogies.

The Big Picture: The "Too Many Cooks" Problem

Imagine you are a chef (the AI model) trying to cook a single meal that satisfies five different guests (the tasks).

  • Guest A wants it spicy.
  • Guest B wants it sweet.
  • Guest C wants it salty.
  • Guest D wants it bland.
  • Guest E wants it crunchy.

If you just follow Guest A's instructions, you ruin the meal for Guest B. If you try to please everyone at once without a plan, you end up with a flavorless, mushy disaster. This is the core problem of Multi-Task Learning (MTL): how do you train one AI to do many different things well without the instructions for one thing messing up the others?
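The gradient conflict can be seen in a tiny toy example. Below, two made-up quadratic "taste" losses (nothing from the paper; the functions and step size are invented for illustration) pull a single shared parameter in opposite directions, so a step that helps one guest provably hurts the other:

```python
# Two hypothetical task losses over one shared parameter w.
def loss_a(w):        # Guest A is happiest at w = +1
    return (w - 1.0) ** 2

def loss_b(w):        # Guest B is happiest at w = -1
    return (w + 1.0) ** 2

def grad_a(w):
    return 2.0 * (w - 1.0)

def grad_b(w):
    return 2.0 * (w + 1.0)

w = 0.0                        # the current "recipe"
w_next = w - 0.1 * grad_a(w)   # follow only Guest A's instructions

print(loss_a(w_next) < loss_a(w))  # True: A is happier...
print(loss_b(w_next) > loss_b(w))  # True: ...but B is worse off
```

This is the whole dilemma in two lines: each task's gradient is locally "right" for that task and locally wrong for another.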

The Old Way: The "All-Hands Meeting" (MGDA)

In the past, the best way to solve this was a method called MGDA (Multiple Gradient Descent Algorithm).

Imagine that to decide what to cook next, the chef calls a meeting with five sous-chefs (the gradients). Each sous-chef holds a clipboard with a detailed report on exactly how the current dish is failing their specific guest.

  • The chef has to read all five reports, compare them, calculate a complex compromise, and then decide on the next step.

The Problem: This is incredibly slow. If you have 100 guests (tasks), the chef has to read 100 reports before taking a single step. In the world of AI, this means the computer has to do a massive amount of math for every single update, making training take forever and requiring huge amounts of memory.
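To make the bottleneck concrete, here is a minimal sketch of the MGDA-style loop. Everything is toy and illustrative: quadratic losses stand in for real networks, an analytic gradient stands in for a backward pass, and a plain average replaces the min-norm weighting MGDA actually computes (which solves a small quadratic program over the gradients). The point is only the counter: the gradient work per step scales with the number of tasks.

```python
import random

m = 100                      # number of tasks ("guests")
dim = 5                      # tiny model for illustration
targets = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(m)]

def task_grad(w, t):
    # Gradient of 0.5 * ||w - target_t||^2 -- a stand-in for one
    # full backward pass through the model for task t.
    return [wi - ti for wi, ti in zip(w, targets[t])]

w = [0.0] * dim
backward_passes = 0
for step in range(3):
    grads = []
    for t in range(m):       # MGDA must collect ALL m "reports"...
        grads.append(task_grad(w, t))
        backward_passes += 1
    # ...then combine them (here a plain average, as a placeholder for
    # MGDA's min-norm weighting).
    combined = [sum(g[i] for g in grads) / m for i in range(dim)]
    w = [wi - 0.1 * ci for wi, ci in zip(w, combined)]

print(backward_passes)       # 300: 3 steps x 100 tasks
```

With a real model, each of those 300 "reports" is a full backward pass, which is exactly the cost MARIGOLD is designed to avoid.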

The New Solution: MARIGOLD (The "One Scone" Strategy)

The authors of this paper propose a new method called MARIGOLD. Their title, "Feed m Birds with One Scone," is a play on the phrase "kill two birds with one stone." They want to feed many birds (tasks) with just one scone (computation).

Here is how MARIGOLD works, using a simple analogy:

1. The Two-Level Game (Bi-Level Optimization)

Instead of treating the "cooking" and the "planning" as one giant, messy job, MARIGOLD splits them into two levels:

  • The Lower Level (The Cooking): The chef actually cooks the dish (updates the model) based on a current recipe.
  • The Upper Level (The Planning): A manager watches the cooking and asks, "Is this recipe making everyone happy? If not, how should we tweak the recipe weights?"

In the old days, the manager had to wait for the chef to finish cooking, then read all the sous-chefs' reports to adjust the recipe. MARIGOLD makes this a continuous loop where the manager and chef talk constantly.
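The two-level loop can be sketched as follows. This is an illustration of the bi-level idea, not MARIGOLD's actual update rules: the losses, step sizes, and the manager's weight-adjustment rule (shift weight toward whichever task is currently doing worse) are all made up for the example.

```python
def losses(w):                       # two toy tasks with opposite optima
    return [(w - 1.0) ** 2, (w + 1.0) ** 2]

def grads(w):
    return [2.0 * (w - 1.0), 2.0 * (w + 1.0)]

w = 0.5                              # the dish (model parameters)
weights = [0.5, 0.5]                 # the recipe (task weights)
for _ in range(50):
    # Lower level (the cooking): one weighted gradient step on the model.
    g = sum(a * gi for a, gi in zip(weights, grads(w)))
    w -= 0.1 * g
    # Upper level (the planning): nudge weight toward the worse-off task.
    la, lb = losses(w)
    weights[0] += 0.01 * (la - lb)
    weights[1] -= 0.01 * (la - lb)

print(abs(w) < 0.1)                  # True: settles near the fair compromise w = 0
```

Because the two levels alternate every step, the manager never has to wait for the chef to "finish cooking" before adjusting the recipe.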

2. The Magic Trick: "Zeroth-Order" (Feeling the Heat)

This is the most important part. The old methods required the manager to read the exact math reports from every single sous-chef (calculating all gradients).

MARIGOLD uses a trick called Zeroth-Order Optimization. Instead of reading the reports, the manager just feels the result.

  • The Analogy: Imagine the chef is cooking a soup. Instead of asking 5 people to taste it and write down exactly how much salt is needed, the manager just adds a tiny pinch of salt, tastes the soup, and asks, "Is it better or worse?"
  • If it's better, keep going that way. If it's worse, go the other way.

The manager doesn't need to know the exact chemical composition of the soup (the complex gradients of every task). They just need to know if the overall situation improved or got worse. This lets the computer skip the heavy math of reading 100 reports (one backward pass per task) and get by with a roughly constant cost per step: one backward pass for the model plus a few cheap forward "tastes."
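In math terms, the "pinch of salt" is a finite-difference gradient estimate: perturb in a random direction, re-taste, and step based only on the change in a single scalar score. The sketch below applies this to a made-up "overall happiness" score over the task weights; the score function, perturbation size, and step size are all illustrative, not the paper's.

```python
import random

random.seed(0)

def overall_score(weights):
    # Hypothetical stand-in for "how unhappy are all guests overall":
    # lowest when the two weights are balanced at 0.5 each.
    # Crucially, only its VALUE is ever used -- never its gradient.
    return (weights[0] - 0.5) ** 2 + (weights[1] - 0.5) ** 2

weights = [0.9, 0.1]
mu = 1e-3                                  # size of the "pinch of salt"
for _ in range(200):
    u = [random.gauss(0, 1) for _ in weights]          # random direction
    bumped = [wi + mu * ui for wi, ui in zip(weights, u)]
    # Better or worse after the pinch? (forward difference)
    delta = (overall_score(bumped) - overall_score(weights)) / mu
    # Step against the estimated slope along direction u.
    weights = [wi - 0.05 * delta * ui for wi, ui in zip(weights, u)]

print(overall_score(weights) < 0.01)       # True: tasting alone balances the weights
```

Two score evaluations per step, no matter how many tasks there are, is what replaces the m gradient "reports" of the old approach.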

Why is this a Big Deal?

  1. Speed: The old method was like reading a 100-page book before making a decision. MARIGOLD is like glancing at the cover. Per update, it cuts the gradient work from growing with the number of tasks (100 tasks means 100 backward passes) to roughly the cost of a single backward pass, independent of how many tasks you have.
  2. Flexibility: The old methods were picky about how you cooked (they only worked with specific types of math updates). MARIGOLD works with any "chef" (optimizer), including the popular Adam optimizer used in most modern AI.
  3. Real-World Results: The authors tested this on:
    • Public Datasets: Like teaching an AI to recognize objects and depth in images at the same time. MARIGOLD was faster and often more accurate than the old methods.
    • Industrial Data: They tested it on a massive Meta advertising system. Even with millions of users and complex goals, MARIGOLD improved the system's ability to predict clicks and conversions better than the standard "equal weight" approach.

Summary

The Problem: Training AI to do many things at once is slow because it has to check every single task's progress individually, like a teacher grading 100 essays one by one before moving to the next class.

The Solution (MARIGOLD): Instead of grading every essay individually, the teacher takes a quick "sample" of the class's overall performance to decide how to adjust the lesson plan.

The Result: You get the same (or better) quality of learning, but you do it 10 to 100 times faster and with much less computer memory. It's the difference between a slow, heavy truck and a nimble sports car that can carry the same load.