Efficient Compositional Multi-tasking for On-device Large Language Models

This paper addresses the challenge of compositional multi-tasking in resource-constrained on-device Large Language Models by introducing a new benchmark for simultaneous multi-task execution and proposing an efficient "Learnable Calibration" method to enable high-performance task merging beyond single-task scenarios.

Ondrej Bohdal, Mete Ozay, Jijoong Moon, Kyeng-Hun Lee, Hyeonmok Ko, Umberto Michieli

Published 2026-03-13

Imagine you have a very smart, but slightly small, robot assistant living inside your smartphone. This robot is great at one thing at a time: it can summarize a long article, or it can translate a text into Spanish, or it can write a funny reply to a friend.

But what happens when you need it to do two things at once? For example, you want to read a long news article, get a short summary of it, and have that summary translated into Spanish—all in one go.

This is the problem the paper tackles. Here is the story of how the authors solved it, explained simply.

The Problem: The "Do One Thing" Robot

Currently, if you want your phone to summarize and translate, you usually have to ask it to do them separately.

  1. "Hey robot, summarize this." (Robot does it).
  2. "Okay, now translate that summary." (Robot does it again).

This is like asking a chef to chop the vegetables, then stopping, washing the knife, and asking them to cook the soup. It works, but it's slow and wastes energy.

Alternatively, you could try to teach the robot a brand new "Super Skill" that combines both. But your phone has very limited storage space (like a tiny backpack). You can't carry a new, heavy backpack for every single combination of tasks (Summarize+Translate, Summarize+French, Reply+Professional Tone, etc.). There are too many combinations!

The Old Solution: The "Smoothie" Mistake

Researchers tried a method called Model Merging. Imagine you have two different "expert" robots:

  • Robot A is a master of Summarizing.
  • Robot B is a master of Translating.

The old idea was to take Robot A and Robot B, dump their brains into a blender, and mix them together to make a "Super Robot."

  • The Issue: When you blend them, the instructions get confused. The Super Robot might try to summarize and translate at the same time, but it ends up doing a bad job at both. It's like blending a hammer and a screwdriver; you get a weird tool that can't hammer well or screw well.
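Stripped of the smoothie metaphor, the simplest form of model merging is just averaging the weight changes ("deltas") that each expert applies to the shared base model. Here is a toy sketch in Python; the tiny three-number vectors and the variable names are purely illustrative, not the paper's actual parameters:

```python
# Each "expert" adapter is represented by the weight change (delta) it
# applies to the shared base model. Tiny toy vectors stand in for the
# millions of real parameters.
base = [0.10, -0.20, 0.30]
summarize_delta = [0.05, 0.00, -0.10]   # expert A: summarization skill
translate_delta = [-0.02, 0.08, 0.04]   # expert B: translation skill

# Naive model merging: average the two deltas and add them to the base.
merged = [b + 0.5 * (s + t)
          for b, s, t in zip(base, summarize_delta, translate_delta)]

print(merged)  # a single compromise model, exactly neither expert
```

Because the averaged weights sit halfway between the two experts, the merged model is exactly neither of them, which is the "confused Super Robot" problem described above.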

The New Solution: "Learnable Calibration"

The authors of this paper came up with a clever trick called Learnable Calibration.

Think of your phone's robot as a base car (like a standard Toyota Camry).

  • You already have specialized kits attached to it: a "Summarizing Kit" and a "Translating Kit." These are small, efficient add-ons (called Adapters or LoRAs) that you already own.

Instead of building a whole new car or smashing the kits together, the authors propose adding a tiny, customizable dashboard between the driver and the engine.

  1. The Setup: You take the existing Summarizing Kit and the Translating Kit and attach them to the car.
  2. The Calibration: You add a very small, smart "tuning knob" (the Learnable Calibration). This knob is tiny—so small it barely takes up any space in your backpack.
  3. The Magic: This knob learns how to tell the engine, "Hey, when I ask for a summary and a translation, don't just do them one after the other. Blend the instructions so the car drives smoothly in both directions at once."
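The three steps above can be pictured in code as two frozen adapters running in the same forward pass, with a tiny learnable parameter deciding how to blend their outputs. This is a deliberately simplified illustration of the concept; the paper's actual calibration module is richer than the single `alpha` used here, and the adapters are stand-in functions rather than real LoRAs:

```python
def lora_out(x, scale):
    # Stand-in for a frozen LoRA adapter: a fixed transform of the input.
    return [scale * v for v in x]

def calibrated_forward(x, alpha):
    # Both task adapters run on the same input in one pass; the tiny
    # learnable parameter `alpha` decides how to blend their outputs.
    summ = lora_out(x, 0.8)    # frozen "summarize" adapter
    trans = lora_out(x, 1.2)   # frozen "translate" adapter
    return [alpha * s + (1 - alpha) * t for s, t in zip(summ, trans)]

# Only `alpha` would be trained; both adapters stay frozen, so the extra
# storage per task combination is a handful of numbers, not a new model.
print(calibrated_forward([1.0, 2.0], alpha=0.5))
```

The key design point is that everything except the calibration parameters stays frozen, which is why each new task combination costs almost no extra storage.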

Why is this a Big Deal?

  • Speed: It happens in one single step (one "inference pass"). The car drives straight to the destination without stopping.
  • Space: The "tuning knob" is incredibly small. You can have a different knob for every combination of tasks without filling up your phone's memory.
  • Performance: Its output quality is close to the slow approach of running the tasks one after another.

The "Benchmark" (The Driving Test)

To prove this works, the researchers built a Driving Test (a benchmark). They created four specific scenarios to test their new method:

  1. Summarize + Translate: "Read this long story and give me the short version in Spanish."
  2. Summarize + Tone Change: "Read this story and give me the short version, but make it sound very professional."
  3. Reply + Translate: "Write a reply to this text, but send it in French."
  4. Reply + Tone Change: "Write a reply, but make it sound funny/witty."
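Each benchmark example boils down to a single instruction that demands both skills at once. The sketch below shows how the four scenarios could be represented as data; the prompt templates are hypothetical placeholders, not the paper's exact wording:

```python
# Hypothetical prompt templates for the four composite tasks; the exact
# wording used in the paper's benchmark may differ.
COMPOSITE_TASKS = {
    ("summarize", "translate"): "Summarize the text below in {language}.",
    ("summarize", "tone"): "Summarize the text below in a {tone} tone.",
    ("reply", "translate"): "Write a reply to the text below in {language}.",
    ("reply", "tone"): "Write a reply to the text below in a {tone} tone.",
}

def build_example(task_pair, text, **kwargs):
    # One benchmark example: a single instruction that requires both
    # skills in one pass, rather than two separate requests.
    instruction = COMPOSITE_TASKS[task_pair].format(**kwargs)
    return {"instruction": instruction, "input": text}

example = build_example(("summarize", "translate"),
                        "A long news article...", language="Spanish")
print(example["instruction"])  # Summarize the text below in Spanish.
```

Framing the test this way matters: a model that can only do one task per request will fail these examples even if it is excellent at each task in isolation.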

They tested their "Tuning Knob" method against the old "Blender" methods and the "Do it twice" methods.

The Result

The "Tuning Knob" (Learnable Calibration) won.

  • It was fast (one step).
  • It was light (tiny storage).
  • It was smart (it actually understood how to do both tasks together).

The Takeaway

This paper gives us a blueprint for making our phones smarter without making them slower or heavier. It allows our small, on-device AI assistants to juggle multiple complex tasks at once—like summarizing a document while translating it—by using a tiny, smart "tuning knob" to harmonize the skills they already have.

It's the difference between asking a friend to do two chores separately and teaching them how to do both chores simultaneously while humming a tune.
