CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning

The paper proposes CoME, a novel mobile agent architecture that employs four specialized experts with a progressive training strategy and an InfoGain-Driven DPO method to achieve balanced, decoupled enhancement of hybrid reasoning capabilities, outperforming existing dense and MoE approaches on AITZ and AMEX datasets.

Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao, Wei Liu, Jian Luan, Shuo Shang, Bo Du, Ji-Rong Wen, Rui Yan

Published 2026-03-09

Imagine you are trying to teach a robot to use your smartphone to do something complex, like "Book a flight to Rome for next Saturday, but only non-stop flights."

To do this, the robot doesn't just guess. It has to think through a chain of steps:

  1. Look: What is on the screen right now? (Screen Summary)
  2. Plan: What are the smaller steps needed? (Subtask Plan)
  3. Decide: Which button should I press? (Action Decision)
  4. Act: Actually click the right spot on the screen. (Action Function)

The problem with current robot "brains" (AI models) is that they are like a generalist chef trying to do everything at once. They might be great at chopping vegetables (screen reading) but terrible at baking a cake (clicking the right button). Or, if you try to make them specialists, they get confused about when to switch hats.

This paper introduces CoME (Channel-of-Mobile-Experts), a new way to build these robot brains. Here is how it works, using some simple analogies:

1. The "Specialized Team" vs. The "Generalist"

Think of a standard AI model as a Swiss Army Knife. It has one blade that tries to do everything. It's okay at many things, but not great at any specific thing.

CoME is like a highly organized construction crew with four distinct specialists:

  • The Architect: Only looks at the blueprints (Screen Summary).
  • The Foreman: Only figures out the schedule (Subtask Plan).
  • The Decision Maker: Only chooses which tool to use (Action Decision).
  • The Worker: Only swings the hammer (Action Function).

In the past, these specialists were all mixed up in one big brain. CoME separates them into four distinct "channels" or experts.

2. The Magic Switch: "Output-Oriented Activation"

Here is the tricky part. In a normal team, you might ask everyone to listen to the input (the user's command) and then decide who speaks. But CoME does something smarter.

Imagine a conductor in an orchestra.

  • Old way (MoE): The conductor glances at the sheet music (the input) and picks whoever seems relevant, token by token, so several musicians can end up half-playing at once.
  • CoME way: The conductor looks at the moment in the song (the reasoning stage). If the song is at the "drum solo" part, the conductor lets only the drummer play, even if the sheet music has notes for everyone.

CoME uses Output-Oriented Activation. It knows exactly which stage of thinking the robot is in. If the robot is currently "planning," CoME silences the other three experts and lets the "Planner" do all the talking. If it's time to "click," it switches to the "Worker." This prevents the robot from getting confused or trying to do two things at once.
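The contrast with input-based gating can be made concrete with a toy hard router. This is a sketch of the *idea* of output-oriented activation, with made-up expert functions standing in for the real channels; the paper's actual routing operates inside a neural network, not on strings:

```python
# Toy experts: each one "owns" exactly one reasoning stage.
# The lambdas just wrap their input so we can see the call chain.
EXPERTS = {
    "screen_summary":  lambda x: f"summary({x})",
    "subtask_plan":    lambda x: f"plan({x})",
    "action_decision": lambda x: f"decide({x})",
    "action_function": lambda x: f"act({x})",
}

STAGE_ORDER = ["screen_summary", "subtask_plan",
               "action_decision", "action_function"]

def route(stage: str, context: str) -> str:
    """Hard selection: silence every expert except the stage's owner."""
    return EXPERTS[stage](context)

def agent_step(observation: str) -> list[str]:
    """One full step: each stage's output feeds the next stage."""
    context, outputs = observation, []
    for stage in STAGE_ORDER:
        context = route(stage, context)
        outputs.append(context)
    return outputs
```

Calling `agent_step("home_screen")` produces four outputs, ending in `act(decide(plan(summary(home_screen))))`: one expert speaks at a time, and the handoff order is dictated by the reasoning stage, not by the input.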

3. The Training: "The Three-Step Boot Camp"

You can't just give a team of specialists a job and expect them to work together perfectly. The authors designed a progressive training strategy (a step-by-step boot camp):

  • Step 1: Expert-FT (Specialization): They train each specialist separately. The Architect only learns to read screens; the Worker only learns to click. They become masters of their own craft.
  • Step 2: Router-FT (The Conductor): They train the "Conductor" (the router) to know exactly when to switch from the Architect to the Worker. It learns the rhythm of the task.
  • Step 3: CoT-FT (Teamwork): Finally, they let the whole team work together on full tasks, learning how to pass the baton smoothly without dropping it.
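The three-phase boot camp amounts to a curriculum over which parameter groups get trained when. The sketch below is one illustrative reading of that schedule (the phase names follow the article; the exact freeze/unfreeze policy is an assumption, not the paper's recipe):

```python
# Hypothetical freeze/unfreeze schedule for the three training phases.
# Returns the parameter groups that receive gradient updates.

def trainable_modules(phase: str) -> set[str]:
    if phase == "expert_ft":   # 1. specialize each expert on its own craft
        return {"experts"}
    if phase == "router_ft":   # 2. teach the conductor when to switch
        return {"router"}
    if phase == "cot_ft":      # 3. joint fine-tuning on full reasoning chains
        return {"experts", "router"}
    raise ValueError(f"unknown phase: {phase}")

CURRICULUM = ["expert_ft", "router_ft", "cot_ft"]
```

Training the experts before the router means the conductor learns to switch between already-competent specialists, rather than between musicians who are still learning their instruments.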

4. The Safety Net: "InfoGain-Driven DPO" (The Truth Detector)

Even with a great team, mistakes happen. If the Architect misreads the screen, the Foreman plans the wrong schedule, and the whole thing fails. This is called error propagation.

To fix this, the authors added a Truth Detector called Info-DPO.
Imagine you are grading a student's essay.

  • Old way: You only look at the final grade. If the answer is right, you give an A, even if the student got there by guessing or using bad logic.
  • CoME way (Info-DPO): You look at every paragraph. You ask: "Did this paragraph actually help the student get closer to the answer?"
    • If a step adds new, useful information (like a lightbulb turning on), it gets a positive score.
    • If a step is confusing, repetitive, or leads to a dead end (like spinning in circles), it gets a negative score.

The system then punishes the robot for taking those "spinning in circles" steps and rewards it for taking the "lightbulb" steps. This forces the robot to learn how to think correctly, not just what to guess.
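The "did this paragraph help?" idea can be sketched as scoring each reasoning step by how much it raises the model's confidence in the correct final action, then sorting steps into preferred and dispreferred examples for DPO. The numbers and the simple difference-based gain below are toy illustrations; the paper's actual information-gain estimator may differ:

```python
# Toy information-gain scoring for reasoning steps.
# p_before / p_after: probability of the correct final answer
# before and after the step (illustrative values, not real model outputs).

def info_gain(p_before: float, p_after: float) -> float:
    """How much closer did this step bring us to the right answer?"""
    return p_after - p_before

def label_steps(step_probs: list[float]) -> list[str]:
    """Mark each step 'chosen' (a lightbulb) or 'rejected' (spinning in circles),
    yielding preference pairs that a DPO-style objective can train on."""
    return ["chosen" if info_gain(before, after) > 0 else "rejected"
            for before, after in zip(step_probs, step_probs[1:])]
```

For example, `label_steps([0.1, 0.4, 0.35, 0.9])` labels the steps `["chosen", "rejected", "chosen"]`: the middle step lowered the probability of the correct answer, so it becomes a dispreferred example.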

The Result?

When they tested CoME on real-world tasks (like booking flights or navigating apps), it beat the "Swiss Army Knife" models and the other "Specialist" models.

  • It made fewer mistakes.
  • It was better at clicking the exact right button.
  • It used less computer memory (it was more efficient).

In short: CoME is like taking a chaotic group of generalists, turning them into a specialized team, hiring a perfect conductor to manage them, and giving them a strict teacher who grades every single step of their thinking process. The result is a mobile robot that actually knows how to use your phone.