ES-Merging: Biological MLLM Merging via Embedding Space Signals

Imagine you are trying to build the ultimate biological detective.

Currently, we have three brilliant specialists:

Molecule Mike: An expert who knows everything about chemical structures and drugs.
Protein Polly: An expert who understands how proteins (the body's building blocks) work.
Cell Charlie: An expert who knows how individual cells react to treatments.

The problem is, these three experts work in silos. If you ask Mike how a drug affects a specific cell, he might say, "I don't know, I only look at chemicals!" If you ask Charlie, he says, "I only look at cells." To solve complex medical mysteries (like "Will this drug cure this specific cancer cell?"), you need all three experts working together.

The Old Way: The "Blind Blend"

Previously, scientists tried to combine these experts by blending their brains (the learned numbers, or weights, inside each model). Imagine taking three different languages and averaging the words to create a new language. It's a "blind" approach: they looked only at the weights themselves (how large each number is, whether it is positive or negative) and tried to guess which parts were important without actually seeing how the experts thought about a problem.

This often resulted in a confused detective who knew a little bit about everything but was bad at connecting the dots.

The New Way: ES-Merging (The "Listening" Approach)

The paper introduces ES-Merging, a smarter way to combine these experts. Instead of just looking at the static weights, ES-Merging asks the experts to solve a test problem and listens to how they think.

Here is the step-by-step process using a creative analogy:

1. The "Probe" (The Test Question)

Imagine you hand a single, complex puzzle piece to all three experts at once. This puzzle piece contains a mix of a drug, a protein, and a cell.

The Old Way: Just looked at the experts' resumes.
ES-Merging: Watches how each expert's brain lights up as they process this specific puzzle piece.

2. The "Lightbulb" Moment (Embedding Space Signals)

As the experts process the puzzle, their internal "lightbulbs" (neural representations) glow differently.

Molecule Mike's brain glows very brightly when he sees the drug part of the puzzle.
Protein Polly's brain glows when she sees the protein part.
Cell Charlie's brain glows for the cell part.

The paper calls this the Embedding Space. It's like a map of how each expert internally represents the data. If an expert is truly specialized, its map looks very different from a generic model's map when it sees its specific topic.

3. The Two-Step Merging Strategy

ES-Merging uses two different "lenses" to decide how much to trust each expert:

Lens A: The Layer-by-Layer View (The "Big Picture")

Imagine the experts' brains are made of 30 floors. ES-Merging asks: "On which floors does Mike's brain change the most compared to a generic brain when looking at drugs?"
If the 10th floor is where Mike does his best drug analysis, ES-Merging gives Mike a high weight for that specific floor. It's like saying, "On the 10th floor, we'll let Mike drive the car."

Lens B: The Tiny Detail View (The "Micro-Adjustment")

Even on the 10th floor, not every neuron is equally important. Some neurons might be doing the heavy lifting, while others are just resting. ES-Merging zooms in to see exactly which tiny switches (parameters) are flipping for Mike.
It says, "On the 10th floor, we trust Mike's left-hand switches, but we trust Polly's right-hand switches."

4. The Final Union

By combining the Big Picture (which floors are important) and the Micro View (which specific switches are important), ES-Merging creates a Super-Detective.

This new model isn't just a blurry average.
It knows exactly when to listen to Mike, when to listen to Polly, and when to listen to Charlie.
It preserves the unique "voice" of each expert while teaching them to work together.

Why Does This Matter?

The paper tested this new detective on real-world medical problems, like predicting if a drug will stop a cancer cell or interact with a protein.

The Result: The ES-Merging detective outperformed the old "Blind Blend" methods.
The Surprise: It even beat models that were specifically trained (fine-tuned) for just one task! Usually, training a model for a specific task makes it better at that task but worse at others. ES-Merging managed to keep the "superpowers" of all three experts without losing them.

The Takeaway

Think of ES-Merging as a conductor for an orchestra.

The Old Way was like telling everyone to play the same note at the same volume. It sounded okay, but boring.
The New Way (ES-Merging) listens to the music being played. It knows exactly when the violin (Molecules) needs to be loud, when the cello (Proteins) needs to take the lead, and when the drums (Cells) should keep the rhythm.

By listening to the "music" (the embedding signals) rather than just looking at the sheet music (the parameters), the authors created a unified model that is smarter, more accurate, and ready to solve complex biological mysteries.

1. Problem Statement

Biological Multimodal Large Language Models (MLLMs) have emerged as powerful tools for scientific discovery, specializing in single modalities such as molecules, proteins, and cells. However, many critical scientific problems (e.g., drug-target interactions, protein-ligand binding) are inherently cross-modal, requiring the integration of knowledge from multiple modalities.

Current approaches face two main limitations:

Specialization Silos: Existing biological MLLMs are typically fine-tuned for a single modality, lacking the ability to perform cross-modal reasoning.
Ineffective Model Merging: While model merging offers a parameter-efficient way to combine specialized models, existing methods (e.g., TIES-Merging, Task Arithmetic) rely on input-agnostic parameter space heuristics (e.g., magnitude, sign, direction). These methods fail to capture the semantic nuances of modality specialization because they do not consider how the model processes specific input data. This leads to "input-blindness," resulting in poor cross-modal integration and degraded performance.
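To make "input-agnostic" concrete, here is a minimal NumPy sketch of two parameter-space baselines: plain averaging and Task Arithmetic (adding scaled task vectors, i.e. expert minus base, to the base weights). The toy vectors and the `scale` value are illustrative assumptions, not settings from the paper. Note that neither function ever sees an input example.

```python
import numpy as np

def average_merge(experts):
    """Uniform average of the expert weights."""
    return np.mean(experts, axis=0)

def task_arithmetic(base, experts, scale=1.0):
    """Add scaled task vectors (expert - base) to the base weights."""
    task_vectors = [w - base for w in experts]
    return base + scale * np.sum(task_vectors, axis=0)

# Toy 4-parameter "models"; each expert changed a different parameter.
base = np.zeros(4)
experts = [
    np.array([1.0, 0.0, 0.0, 0.0]),  # stands in for a molecule expert
    np.array([0.0, 2.0, 0.0, 0.0]),  # stands in for a protein expert
    np.array([0.0, 0.0, 3.0, 0.0]),  # stands in for a cell expert
]

merged_avg = average_merge(experts)
merged_ta = task_arithmetic(base, experts, scale=0.5)  # scale is a hand-picked assumption
```

Every parameter receives the same fixed coefficient regardless of which modality it serves, which is the limitation the embedding-based coefficients are meant to address.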

2. Methodology: ES-Merging

The authors propose ES-Merging (Embedding-Signal-based Merging), a framework that shifts the merging paradigm from parameter space signals to embedding space signals. The core insight is that input-aware representations encode rich, modality-specific specialization that can be measured to determine optimal merging coefficients.

The methodology consists of four key stages:

A. Probe Input Design

To elicit modality-specific responses, the authors design a probe input containing tokens from all target modalities (e.g., molecule, protein, cell tokens) concatenated with a textual prefix.

This input is forwarded through the Base LLM and each Specialized MLLM (e.g., Mol-LLaMA, Prot2Text-V2, Cell-o1).
The system extracts layer-wise embeddings for specific modality tokens from both the base and specialized models.
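The extraction step can be pictured as slicing a stack of hidden states by token position. The sketch below is only an illustration: the probe layout (`probe_tags`), the array shapes, and the random "hidden states" are all hypothetical stand-ins, since the source does not specify the exact probe format or model interface.

```python
import numpy as np

# Hypothetical probe layout: one modality tag per token position.
probe_tags = ["text", "text", "molecule", "molecule", "molecule",
              "protein", "protein", "cell"]

def modality_embeddings(hidden_states, tags, modality):
    """Select one modality's token embeddings at every layer.

    hidden_states: (n_layers + 1, seq_len, dim)
    returns:       (n_layers + 1, n_modality_tokens, dim)
    """
    mask = np.array([t == modality for t in tags])
    return hidden_states[:, mask, :]

# Random arrays stand in for the real per-layer outputs of two models.
rng = np.random.default_rng(0)
hidden_base = rng.normal(size=(5, len(probe_tags), 16))
hidden_expert = hidden_base + 0.1 * rng.normal(size=hidden_base.shape)

mol_base = modality_embeddings(hidden_base, probe_tags, "molecule")
mol_expert = modality_embeddings(hidden_expert, probe_tags, "molecule")
```

The paired arrays (`mol_base`, `mol_expert`) are the kind of per-layer, per-modality embeddings that the later stages compare.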

B. Layer-wise Global Merging Coefficients (Coarse-Grained)

This stage identifies which transformer layers contribute most to modality specialization.

Signal: The authors compute the Sliced Wasserstein Distance (SWD) between the distribution of embeddings from the Base LLM and the Specialized MLLM for a given modality.
Calculation: They measure the change in SWD between consecutive layers ( $d^l = SWD^l - SWD^{l-1}$ ). A large change indicates a layer where the specialized model significantly diverges from the base model in processing that modality.
Coefficient: These changes are normalized (Z-score) and aggregated across modalities to generate a layer-wise importance score, which is converted into a global merging coefficient ( $\alpha$ ) via Softmax.
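The following NumPy sketch walks through this stage on toy data. Several details are assumptions rather than the paper's specification: the Monte Carlo SWD estimator with 64 random projections, Z-scoring across layers within each model, and applying the Softmax across models at each layer. The cross-modality aggregation is omitted because each toy expert is paired with a single modality.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=64, seed=0):
    """Monte Carlo sliced 1-Wasserstein distance between two equal-sized point sets."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # random unit directions
    px = np.sort(x @ dirs.T, axis=0)  # 1-D projections, sorted per direction
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.mean(np.abs(px - py)))

def layerwise_alpha(swd):
    """swd: (n_models, n_layers + 1); column 0 is the input embedding layer."""
    d = np.diff(swd, axis=1)  # d^l = SWD^l - SWD^{l-1}
    z = (d - d.mean(axis=1, keepdims=True)) / (d.std(axis=1, keepdims=True) + 1e-8)
    e = np.exp(z - z.max(axis=0, keepdims=True))  # softmax over models (assumption)
    return e / e.sum(axis=0, keepdims=True)

# Toy data: each "expert" is the base embedding plus a layer-dependent offset.
rng = np.random.default_rng(0)
base = rng.normal(size=(5, 32, 8))  # (n_layers + 1, n_tokens, dim)
offsets = {
    "molecule": [0.0, 0.1, 1.0, 1.1, 1.2],  # diverges sharply at layer 2
    "protein": [0.0, 0.1, 0.2, 0.3, 1.5],   # diverges sharply at layer 4
}
swd = np.array([
    [sliced_wasserstein(base[l], base[l] + s) for l, s in enumerate(off)]
    for off in offsets.values()
])
alpha = layerwise_alpha(swd)  # (n_models, n_layers); each column sums to 1
```

In this toy setup the molecule expert gets most of the weight at the layer where its distance to the base jumps, and the protein expert gets most of the weight at the last layer.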

C. Element-wise Local Merging Coefficients (Fine-Grained)

This stage identifies specific parameter elements within a layer that are critical for specialization.

Signal: Instead of a distance between distributions, this stage uses the distance between individual token embeddings from the base and specialized models, measured as the Frobenius norm of their difference (an $L_2$ distance).
Calculation: The authors compute the gradient magnitude of this embedding distance with respect to each parameter element ( $\theta$ ). High gradients indicate that a specific parameter is highly sensitive to the modality-specific representation shift.
Coefficient: These sensitivity scores are normalized and converted into element-wise local coefficients ( $\beta$ ) via Softmax.
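A toy version of this stage can be written in closed form for a single linear layer, where the gradient of the embedding distance with respect to the weights is available analytically. Everything here is an illustrative assumption: the one-layer "model", the random probe tokens, the per-model max scaling used as the normalization, and the Softmax taken across models for each element. A real implementation would obtain the gradients by backpropagation through the full network.

```python
import numpy as np

def embedding_sensitivity(w_expert, w_base, tokens):
    """|dD/dW| for D = ||tokens @ (w_expert - w_base).T||_F, evaluated at w_expert."""
    diff = tokens @ (w_expert - w_base).T  # (n_tokens, d_out) embedding shift
    dist = np.linalg.norm(diff)            # Frobenius norm of the shift
    return np.abs(diff.T @ tokens) / (dist + 1e-12)  # same shape as W

def elementwise_beta(sensitivities):
    s = np.stack(sensitivities)  # (n_models, d_out, d_in)
    s = s / (s.max(axis=(1, 2), keepdims=True) + 1e-12)  # per-model scaling (assumption)
    e = np.exp(s - s.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)  # softmax over models (assumption)

rng = np.random.default_rng(0)
w_base = rng.normal(size=(3, 4))
w_mol = w_base + 0.1 * rng.normal(size=(3, 4))   # toy molecule expert
w_prot = w_base + 0.1 * rng.normal(size=(3, 4))  # toy protein expert
tokens_mol = rng.normal(size=(5, 4))             # toy molecule probe tokens
tokens_prot = rng.normal(size=(6, 4))            # toy protein probe tokens

sens_mol = embedding_sensitivity(w_mol, w_base, tokens_mol)
sens_prot = embedding_sensitivity(w_prot, w_base, tokens_prot)
beta = elementwise_beta([sens_mol, sens_prot])   # (n_models, d_out, d_in)
```

Each entry of `beta` says how much of that single weight element should come from each expert, based on how strongly the element drives the expert's embedding shift.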

D. Integration

The final merging coefficient ( $\lambda$ ) for each parameter is computed by multiplying the layer-wise global coefficient and the element-wise local coefficient, followed by renormalization across modalities:
$\lambda_{m_i}^{l,n} = \frac{\alpha_{m_i}^{l} \cdot \beta_{m_i}^{l,n}}{\sum_{j} \alpha_{m_j}^{l} \cdot \beta_{m_j}^{l,n}}$
This ensures that parameters are weighted based on both the global importance of the layer and the local sensitivity of the specific element.
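A small worked example may help with the arithmetic. The sketch below computes $\lambda$ for one layer with three experts and two parameter elements, using made-up values for $\alpha$ and $\beta$. The final `apply_merge` step is an assumption: the text above does not state how $\lambda$ is applied, so the sketch weights each expert's offset from the base model.

```python
import numpy as np

def integrate(alpha, beta):
    """alpha: (n_models,) for one layer; beta: (n_models, n_elements)."""
    prod = alpha[:, None] * beta                   # alpha * beta per model and element
    return prod / prod.sum(axis=0, keepdims=True)  # renormalize across models

def apply_merge(base, experts, lam):
    # Assumption: lambda weights each expert's offset from the base model.
    return base + np.sum(lam * (experts - base), axis=0)

alpha = np.array([0.6, 0.3, 0.1])       # layer-wise coefficients (toy values)
beta = np.array([[0.50, 0.2],
                 [0.25, 0.2],
                 [0.25, 0.6]])          # element-wise coefficients (toy values)
lam = integrate(alpha, beta)            # (n_models, n_elements)

base = np.zeros(2)
experts = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]])
merged = apply_merge(base, experts, lam)
```

For the first element the products are 0.3, 0.075, and 0.025, so after renormalization the first expert holds a weight of 0.75. For the second element the third expert's larger $\beta$ partly offsets its small $\alpha$, giving weights of 0.5, 0.25, and 0.25.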

3. Key Contributions

New Merging Paradigm: The paper introduces what the authors present as the first merging framework for biological MLLMs that derives coefficients from embedding space signals rather than static parameter statistics, addressing the "input-blindness" of prior methods.
Dual-Granularity Coefficients: It proposes a novel mechanism to estimate merging coefficients at two complementary levels:
- Layer-wise: Captures coarse-grained specialization shifts across the network depth.
- Element-wise: Captures fine-grained parameter sensitivity within layers.
Efficiency: The method is tuning-free (no backpropagation on downstream tasks) and computationally efficient, requiring only a single forward pass and one gradient computation step to determine coefficients.

4. Experimental Results

The authors evaluated ES-Merging by merging three specialized models (Molecule, Protein, Cell) into a unified model and testing on diverse biological benchmarks.

Instance-Varying Interaction Prediction:
- Tasks: Molecule-Protein interaction (BindingDB, BioSNAP) and Molecule-Cell interaction (DrugComb, GDSC2).
- Result: ES-Merging outperformed all existing merging baselines (e.g., TIES-Merging, EMR-Merging) and even surpassed a task-specific fine-tuned baseline (Avg. Merging + FT). This suggests that ES-Merging preserves the experts' reasoning capabilities better than fine-tuning, which can degrade cross-modal reasoning.
Target-Fixed Functionality Prediction:
- Tasks: CYP enzyme inhibition and substrate prediction.
- Result: ES-Merging achieved the best average performance, showing it can effectively integrate expert knowledge for specific biological functions without losing the structural understanding of the specialized models.
Ablation Studies:
- Combining both layer-wise and element-wise coefficients yielded the best results, confirming that integrating signals at different granularities is necessary for robust merging.
- Using only one type of coefficient still outperformed other merging baselines, validating the efficacy of embedding signals.
Computational Cost:
- ES-Merging is 3.4x to 6.1x more computationally efficient than test-time adaptation methods (like AdaMerging) and fine-tuning, as it avoids iterative gradient updates.

5. Significance

Scientific Discovery: ES-Merging provides a principled, efficient way to create unified biological foundation models capable of solving complex cross-modal problems (e.g., drug discovery) without the prohibitive cost of curating massive cross-modal instruction datasets.
Generalizability: While tested on biological modalities, the core principle—using input-aware embedding distributions to guide merging—is modality-agnostic and could be applied to other domains (e.g., vision-language, audio-text).
Interpretability: The method offers insights into where and how models specialize (via coefficient visualization), revealing that modality specialization is not uniform but concentrated in specific layers and parameter elements.

In conclusion, the paper's results indicate that embedding space signals are a stronger basis for merging specialized MLLMs than parameter space heuristics, enabling robust cross-modal knowledge composition that, in the reported experiments, outperforms both traditional merging heuristics and expensive fine-tuning strategies.