Massively Multimodal Foundation Models: A Framework for Capturing Interactions with Specialized Mixture-of-Experts

This paper proposes a novel framework for massively multimodal foundation models that enhances Mixture-of-Experts architectures by explicitly quantifying temporal dependencies between modalities and using them to guide interaction-aware routing. The result is improved performance and interpretability in complex, heterogeneous data settings.

Xing Han, Hsing-Huan Chung, Joydeep Ghosh, Paul Pu Liang, Suchi Saria

Published 2026-03-03

Imagine you are trying to solve a massive, complex mystery. You have a team of 100 different detectives, each specializing in a different type of clue: one is great at reading handwriting, another at analyzing fingerprints, a third at listening to audio recordings, and a fourth at interpreting medical charts.

In the world of Artificial Intelligence, this is called Multimodal Learning. Usually, AI models try to listen to all 100 detectives at once, mixing their voices into a giant, confusing shout. This works okay for simple cases, but when the clues are complex, noisy, and happen at different times, the AI gets overwhelmed.

This paper introduces a new framework called MERGE (Massively-multimodal Expert Routing for Generalized Exchange). Think of MERGE as a brilliant Mission Control Commander who doesn't just listen to the detectives; it understands how they relate to each other over time and assigns the right detective to the right job at the exact right moment.

Here is the breakdown of how it works, using simple analogies:

1. The Problem: The "Static" Commander

Most current AI models use a "Mixture of Experts" (MoE). Imagine a manager who assigns tasks to employees based on how similar the task looks to the employee's resume.

  • The Flaw: This manager only looks at the present moment. They don't realize that a clue found in a medical chart yesterday might explain a symptom happening today. They miss the delayed connections.
  • The Result: The AI gets confused. It might ask the "Fingerprint Expert" to analyze a "Sound Recording" just because they look similar on paper, leading to bad decisions.
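The "static" manager above can be sketched as standard MoE gating: a softmax over dot-product similarities between the current token and each expert's key, with no memory of what came before. This is a minimal toy sketch, not the paper's actual routing code; the function and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def static_moe_route(token, expert_keys):
    """Standard MoE gating: pick the expert whose key is most similar
    to the *current* token -- no temporal context is consulted."""
    scores = expert_keys @ token            # dot-product similarity per expert
    probs = np.exp(scores - scores.max())   # numerically stable softmax
    probs /= probs.sum()
    return int(np.argmax(probs)), probs

num_experts, dim = 4, 8
expert_keys = rng.normal(size=(num_experts, dim))  # each expert's "resume"
token = rng.normal(size=dim)                       # the clue arriving right now

chosen, probs = static_moe_route(token, expert_keys)
```

Because the gate only sees `token`, a clue from yesterday that explains today's symptom never influences the routing decision, which is exactly the flaw the paper targets.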

2. The Solution: The "Time-Traveling" Commander (MERGE)

MERGE changes the game by asking a crucial question: "How does Clue A from 5 minutes ago affect Clue B right now?"

It uses a concept called RUS (Redundancy, Uniqueness, Synergy) to map the relationships between clues over time:

  • Redundancy (The Echo): If two sensors (like a heart rate monitor and a pulse oximeter) are saying the exact same thing, MERGE knows they are "echoing." It sends them to the same expert to save energy and avoid repetition.
  • Uniqueness (The Soloist): If a sensor (like a specific blood test) provides a piece of information no one else has, MERGE sends it to a specialized expert who can focus entirely on that unique signal.
  • Synergy (The Power Couple): Sometimes, two clues only make sense when combined after a delay. For example, a patient takes a drug (Clue A), and 2 hours later, their fever spikes (Clue B). Alone, they look random. Together, they tell a story. MERGE recognizes this "delayed dance" and sends them to a Synergy Expert designed to connect the dots.
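The three interaction types can be pictured as tagging each modality pair by its dominant RUS term. The sketch below is a toy, not the paper's estimator: the scores are hand-picked stand-ins for quantities the real system would measure, and the modality names are illustrative.

```python
def classify_interaction(scores):
    """Toy RUS tagger: label a modality pair by whichever of
    redundancy, uniqueness, or synergy dominates its score dict."""
    return max(scores, key=scores.get)

# Hand-picked scores mimicking the three cases from the analogies above.
pairs = {
    ("heart_rate", "pulse_ox"):   {"redundancy": 0.9, "uniqueness": 0.1, "synergy": 0.2},
    ("blood_test", "heart_rate"): {"redundancy": 0.1, "uniqueness": 0.8, "synergy": 0.1},
    ("drug_dose", "temperature"): {"redundancy": 0.1, "uniqueness": 0.2, "synergy": 0.9},
}
labels = {pair: classify_interaction(s) for pair, s in pairs.items()}
```

In the real framework these labels would then steer which expert (shared, dedicated, or synergy) each pair's tokens are routed to.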

3. The Magic Tool: The "Interaction Map"

To figure out these relationships, MERGE uses a special calculator (called a Multi-scale BATCH Estimator).

  • Analogy: Imagine you are trying to understand a conversation between two people who speak different languages and have a 5-second delay in their translation earpieces.
  • The Old Way: You try to guess the connection by listening to one second of audio. You fail.
  • The MERGE Way: It records the whole conversation, analyzes the delays, and creates a map showing exactly when Person A's words influenced Person B's reaction. It does this for every pair of sensors in the system.
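The "record the whole conversation and scan the delays" idea can be illustrated with a toy lag scan. This is a simplified stand-in for the paper's Multi-scale BATCH Estimator, using plain Pearson correlation at each candidate delay; the synthetic signals and names are invented for the example.

```python
import numpy as np

def lagged_interaction(x, y, max_lag):
    """Score how strongly past values of x relate to current values of y,
    via absolute Pearson correlation at each candidate delay."""
    scores = {}
    for lag in range(max_lag + 1):
        a = x if lag == 0 else x[:-lag]   # x shifted 'lag' steps into the past
        b = y if lag == 0 else y[lag:]
        scores[lag] = abs(np.corrcoef(a, b)[0, 1])
    return scores

rng = np.random.default_rng(1)
drug = rng.normal(size=200)
fever = np.roll(drug, 3) + 0.1 * rng.normal(size=200)  # effect shows up 3 steps later
fever[:3] = rng.normal(size=3)                         # scrub the wrap-around

scores = lagged_interaction(drug, fever, max_lag=5)
best_lag = max(scores, key=scores.get)                 # the delay with the strongest link
```

Repeating this for every pair of sensors yields the interaction map: a table of which signal influences which, and after how long a delay.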

4. The Result: A Smarter, Faster Team

Once MERGE has this map, it acts as a traffic controller for the AI's "Experts" (the neural network layers):

  • Smart Routing: Instead of randomly guessing which expert should handle a piece of data, MERGE directs tokens (pieces of data) based on the interaction map.
  • Interpretability: Because the routing is based on clear rules (Redundancy, Uniqueness, Synergy), humans can look at the AI's decisions and say, "Ah, it grouped these two sensors together because they were redundant," rather than it being a "black box."
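The traffic-controller step can be sketched as a lookup from interaction labels to expert assignments: redundant pairs share one expert, synergistic pairs get a joint expert, and uniquely informative modalities get their own. This is a toy router under simplifying assumptions (one dominant label per pair, string-named experts), not the paper's learned gating network.

```python
def assign_experts(labels):
    """Toy interaction-aware router. 'labels' maps (modality_a, modality_b)
    pairs to their dominant RUS type; expert names are illustrative."""
    assignment = {}
    for (a, b), kind in labels.items():
        if kind == "redundancy":
            # Echoing sensors share one expert to avoid duplicated work.
            assignment[a] = assignment[b] = f"shared({a},{b})"
        elif kind == "synergy":
            # The pair only makes sense together, so route it jointly.
            assignment[(a, b)] = f"synergy({a},{b})"
        else:  # uniqueness: give each modality its own dedicated expert
            assignment.setdefault(a, f"solo({a})")
            assignment.setdefault(b, f"solo({b})")
    return assignment

labels = {
    ("heart_rate", "pulse_ox"):   "redundancy",
    ("drug_dose", "temperature"): "synergy",
    ("blood_test", "ecg"):        "uniqueness",
}
assignment = assign_experts(labels)
```

Because the assignment table is built from the named labels, a human can read off exactly why two sensors ended up at the same expert, which is the interpretability point above.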

Real-World Impact

The authors tested this on three very different worlds:

  1. Healthcare (ICU): Predicting if a patient will survive. MERGE realized that a drop in oxygen levels now is often caused by a medication given hours ago, and connected these delayed dots better than the baseline models it was compared against.
  2. Activity Recognition: Tracking how people move. It noticed that your arm swing and your leg movement are "redundant" (they move together), so it processed them together efficiently.
  3. Emotion Detection: Analyzing video, audio, and text. It understood that a sarcastic tone (audio) might contradict the words (text) only after a specific pause, allowing it to catch the sarcasm.

The Bottom Line

MERGE is like upgrading a chaotic newsroom into a highly organized one. Instead of everyone shouting at once, the editor (MERGE) knows exactly which reporters (sensors) are repeating the same story, which ones have the scoop, and which ones need to be paired up to reveal the full truth, especially when the truth takes a little time to unfold.

This makes the AI smarter (higher accuracy), faster (by not wasting energy on redundant data), and easier to trust (because we can see why it made a decision).
