CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning

This paper proposes CLCR, a multimodal learning framework that addresses semantic misalignment. It organizes features into a three-level semantic hierarchy and employs specialized Intra-Level and Inter-Level domains to selectively exchange shared semantics while preserving private information, achieving superior performance across diverse tasks.

Chunlei Meng, Guanhong Huang, Rong Fu, Runmin Jian, Zhongxue Gan, Chun Ouyang

Published 2026-02-24

Imagine you are trying to understand a complex story told by three different friends: Alice (who speaks), Bob (who shows pictures), and Charlie (who makes sounds).

In the world of Artificial Intelligence, this is called Multimodal Learning. The goal is to combine their stories to get the full picture.

The Problem: The "Chaotic Coffee Shop"

Most current AI methods try to mix all three friends' stories into one big pot. They take a word from Alice, a frame from Bob, and a sound from Charlie, and throw them together immediately.

The paper argues this is like putting everyone in a noisy, chaotic coffee shop.

  • The Mismatch: Alice is talking about the beginning of the story (shallow details), while Bob is showing the ending (deep context). When you mix them, the AI gets confused. It's like trying to understand a punchline before hearing the setup.
  • The Leak: Sometimes, Alice's private joke (which only makes sense to her) accidentally leaks into the group conversation, confusing everyone else.
  • The Result: The AI suffers from "semantic misalignment": it hears the words but misses the meaning, leading to errors.

The Solution: CLCR (The "Structured Conference")

The authors propose a new system called CLCR (Cross-Level Semantic Collaborative Representation). Instead of a chaotic coffee shop, they organize a structured, three-tiered conference.

Here is how it works, using simple analogies:

1. The Three-Level Hierarchy (The Three Floors)

Instead of mixing everything at once, CLCR organizes the information into three distinct "floors" or levels for each friend:

  • Floor 1 (Shallow): The basics. For Alice, this is individual words. For Bob, it's simple shapes or colors. For Charlie, it's the raw sound wave.
  • Floor 2 (Mid): The structure. For Alice, this is sentences and grammar. For Bob, it's recognizing a face or a car. For Charlie, it's the rhythm of speech.
  • Floor 3 (Deep): The big picture. For Alice, this is the speaker's intent or emotion. For Bob, it's the whole scene context. For Charlie, it's the overall mood.

The Rule: You only talk to people on the same floor. You don't mix Alice's raw sounds (Floor 1) with Bob's deep scene analysis (Floor 3). This prevents confusion.
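The "same floor only" rule can be sketched in a few lines. This is an illustrative toy, not the paper's architecture: the random projections stand in for learned encoder layers, and the dimensions and modality names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_levels(x, dims=(16, 32, 64)):
    """Hypothetical encoder: map one modality's raw input into
    shallow, mid, and deep feature levels. Random projections
    stand in for learned layers; each level builds on the last."""
    levels = {}
    h = x
    for name, d in zip(("shallow", "mid", "deep"), dims):
        W = rng.standard_normal((h.shape[-1], d)) / np.sqrt(h.shape[-1])
        h = np.tanh(h @ W)
        levels[name] = h
    return levels

# Three modalities with different raw dimensionalities (Alice, Bob, Charlie)
text, vision, audio = (rng.standard_normal(8),
                       rng.standard_normal(12),
                       rng.standard_normal(10))
feats = {m: encode_levels(x)
         for m, x in [("text", text), ("vision", vision), ("audio", audio)]}

# The rule: features are only combined with features from the SAME level
fused_per_level = {
    lvl: np.concatenate([feats[m][lvl] for m in ("text", "vision", "audio")])
    for lvl in ("shallow", "mid", "deep")
}
for lvl, v in fused_per_level.items():
    print(lvl, v.shape)  # shallow (48,), mid (96,), deep (192,)
```

Notice that text's shallow features never touch vision's deep features; mixing is confined to each floor.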

2. IntraCED: The "Shared vs. Private" Breakout Rooms

Once everyone is on the correct floor, they need to collaborate. But how do they share without leaking secrets?

  • The Split: At every floor, the AI splits the information into two rooms:
    • The Shared Room: Where everyone agrees (e.g., "The person is angry").
    • The Private Room: Where only one person speaks (e.g., Alice's specific accent or Bob's specific camera angle).
  • The Token Budget (The Ticket System): Not every piece of information is important enough to share. CLCR uses a "token budget"—a limited supply of tickets. It asks, "Is this specific word or sound strong enough to enter the Shared Room?" If not, it stays in the Private Room. This stops the conversation from getting cluttered with noise.
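The split-plus-budget idea can be sketched as a top-k selection. This is a hedged illustration: the relevance scores here are random stand-ins for whatever learned scoring head the paper actually uses, and the function name is invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def split_shared_private(tokens, budget):
    """Hypothetical IntraCED-style split: score each token's cross-modal
    relevance, then admit only the top-`budget` tokens into the shared
    stream. Everything else stays private. Random scores stand in for a
    learned relevance head."""
    scores = rng.random(len(tokens))
    order = np.argsort(scores)[::-1]           # highest relevance first
    shared_idx = set(order[:budget].tolist())  # only `budget` tickets exist
    shared  = [t for i, t in enumerate(tokens) if i in shared_idx]
    private = [t for i, t in enumerate(tokens) if i not in shared_idx]
    return shared, private

tokens = [f"tok{i}" for i in range(10)]
shared, private = split_shared_private(tokens, budget=3)
print(len(shared), len(private))  # 3 7
```

The budget is what keeps the Shared Room quiet: no matter how chatty a modality is, only its strongest signals cross over.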

3. InterCAD: The "Executive Summary"

After the floor-by-floor discussions, the AI needs to make a final decision.

  • The Anchor: The system looks at the summaries from all three floors and asks, "Which floor is most important for this specific task?"
    • If the task is recognizing a specific action (like "jumping"), the Shallow Floor (movement) gets the most weight.
    • If the task is understanding sarcasm, the Deep Floor (intent) gets the most weight.
  • The Gatekeeper: It combines the "Shared" summaries from the best floors and adds the "Private" details only if they are reliable. This creates a final, compact, and accurate understanding of the situation.
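The anchor-and-gatekeeper step amounts to a task-weighted sum over level summaries plus reliability-gated private detail. Again, a minimal sketch under stated assumptions: the task logits and reliability scores are hand-picked inputs here, whereas in the paper they would come from learned modules.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_levels(shared_summaries, private_summaries, task_logits, reliability):
    """Hypothetical InterCAD-style fusion: weight per-level shared
    summaries by task-dependent importance (the "anchor"), then add each
    level's private summary only in proportion to its reliability score
    (the "gatekeeper")."""
    w = softmax(task_logits)  # which floor matters most for this task
    anchor = sum(wi * s for wi, s in zip(w, shared_summaries))
    gated_private = sum(r * p for r, p in zip(reliability, private_summaries))
    return anchor + gated_private

rng = np.random.default_rng(2)
shared  = [rng.standard_normal(8) for _ in range(3)]  # shallow, mid, deep
private = [rng.standard_normal(8) for _ in range(3)]

# e.g. a sarcasm-like task: the deep floor dominates the anchor,
# and only the deep private details are judged reliable enough to pass
out = fuse_levels(shared, private,
                  task_logits=np.array([0.1, 0.5, 3.0]),
                  reliability=np.array([0.0, 0.1, 0.6]))
print(out.shape)  # (8,)
```

Changing `task_logits` shifts the anchor toward a different floor, which is how a single model can adapt to action-recognition-style tasks (shallow-heavy) and sarcasm-style tasks (deep-heavy).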

Why is this better?

The paper tested this on tasks like recognizing emotions in videos, finding events in audio, and analyzing sentiment.

  • Old Way: Like trying to solve a puzzle by dumping all the pieces on the floor and hoping they fit.
  • CLCR Way: Like sorting the puzzle pieces by edge, color, and pattern first, then assembling them section by section.

The Result: CLCR is much more accurate. It doesn't get confused by mismatched information, it doesn't let private details ruin the shared understanding, and it adapts to whether the task needs "quick details" or "deep meaning."

In a Nutshell

CLCR is a smart AI architect that stops trying to force a square peg into a round hole. It organizes information by depth (shallow to deep), keeps private details separate from shared facts, and only lets the most important information cross over to help the team solve the problem. This leads to AI that understands context, emotion, and events much more like a human does.
