CaReFlow: Cyclic Adaptive Rectified Flow for Multimodal Fusion

The Big Problem: The "Language Barrier" Between Senses

Imagine you are trying to understand a movie scene. You have three sources of information:

The Visuals: What the actors look like (facial expressions).
The Audio: How they sound (tone of voice).
The Script: What they are actually saying (words).

In a perfect world, these three things would all "speak the same language." But in reality, they are like three different tribes living in different countries.

The Visuals speak "Face-ese."
The Audio speaks "Tone-ese."
The Script speaks "Word-ese."

When a computer tries to combine them to understand if a person is happy or sad, it's like trying to mix oil and water. They don't blend well. This is called the "Modality Gap." Because they don't mix, the computer gets confused and makes bad guesses.

The Old Solutions: Trying to Force a Handshake

Previous methods tried to fix this by forcing the Visuals and Audio to shake hands with the Script one-on-one.

The Problem: It's like trying to teach a person from Country A to speak Country B's language by only pairing them with one specific person from Country B. They might learn that one person's accent, but they won't understand the whole country's culture.
The Result: The computer still struggles to understand the big picture, especially if it doesn't have enough perfect examples to study.

The New Solution: CaReFlow (The "Universal Translator" Bus)

The authors created CaReFlow (Cyclic Adaptive Rectified Flow). Think of this as a high-tech, magical bus system that transports information from the "Visual/Audio" countries to the "Script" country.

Here is how CaReFlow works, broken down into three superpowers:

1. The "One-to-Many" Bus Ride (Seeing the Whole City)

Instead of pairing one Visual with one Script, CaReFlow lets a single Visual data point look at the entire city of Scripts.

The Analogy: Imagine you are a tourist in a new city. Old methods told you to only look at one specific landmark. CaReFlow puts you on a bus that drives you past every landmark, park, and street.
Why it helps: Now, your Visual data understands the whole vibe of the Script language, not just a tiny snippet. This makes the translation much more robust.

2. The "Adaptive Relaxation" Rule (Strict vs. Chill)

The bus has a smart driver who knows when to be strict and when to be chill.

Strict Mode: If the Visual and Script come from the same person in the same scene, the driver forces them to align perfectly. "You two are a team; you must match!"
Chill Mode: If the Visual and Script come from different people or different scenes, the driver relaxes the rules. "You don't have to be identical, just be in the same neighborhood."
Why it helps: This prevents the computer from getting confused. It knows exactly which data points need to match perfectly and which ones just need to be generally similar. It solves the "Who do I match with?" confusion.

3. The "Cyclic" Round Trip (Don't Lose Your Luggage)

Sometimes, when you translate something, you lose the original flavor. If you translate a poem from English to French, you might lose the rhyme.

The Analogy: CaReFlow has a "Return Ticket." After it translates the Visuals into the Script language, it immediately tries to translate them back to the original Visuals.
Why it helps: If the computer can't translate it back, it knows it lost some important details. This "Round Trip" ensures that no information is lost during the journey. The final result keeps the best of both worlds.

The Result: A Happy Marriage of Senses

Once CaReFlow does its job, the Visuals, Audio, and Scripts are no longer strangers in different countries. They are now neighbors who speak the same language.

The Test: The researchers tested this on datasets where computers had to guess emotions (like "Is this person sarcastic?" or "Are they happy?").
The Outcome: Even though CaReFlow used a very simple method to combine the data (just a basic "glue" called a simple fusion network), it beat almost every other complex method out there.
The Visual Proof: When they drew a map of how the data looks, the "Visual" dots and "Script" dots were far apart in old methods. With CaReFlow, they were huddled together in a tight, happy group.

Summary

CaReFlow is like a smart, efficient translator bus. It doesn't just match one person to another; it lets everyone see the whole crowd, knows when to be strict and when to be flexible, and checks its luggage on the way back to make sure nothing was lost. The result? A computer that finally understands human emotions by truly "hearing" and "seeing" them together.

1. Problem Statement

The core challenge the paper addresses in Multimodal Affective Computing (MAC) is the "modality gap."

Definition: Data from different modalities (e.g., visual, acoustic, and language) reside in distinct, non-aligned regions of the feature space due to their heterogeneous nature and different feature extractors.
Consequence: This distributional discrepancy prevents vanilla multimodal models from effectively modeling inter-dependencies, leading to sub-optimal fusion and poor generalization. In some cases, a multimodal model performs worse than a unimodal (language-only) model.
Limitations of Existing Methods:
- Contrastive Learning & Transformers: Often focus on one-to-one alignment within a single sample. They fail to expose source data points to the "global distributional context" of the target modality.
- Generative Models (GANs/Diffusion): While capable of distribution mapping, they often suffer from high computational costs (recursive training) or slow inference. They also typically lack mechanisms to handle the ambiguity of mapping one source point to a complex target distribution without losing specific modality information.
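To make the definition of the modality gap concrete, here is a minimal sketch of one simple proxy for it: the Euclidean distance between the mean feature vectors of two modalities. The function name `centroid_gap` and the choice of a centroid distance are assumptions made for illustration; this is not necessarily the measure the authors use.

```python
import numpy as np

def centroid_gap(feats_a, feats_b):
    """Euclidean distance between the mean feature vectors of two modalities.

    feats_a, feats_b: arrays of shape (n_samples, feature_dim).
    Illustrative proxy only (an assumption), not the paper's metric.
    """
    center_a = feats_a.mean(axis=0)
    center_b = feats_b.mean(axis=0)
    return float(np.linalg.norm(center_a - center_b))
```

A larger value suggests the two modalities occupy more distant regions of the feature space; a value near zero only says the centroids coincide, not that the full distributions match.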

2. Methodology: CaReFlow

The authors propose CaReFlow (Cyclic Adaptive Rectified Flow), a framework that reformulates modality gap reduction as a distribution mapping task using Rectified Flow. The goal is to map the distributions of source modalities (visual, acoustic) to the dominant target modality (language) via a straight, fast trajectory.

The framework consists of three key innovations:

A. One-to-Many Mapping Strategy

Unlike traditional methods that map a source sample to a single target sample, CaReFlow leverages the Rectified Flow mechanism to allow each source data point to observe the global distribution of the target modality.

Mechanism: During the transformation (rectification) process, a source data point is influenced by the broader target distribution rather than a single paired point.
Benefit: This mitigates the issue of insufficient paired data within samples and enables a more robust learning of the distribution transformation trajectory.
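The one-to-many idea can be sketched with the standard rectified-flow construction, in which a point on the straight line between a source and a target is $x_t = (1 - t)\,x_0 + t\,x_1$ and the regression target for the velocity network is $x_1 - x_0$. The sketch below assumes that every source feature in a mini-batch is paired with every target feature in that batch; the function name `one_to_many_pairs` and this exhaustive pairing scheme are illustrative assumptions, not the authors' confirmed implementation.

```python
import numpy as np

def one_to_many_pairs(source, target, rng):
    """Build rectified-flow training tuples for all source/target pairs.

    source: (b, d) source-modality features (e.g. visual or acoustic).
    target: (n, d) target-modality features (e.g. language).
    Returns interpolated points x_t, times t, and velocity targets x1 - x0,
    each with b * n rows. Exhaustive pairing is an assumption of this sketch.
    """
    b = source.shape[0]
    n = target.shape[0]
    x0 = np.repeat(source, n, axis=0)   # each source row repeated n times
    x1 = np.tile(target, (b, 1))        # target batch cycled b times
    t = rng.uniform(size=(b * n, 1))    # one random time per pair
    x_t = (1.0 - t) * x0 + t * x1       # point on the straight path
    v_target = x1 - x0                  # constant velocity of that path
    return x_t, t, v_target
```

A velocity network would then be trained to predict `v_target` from `(x_t, t)`, so each source point receives a training signal from many target points rather than from a single paired one.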

B. Adaptive Relaxed Alignment

To address the ambiguity inherent in one-to-many mapping (where a source point could theoretically map to many target points), CaReFlow introduces a dynamic alignment constraint.

Strict Alignment: For modality pairs belonging to the same sample, the alignment is enforced strictly (margin $\eta = 0$).
Relaxed Alignment: For pairs from different samples, the alignment constraint is relaxed based on the semantic similarity of their labels.
- If labels are the same/similar: A small margin ($\epsilon$) is applied.
- If labels are different: A larger margin is applied.
Mathematical Formulation: The loss function incorporates a margin $\eta_{m1,m2}$ that sets the strictness of the rectification: it is zero when the two modalities come from the same sample, and otherwise depends on the label distance between the two samples. This allows the model to learn accurate mappings without requiring recursive training iterations.
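The margin rule above can be sketched as follows. The summary does not give the exact formula, so the linear form `eps + scale * |y_src - y_tgt|`, the hinge-style loss, and the names `adaptive_margin`, `relaxed_flow_loss`, `eps`, and `scale` are all assumptions chosen to match the three described cases (same sample, similar labels, different labels).

```python
import numpy as np

def adaptive_margin(idx_src, idx_tgt, y_src, y_tgt, eps=0.1, scale=1.0):
    """Per-pair margin eta (hypothetical form).

    idx_*: sample indices of the source and target feature in each pair.
    y_*:   scalar labels (e.g. sentiment scores) of those samples.
    Same sample -> 0; otherwise eps plus a term growing with label distance.
    """
    same_sample = idx_src == idx_tgt
    label_dist = np.abs(y_src - y_tgt)
    return np.where(same_sample, 0.0, eps + scale * label_dist)

def relaxed_flow_loss(v_pred, v_target, eta):
    """Squared velocity error, penalized only where it exceeds the margin."""
    err = np.sum((v_pred - v_target) ** 2, axis=-1)
    return float(np.mean(np.maximum(err - eta, 0.0)))
```

With a zero margin the loss reduces to an ordinary squared-error flow loss, so same-sample pairs are matched strictly, while pairs with distant labels are only penalized for large deviations.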

C. Cyclic Rectified Flow (Information Preservation)

To prevent the loss of discriminative information from the source modality during the transformation to the target distribution, CaReFlow employs a cycle-consistency constraint.

Forward Flow: Maps source features ($X_m$) to the target distribution ($X_{m,l}$).
Backward Flow: Maps the transformed features ($X_{m,l}$) back to the original source features ($X_m$).
Objective: This ensures that the transformed features retain sufficient modality-specific information necessary for the final fusion and prediction tasks.
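A minimal sketch of the cycle-consistency constraint is given below. It assumes two separate velocity functions (`v_forward`, `v_backward`), each applied as a single Euler step starting at its own time 0, and a squared-error reconstruction penalty; these choices and all function names are assumptions for illustration rather than the paper's exact formulation.

```python
import numpy as np

def one_step(x, velocity):
    """Single Euler step over the unit time interval, starting at t = 0."""
    t0 = np.zeros((x.shape[0], 1))
    return x + velocity(x, t0)

def cycle_consistency_loss(x_src, v_forward, v_backward):
    """Map source -> target space -> back, and penalize reconstruction error.

    v_forward, v_backward: callables (x, t) -> velocity with x's shape.
    """
    x_mapped = one_step(x_src, v_forward)    # plays the role of X_{m,l}
    x_back = one_step(x_mapped, v_backward)  # attempted recovery of X_m
    return float(np.mean(np.sum((x_back - x_src) ** 2, axis=-1)))
```

If the backward flow cannot undo the forward flow, the loss is positive, which signals that the forward mapping discarded information the source features carried.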

Optimization: The total loss combines the main task prediction loss, the forward rectified flow loss, and the backward cyclic loss. The model uses a simple Multi-Layer Perceptron (MLP) as the drift force (velocity vector-field) network, parameterized with time embeddings.
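For illustration, a time-conditioned MLP velocity field might look like the sketch below. The summary only states that the network is an MLP with time embeddings; the sinusoidal embedding, the single hidden layer with ReLU, the layer sizes, and the names `time_embedding`, `init_mlp`, and `velocity` are assumptions.

```python
import numpy as np

def time_embedding(t, dim=8):
    """Sinusoidal embedding of t in [0, 1]; t has shape (n, 1) -> (n, dim)."""
    freqs = 2.0 ** np.arange(dim // 2)     # geometric frequency ladder
    angles = 2.0 * np.pi * t * freqs       # broadcasts to (n, dim // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def init_mlp(rng, in_dim, hidden, out_dim):
    """Small random initialization for a one-hidden-layer MLP."""
    return {
        "w1": rng.normal(scale=0.1, size=(in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(scale=0.1, size=(hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def velocity(params, x, t, t_dim=8):
    """Predict the drift at features x and time t."""
    h = np.concatenate([x, time_embedding(t, t_dim)], axis=-1)
    h = np.maximum(h @ params["w1"] + params["b1"], 0.0)  # ReLU
    return h @ params["w2"] + params["b2"]
```

Here `in_dim` must equal the feature dimension plus `t_dim`, because the time embedding is concatenated to the input features before the first layer.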

3. Key Contributions

Novel Formulation: According to the authors, this is the first adaptation of Rectified Flow to the multimodal fusion domain, treating the modality gap as a distribution mapping problem.
CaReFlow Framework: A novel architecture featuring:
- One-to-Many Mapping: For robust global distribution alignment.
- Adaptive Relaxed Alignment: To resolve mapping ambiguity and prioritize same-sample/same-category pairs.
- Cyclic Information Flow: To ensure information preservation during distribution transformation.
Performance with Simplicity: Demonstrates that even with a simple fusion method (concatenation + MLP), reducing the modality gap via CaReFlow yields state-of-the-art results.

4. Experimental Results

CaReFlow was evaluated on five benchmark datasets for Multimodal Affective Computing:

Sentiment Analysis (MSA): CMU-MOSI, CMU-MOSEI, CH-SIMS-v2.
Humor/Sarcasm Detection (MHD/MSD): UR-FUNNY, MUStARD.

Key Findings:

State-of-the-Art (SOTA) Performance: CaReFlow outperformed existing baselines (including DLF, MulT, CLGSI, and Diffusion Bridge) across all datasets.
- Example: On CMU-MOSI, it improved both Acc7 and Acc2 by over 1 point compared to the previous SOTA (DLF).
- Example: On CH-SIMS-v2, it improved Acc5 by over 4 points compared to baselines.
Ablation Studies:
- Removing Distribution Alignment caused a significant performance drop (e.g., -4.5 points in Acc2 on CH-SIMS-v2), indicating that gap reduction is necessary for the reported gains.
- Removing Cyclic Flow or Adaptive Alignment also led to notable declines, confirming their roles in information preservation and mapping accuracy.
- One-to-Many Mapping was identified as the most critical component for performance gains.
Comparison with Other Methods: CaReFlow outperformed contrastive learning (CLGSI) and diffusion-based methods, achieving better alignment while using fewer parameters than complex transformer-based baselines.
Visualization: t-SNE visualizations confirmed that CaReFlow significantly reduces the distance between different modalities in the feature space compared to other distribution mapping methods.

5. Significance

Efficiency: CaReFlow achieves superior alignment without the high computational cost of recursive training or complex generative models. It uses a single-step (or two-step Euler) inference process.
Robustness: The method is robust to hyperparameter variations and works effectively even with simple downstream fusion networks, suggesting that pre-fusion distribution alignment is a critical factor often overlooked in multimodal research.
Generalizability: The approach is not limited to sentiment analysis; it successfully generalizes to humor and sarcasm detection, indicating its potential for broader multimodal tasks.
Theoretical Insight: The paper highlights that exposing source data to the global target distribution (one-to-many) and enforcing semantic-aware constraints (adaptive relaxed alignment) are more effective than strict one-to-one pairing for bridging the modality gap.
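The single-step or two-step Euler inference mentioned under Efficiency can be sketched as below. The function name `euler_sample` and the callable signature `velocity(x, t)` are assumptions; the update rule itself is the ordinary forward Euler method on the unit time interval.

```python
import numpy as np

def euler_sample(x0, velocity, steps=1):
    """Transport x0 from t = 0 to t = 1 with `steps` forward Euler updates.

    velocity: callable (x, t) -> array with x's shape, t of shape (n, 1).
    """
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full((x.shape[0], 1), k * dt)  # time at the start of the step
        x = x + dt * velocity(x, t)
    return x
```

When the learned trajectory is straight, meaning the velocity is constant along the path, one step and two steps land on the same point, which is why rectified flows can afford very few inference steps.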