LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation

Imagine you are a personal shopper trying to guess what a customer wants to buy next.

The Old Way: The "Single-Store" Shopper

Traditionally, recommendation systems (like the ones on Amazon or Netflix) act like shoppers who only know one store.

If you buy a coffee maker, they suggest coffee beans.
If you watch a sci-fi movie, they suggest another sci-fi movie.

The Problem: This approach has two big flaws:

Data Sparsity: If you only buy one thing, the shopper doesn't know enough about you to make a good guess.
The "Domain Wall": They don't realize that the person who buys high-end kitchen gadgets might also love cooking shows on TV. They treat your "kitchen life" and your "movie life" as two completely separate people.

The New Solution: The "Super-Shopper" (LLM-EMF)

This paper introduces a new system called LLM-EMF. Think of it as upgrading your personal shopper into a Super-Shopper with three superpowers:

1. The "Imagination" Power (LLM Enhancement)

Sometimes, a product label is boring. It just says "Red Dress."

Old Way: The system sees "Red Dress" and stops there.
LLM-EMF: It asks a Super-Brain (Large Language Model): "Hey, tell me everything interesting about this red dress. Is it for a summer party? Does it look vintage? Who usually wears this?"
The Result: The Super-Brain generates a rich story and keywords. Now, instead of just "Red Dress," the system understands it as "Elegant, vintage-style, summer party attire." This helps the system connect the dress to other items like "floral hats" or "sandals," even if they are in different categories.

2. The "Eagle Eye" Power (Multimodal Fusion)

The Super-Shopper doesn't just read text; they look at the items too.

Visuals: They look at the actual photo of the item (using a tool called CLIP, a model trained to match pictures with the words that describe them).
Text: They read the description and the LLM's generated story.
The ID: They also remember the item's unique barcode.
The Magic: The system combines the picture, the story, and the barcode into one super-detailed profile for every item. It's like having a 3D hologram of the product instead of just a flat picture.

3. The "Balanced Scale" Power (Hierarchical Attention)

Here is the tricky part. Imagine a user who buys 100 books but only 1 kitchen gadget.

The Old Problem: A normal system would get obsessed with the books and ignore the kitchen gadget, thinking, "They only care about reading!"
The LLM-EMF Fix: This system has a Smart Scale. It looks at your history and says, "Okay, they love books, but that one kitchen gadget is really important to them. Let's not let the books drown out the kitchen item."
It carefully balances the influence of different areas of your life so that a small but significant interest isn't lost in the noise of a big interest.

How It Works in Real Life

Let's say you are a user named Alex.

History: You have bought plenty of kitchenware, most recently a "Cast Iron Skillet" (Kitchen), and watched a single cooking show, "Chef's Table" (Movie).
The Process:
- The LLM reads "Cast Iron Skillet" and adds context: "Durable, heavy, great for searing steaks, vintage aesthetic."
- The Visual Engine sees the skillet's photo and understands it's "rustic and black."
- The Balanced Scale notices you have a long history of buying kitchenware but a short history of watching cooking shows. It gives extra weight to your cooking show interest so it doesn't get ignored.
The Prediction: The system realizes that because you like "rustic, durable cooking tools" AND "cooking shows," you are likely to buy a high-end chef's knife next.
The Result: It recommends the knife.

Why This Paper Matters

Previous systems were like specialists who only knew one thing well. This new system is a generalist with a photographic memory and a creative writer's imagination.

It connects the dots between different parts of your life (Cross-Domain).
It understands the vibe of an item, not just its name (Multimodal).
It listens to your small interests as much as your big ones (Balanced Attention).

The authors tested this on real shopping data (pairs of categories such as Food and Kitchen, or Movies and Books) and found that this "Super-Shopper" was better at guessing what you want to buy next than the earlier methods it was compared against. It's a smarter, more human-like way to recommend things.

1. Problem Statement

Cross-Domain Sequential Recommendation (CDSR) aims to predict a user's next interaction by leveraging historical sequences across multiple domains (e.g., Food and Kitchen, or Movies and Books). Traditional Sequential Recommendation (SR) models handle single domains well, but they and existing CDSR methods face several limitations:

Data Sparsity: Limited interaction data within specific domains.
Domain Bias: Overfitting to domain-specific patterns, hindering generalization.
Underutilization of Multimodal Data: Existing CDSR methods often rely solely on Item IDs, ignoring rich visual and textual metadata.
Domain Imbalance: In cross-domain settings, high-frequency domains often dominate the learning process, suppressing signals from smaller domains.
LLM Limitations: Recent LLM-based approaches often augment text but fail to explicitly address domain imbalance or integrate multimodal (visual) features effectively.

2. Methodology: LLM-EMF Framework

The proposed LLM-EMF framework integrates Large Language Models (LLMs), multimodal fusion (Visual, Textual, ID), and a hierarchical attention mechanism. The architecture consists of four main stages:

A. Prompt-Based LLM Augmentation

To bridge semantic gaps between domains, the authors use an LLM (specifically DeepSeek-R1) to generate domain-agnostic contextual knowledge.

Process: A predefined prompt template takes an item's title and domain as input.
Output: The LLM generates enriched text containing key attributes, detailed insights, and potential user interests.
Goal: Create domain-agnostic semantic attributes that improve alignment between different domains (e.g., linking "cooking" in the Food domain to "kitchenware" in the Kitchen domain).
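
The prompt step above can be sketched roughly as follows. This is only an illustration: the template wording and the `build_prompt` helper are hypothetical and are not the authors' actual prompt, and the call to the LLM itself is left out.

```python
# Hypothetical prompt template; the paper's exact wording is not reproduced here.
PROMPT_TEMPLATE = (
    "You are describing a product for a recommender system.\n"
    "Domain: {domain}\n"
    "Title: {title}\n"
    "List the item's key attributes, detailed insights, and the interests "
    "of users who might want it. Avoid domain-specific jargon."
)


def build_prompt(title: str, domain: str) -> str:
    """Fill the template with one item's title and domain."""
    return PROMPT_TEMPLATE.format(title=title.strip(), domain=domain.strip())


# The resulting string would be sent to the LLM; its reply becomes the enriched text.
print(build_prompt("Cast Iron Skillet", "Kitchen"))
```

Keeping the template fixed and filling in only the title and domain means every item is described with the same kind of attributes, which is what makes the generated text comparable across domains.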

B. Multimodal Feature Integration

The framework constructs three types of embeddings for every item:

ID Embeddings ( $E_{id}$ ): Learnable vectors, one per item ID, trained together with the rest of the model.
Visual Embeddings ( $E_{img}$ ): Generated using a frozen CLIP image encoder.
Textual Embeddings ( $E_{tex}$ ): Generated using the frozen CLIP text encoder on the original titles and the LLM-augmented text.
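
A shape-level sketch of the three embeddings, under stated assumptions: `stand_in_encoder` is a deterministic placeholder for the frozen CLIP encoders (no real CLIP model is loaded), `ItemEmbedder` is a hypothetical helper, and joining the title with the LLM text before encoding is an assumption about how the two texts are combined.

```python
import hashlib

import numpy as np

DIM = 8  # toy size; real CLIP embeddings are much larger


def stand_in_encoder(payload: str, dim: int = DIM) -> np.ndarray:
    """Placeholder for a frozen encoder: same input always gives the same unit vector."""
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    vec = np.random.default_rng(seed).standard_normal(dim)
    return vec / np.linalg.norm(vec)


class ItemEmbedder:
    def __init__(self, num_items: int, dim: int = DIM, seed: int = 0):
        rng = np.random.default_rng(seed)
        # E_id: learnable table (randomly initialised here, updated during training).
        self.e_id = rng.standard_normal((num_items, dim)) * 0.02

    def embed(self, item_id: int, title: str, llm_text: str, image_ref: str) -> dict:
        e_img = stand_in_encoder("img:" + image_ref)               # E_img (frozen)
        e_tex = stand_in_encoder("txt:" + title + " " + llm_text)  # E_tex (frozen)
        return {"id": self.e_id[item_id], "img": e_img, "tex": e_tex}


embedder = ItemEmbedder(num_items=3)
item = embedder.embed(1, "Red Dress", "vintage summer party attire", "dress.jpg")
print({name: vec.shape for name, vec in item.items()})
```

Only the ID table would receive gradient updates; the image and text vectors can be computed once and cached because their encoders are frozen.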

C. Hierarchical Attention Mechanism

To address domain imbalance and capture complex dependencies, the model processes three distinct sub-sequences for each user:

$S_X$ : Interactions within Domain X.
$S_Y$ : Interactions within Domain Y.
$S_{X+Y}$ : The merged sequence of interactions from both domains.
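
As a small illustration of the three sub-sequences, the sketch below splits one chronological history by domain. The `split_sequences` helper and the `(item_id, domain)` record layout are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Interaction = Tuple[str, str]  # (item_id, domain), in chronological order


def split_sequences(
    history: List[Interaction], domain_x: str, domain_y: str
) -> Dict[str, List[str]]:
    """Build S_X, S_Y and the merged S_X+Y while preserving interaction order."""
    s_x = [item for item, dom in history if dom == domain_x]
    s_y = [item for item, dom in history if dom == domain_y]
    s_xy = [item for item, dom in history if dom in (domain_x, domain_y)]
    return {"S_X": s_x, "S_Y": s_y, "S_X+Y": s_xy}


history = [("f1", "Food"), ("k1", "Kitchen"), ("f2", "Food"), ("f3", "Food")]
seqs = split_sequences(history, "Food", "Kitchen")
print(seqs)
# {'S_X': ['f1', 'f2', 'f3'], 'S_Y': ['k1'], 'S_X+Y': ['f1', 'k1', 'f2', 'f3']}
```

Here the Kitchen sequence has a single item while Food has three, which is the kind of imbalance the hierarchical attention is meant to handle.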

A hierarchical attention mechanism processes these sequences separately before fusion, which prevents high-frequency domains from dominating the representation. Attention weights are computed from query and key projections and applied to value projections, so the model captures both intra-sequence (local) and inter-sequence (global) dependencies.
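
To make the "separately before fusion" idea concrete, here is a minimal NumPy sketch of single-head causal self-attention applied to each sub-sequence on its own. It is not the authors' architecture: the single head, the shared projection matrices, the random toy inputs, and the use of the last position as the sequence representation are all assumptions.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def causal_self_attention(seq, w_q, w_k, w_v):
    """seq: (T, d) item embeddings. Returns (T, d) contextualised states."""
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask future positions so step t only attends to steps <= t.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


rng = np.random.default_rng(0)
d = 4
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))

s_x = rng.standard_normal((6, d))     # data-rich domain
s_y = rng.standard_normal((2, d))     # sparse domain
s_xy = np.concatenate([s_x, s_y])     # stand-in for the chronologically merged sequence

# Each sub-sequence is encoded on its own; the last state summarises it.
h = {
    name: causal_self_attention(seq, w_q, w_k, w_v)[-1]
    for name, seq in {"X": s_x, "Y": s_y, "X+Y": s_xy}.items()
}
print({name: vec.shape for name, vec in h.items()})
```

Because `s_y` is encoded without ever seeing `s_x`, its representation cannot be swamped by the longer sequence; cross-domain interactions enter only through the merged sequence.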

D. Decision Generation & Loss Optimization

Prediction: The model generates sequence representations ( $h_{id}, h_{img}, h_{tex}$ ) for each sub-sequence. It scores candidate next items by the cosine similarity between each sequence representation and the item embeddings.
Fusion: Final predictions are a weighted sum of ID, Visual, and Textual predictions.
Loss Function: The total loss ( $L$ ) is a weighted combination of losses from Domain X, Domain Y, and the merged domain ( $X+Y$ ), controlled by hyperparameters $\lambda_1$ and $\lambda_2$ to balance domain contributions.
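
The scoring, fusion, and loss steps can be sketched as below. The function names, the fusion weights, and the toy data are hypothetical. In particular, the convex form used in `total_loss` is an assumption: the summary only states that $\lambda_1$ and $\lambda_2$ weight the three losses, so the paper's exact formula may differ.

```python
import numpy as np

MODALITIES = ("id", "img", "tex")


def cosine_scores(h: np.ndarray, item_emb: np.ndarray) -> np.ndarray:
    """Cosine similarity between sequence vector h (d,) and each row of item_emb (N, d)."""
    h_n = h / (np.linalg.norm(h) + 1e-12)
    e_n = item_emb / (np.linalg.norm(item_emb, axis=1, keepdims=True) + 1e-12)
    return e_n @ h_n


def fused_scores(h: dict, emb: dict, weights: dict) -> np.ndarray:
    """Weighted sum of the per-modality cosine scores."""
    return sum(weights[m] * cosine_scores(h[m], emb[m]) for m in MODALITIES)


def total_loss(loss_x, loss_y, loss_xy, lam1, lam2):
    # ASSUMED form: convex combination of the three per-sequence losses.
    return lam1 * loss_x + lam2 * loss_y + (1.0 - lam1 - lam2) * loss_xy


rng = np.random.default_rng(0)
num_items, d = 5, 4
emb = {m: rng.standard_normal((num_items, d)) for m in MODALITIES}
# Toy sequence representations placed near item 2 in every modality.
h = {m: emb[m][2] + 0.1 * rng.standard_normal(d) for m in MODALITIES}
weights = {"id": 0.5, "img": 0.25, "tex": 0.25}  # hypothetical fusion weights

scores = fused_scores(h, emb, weights)
print("recommended item index:", int(np.argmax(scores)))
print("total loss:", total_loss(1.0, 2.0, 3.0, lam1=0.2, lam2=0.3))
```

Raising `lam1` or `lam2` increases the influence of a single domain's loss relative to the merged sequence, which is one way the sparse domain can be kept from being ignored during training.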

3. Key Contributions

Prompt-Driven LLM Augmentation: A novel strategy to generate domain-agnostic textual attributes, enhancing semantic alignment across domains without fine-tuning the LLM.
Unified Multimodal Fusion: According to the authors, the first CDSR framework to systematically unify ID, visual (CLIP), and LLM-enriched textual embeddings within a single architecture.
Domain-Balanced Hierarchical Attention: A mechanism that explicitly regulates the influence of each domain, preventing data-rich domains from overshadowing sparse ones.
State-of-the-Art Performance: Demonstrated superior performance across multiple metrics on real-world e-commerce datasets.

4. Experimental Results

The model was evaluated on two cross-domain scenarios using the Amazon dataset:

Scenarios: "Food-Kitchen" and "Movie-Book".
Baselines: Compared against traditional methods (NCF, GRU4Rec, SASRec), advanced CDSR models (MIFN, Tri-CDR, MAN), and recent LLM-based methods (LLMRec, IFCDSR).

Key Findings:

Performance: LLM-EMF outperformed all baselines.
- Food-Kitchen: Achieved 9.24% MRR (Food) and 5.13% MRR (Kitchen), surpassing the previous best baseline (LLMRec).
- Movie-Book: Achieved 6.32% MRR (Movie) and 2.86% MRR (Book).
Ablation Study:
- Adding Textual Fusion improved MRR by ~0.85 points.
- Adding LLM Enhancement further improved performance, supporting the value of generated context.
- Adding Visual Fusion improved MRR by ~0.8 points, highlighting the importance of visual signals.
- The full model (LLM + Text + Visual) achieved the highest scores, confirming the complementary nature of these modalities.
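
For readers unfamiliar with the metric, MRR (Mean Reciprocal Rank) averages 1/rank of the true next item over all test cases, so a reported 9.24% corresponds to 0.0924. The sketch below shows the standard definition only; the paper's evaluation protocol (candidate sampling, any rank cutoff) is not reproduced, and the helper name is hypothetical.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """ranks: 1-based rank of the ground-truth next item in each test case."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three test users whose true next item was ranked 1st, 2nd and 4th.
print(round(mean_reciprocal_rank([1, 2, 4]), 4))  # 0.5833
```

Because the score is dominated by how often the true item lands near the top, gains of under one point can still reflect a meaningful change in ranking quality.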

5. Significance

This paper addresses critical gaps in current recommendation systems by:

Bridging the Semantic Gap: Using LLMs to create shared semantic spaces between disparate domains (e.g., connecting a movie plot to a book genre via generated text).
Leveraging Multimodality: Showing that models which fuse visual and textual data with ID data outperform ID-only models in sparse cross-domain scenarios.
Solving Imbalance: Providing a robust architectural solution (hierarchical attention) to the common problem of domain dominance in multi-domain learning.
Scalability: The approach uses frozen encoders (CLIP) and prompt-based LLM generation, so it requires no LLM fine-tuning, which keeps training cost down.

In conclusion, LLM-EMF sets a new state of the art for Cross-Domain Sequential Recommendation on the evaluated datasets by combining the generative power of LLMs with the perceptual capabilities of multimodal models, all while maintaining a balanced focus on diverse user interests.