LLM-Enhanced Multimodal Fusion for Cross-Domain Sequential Recommendation

This paper proposes LLM-EMF, a novel cross-domain sequential recommendation framework that leverages frozen CLIP embeddings and a multiple attention mechanism to fuse visual and LLM-enhanced textual data, thereby significantly outperforming existing methods in modeling complex user preferences across diverse e-commerce domains.

Wangyu Wu, Zhenhong Chen, Wenqiao Zhang, Xianglin Qiu, Siqi Song, Xiaowei Huang, Fei Ma, Jimin Xiao

Published 2026-03-02

Imagine you are a personal shopper trying to guess what a customer wants to buy next.

The Old Way: The "Single-Store" Shopper

Traditionally, recommendation systems (like the ones on Amazon or Netflix) act like shoppers who only know one store.

  • If you buy a coffee maker, they suggest more coffee beans.
  • If you watch a sci-fi movie, they suggest another sci-fi movie.

The Problem: This approach has two big flaws:

  1. Data Sparsity: If you only buy one thing, the shopper doesn't know enough about you to make a good guess.
  2. The "Domain Wall": They don't realize that the person who buys high-end kitchen gadgets might also love cooking shows on TV. They treat your "kitchen life" and your "movie life" as two completely separate people.

The New Solution: The "Super-Shopper" (LLM-EMF)

This paper introduces a new system called LLM-EMF. Think of it as upgrading your personal shopper into a Super-Shopper with three superpowers:

1. The "Imagination" Power (LLM Enhancement)

Sometimes, a product label is boring. It just says "Red Dress."

  • Old Way: The system sees "Red Dress" and stops there.
  • LLM-EMF: It asks a Super-Brain (Large Language Model): "Hey, tell me everything interesting about this red dress. Is it for a summer party? Does it look vintage? Who usually wears this?"
  • The Result: The Super-Brain generates a rich story and keywords. Now, instead of just "Red Dress," the system understands it as "Elegant, vintage-style, summer party attire." This helps the system connect the dress to other items like "floral hats" or "sandals," even if they are in different categories.
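
For readers who like to see the idea in code, here is a minimal sketch of the enrichment step. Everything in it is an assumption made for illustration: the prompt wording, the function names build_enrichment_prompt and enrich_item, and the stub_llm stand-in, which returns a canned answer instead of calling a real model. The summary above does not say which LLM or prompt the authors use.

```python
from typing import Callable


def build_enrichment_prompt(title: str, category: str) -> str:
    """Build a prompt asking for style, occasion, and typical buyer.

    The wording is hypothetical, not the prompt used in the paper.
    """
    return (
        f"Item title: {title}\n"
        f"Category: {category}\n"
        "Describe this item in one sentence, then list keywords for its "
        "style, typical occasion, and typical buyer."
    )


def enrich_item(title: str, category: str, llm: Callable[[str], str]) -> str:
    """Append the LLM-generated description to the original title."""
    generated = llm(build_enrichment_prompt(title, category)).strip()
    return f"{title}. {generated}"


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call (assumption): always returns the same text.
    return "Elegant, vintage-style, summer party attire."


enriched = enrich_item("Red Dress", "Clothing", stub_llm)
print(enriched)  # Red Dress. Elegant, vintage-style, summer party attire.
```

The enriched string, rather than the bare title, is what would then be handed to the text encoder.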

2. The "Eagle Eye" Power (Multimodal Fusion)

The Super-Shopper doesn't just read text; they look at the items too.

  • Visuals: They look at the actual photo of the item using CLIP, a pretrained model that maps images and text into a shared embedding space. Its weights are kept frozen, so it is used as-is rather than retrained.
  • Text: They read the description and the LLM's generated story.
  • The ID: They also remember the item's unique barcode.
  • The Magic: The system combines the picture, the story, and the barcode into one super-detailed profile for every item. It's like having a 3D hologram of the product instead of just a flat picture.
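
As a rough sketch of what "combining" can mean, the snippet below normalizes three small vectors (image, text, ID) and concatenates them into one item profile. Concatenation is an assumption chosen for simplicity; this summary does not spell out the paper's actual fusion operator, and the two-dimensional toy vectors stand in for real CLIP and ID embeddings.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so no modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_item(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    id_emb: Sequence[float],
) -> List[float]:
    """Normalize each modality, then concatenate (assumed fusion, not the paper's)."""
    fused: List[float] = []
    for emb in (image_emb, text_emb, id_emb):
        fused.extend(l2_normalize(emb))
    return fused


# Toy vectors standing in for the picture, the story, and the barcode.
image_emb = [3.0, 4.0]
text_emb = [0.0, 2.0]
id_emb = [1.0, 0.0]

item_profile = fuse_item(image_emb, text_emb, id_emb)
print(len(item_profile))  # 6
```

Normalizing first means a modality with large raw values cannot swamp the others before the sequence model ever sees the item.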

3. The "Balanced Scale" Power (Multiple Attention)

Here is the tricky part. Imagine a user who buys 100 books but only 1 kitchen gadget.

  • The Old Problem: A normal system would get obsessed with the books and ignore the kitchen gadget, thinking, "They only care about reading!"
  • The LLM-EMF Fix: This system has a Smart Scale. It looks at your history and says, "Okay, they love books, but that one kitchen gadget is really important to them. Let's not let the books drown out the kitchen item."
  • It carefully balances the influence of different areas of your life so that a small but significant interest isn't lost in the noise of a big interest.
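
The 100-books-versus-1-gadget example can be made concrete with a small sketch. It summarizes each domain on its own and then applies softmax attention over the domain summaries, so a domain counts once no matter how many items it holds. This is only an illustration of the balancing idea under assumed toy vectors and an assumed query; it is not the attention mechanism as defined in the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mean_pool(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def balanced_user_vector(
    history_by_domain: Dict[str, List[List[float]]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool each domain separately, then attend over the domain summaries."""
    summaries = [mean_pool(items) for items in history_by_domain.values()]
    weights = softmax([dot(query, s) for s in summaries])
    user_vec = [
        sum(w * s[i] for w, s in zip(weights, summaries))
        for i in range(len(query))
    ]
    return user_vec, weights


# Toy history: 100 book items and a single kitchen item.
history = {
    "books": [[1.0, 0.0]] * 100,
    "kitchen": [[0.0, 1.0]],
}
query = [1.0, 1.0]  # assumed query that is equally open to both domains

user_vec, weights = balanced_user_vector(history, query)
flat = mean_pool(history["books"] + history["kitchen"])

print(weights)   # [0.5, 0.5]
print(user_vec)  # [0.5, 0.5]
print(flat[1] < user_vec[1])  # True
```

Averaging all 101 items together leaves the kitchen signal at roughly 1% of the profile, while the per-domain version keeps it at half.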

How It Works in Real Life

Let's say you are a user named Alex.

  1. History: You have bought plenty of kitchenware, including a "Cast Iron Skillet" (Kitchen), but watched only a few cooking shows, such as "Chef's Table" (Movie).
  2. The Process:
    • The LLM reads "Cast Iron Skillet" and adds context: "Durable, heavy, great for searing steaks, vintage aesthetic."
    • The Visual Engine sees the skillet's photo and understands it's "rustic and black."
    • The Balanced Scale notices you have a long history of buying kitchenware but a short history of watching cooking shows. It gives extra weight to your cooking show interest so it doesn't get ignored.
  3. The Prediction: The system realizes that because you like "rustic, durable cooking tools" AND "cooking shows," you are likely to buy a high-end chef's knife next.
  4. The Result: It recommends the knife.
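
The last two steps above amount to scoring candidate items against the user's profile and picking the best match. A minimal sketch, assuming dot-product scoring and made-up two-dimensional vectors whose axes loosely mean "rustic cookware" and "cooking shows" (neither the numbers nor the scoring function come from the paper):

```python
from typing import Dict, List, Sequence, Tuple


def rank_candidates(
    user_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each candidate by dot product with the user vector, best first."""
    scored = [
        (name, sum(u * v for u, v in zip(user_vec, vec)))
        for name, vec in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative values only: [rustic cookware, cooking shows].
alex = [0.7, 0.6]
candidates = {
    "chef's knife": [0.9, 0.8],
    "sci-fi novel": [0.1, 0.0],
    "coffee beans": [0.5, 0.2],
}

ranking = rank_candidates(alex, candidates)
print(ranking[0][0])  # chef's knife
```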

Why This Paper Matters

Previous systems were like specialists who each knew only one store well. This new system is a generalist with a photographic memory and a creative writer's imagination.

  • It connects the dots between different parts of your life (Cross-Domain).
  • It understands the vibe of an item, not just its name (Multimodal).
  • It listens to your small interests as much as your big ones (Balanced Attention).

The authors tested this on real-world data from paired domains (such as Food and Kitchen, or Movies and Books) and report that this "Super-Shopper" was better at guessing what you want next than the existing methods it was compared against. It's a smarter, more human-like way to recommend things.
