MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation

This paper proposes Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that generates semantic IDs by leveraging a multi-expert architecture to balance cross-modal synergy and specificity, followed by behavior-aware fine-tuning to align semantic representations with user preferences, thereby enhancing scalability and generalization in recommender systems.

Yi Xu, Moyu Zhang, Chenxuan Li, Zhihao Liao, Haibo Xing, Hao Deng, Jinxin Hu, Yu Zhang, Xiaoyi Zeng, Jing Zhang

Published 2026-03-03

Imagine you are walking through a massive, chaotic library that never stops growing. Every day, millions of new books arrive, and the old ones change their covers.

The Problem: The "Name Tag" System
Traditionally, recommendation systems (like those on Amazon or TikTok) treat every item like a book with a unique, meaningless barcode (an ID).

  • The Issue: If a new book arrives, the system doesn't know what it is; it just sees a new barcode. If a book is very popular, the system learns its barcode well. But if a book is rare (a "long-tail" item), the system barely knows it exists.
  • The Result: The system struggles to recommend new or rare items because it's just memorizing barcodes, not understanding what the items actually are.

The Solution: The "Semantic ID" (The Book Summary)
Instead of barcodes, researchers want to give every item a "Semantic ID"—a short description or a summary of its content (like "Summer Beach Dress" or "Spicy Vegan Curry"). This way, if you like "Summer Beach Dress," the system can recommend a new dress with a similar summary, even if it's never seen it before.
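Under the hood, a "Semantic ID" is usually a short sequence of discrete tokens produced by quantizing an item's content embedding against learned codebooks. The paper does not spell out its exact quantizer here, but a minimal residual-quantization sketch (all names and sizes hypothetical) conveys the idea:

```python
import numpy as np

def semantic_id(embedding, codebooks):
    """Quantize an item embedding into a short 'semantic ID' by
    matching the running residual against each codebook level."""
    ids, residual = [], embedding.astype(float)
    for codebook in codebooks:               # one codebook level per ID token
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))          # nearest codeword at this level
        ids.append(idx)
        residual = residual - codebook[idx]  # quantize whatever is left over
    return ids

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]  # 3 levels, 256 codes each
item = rng.normal(size=64)                                  # a toy content embedding
print(semantic_id(item, codebooks))                         # a 3-token semantic ID
```

Because similar embeddings land on similar token sequences, a brand-new "Summer Beach Dress" gets an ID close to the dresses the system already knows, with no click history required.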

The Challenge: The "Translation" Problem
However, creating these summaries is tricky. Items have different "languages":

  1. Text: The title and description.
  2. Images: The photo of the item.

Sometimes, the text and the image tell the same story (Synergy). Sometimes, they tell different, unique parts of the story (Uniqueness).

  • Example: A picture of a dress might look "casual," but the text says "formal evening wear."
  • The Mistake: Old systems either mashed them together into one blurry mess (losing the unique details) or kept them completely separate (missing the connection between the two).

The New System: MMQ (The "Expert Team" Approach)
The paper introduces MMQ (Multimodal Mixture-of-Quantization). Think of MMQ as a team of specialized translators working together to write the perfect summary for every item.

How MMQ Works (The Two-Stage Process)

Stage 1: The "Expert Team" Training (The Library Catalogers)

Imagine a team of librarians trying to catalog a new book.

  • The Specialists: Some librarians only look at the text, some only look at the pictures, and some look at both.
  • The Magic: MMQ uses a "Mixture of Experts."
    • Modality-Specific Experts: These librarians focus on the unique details (e.g., "This text mentions 'vegan,' which the photo doesn't show").
    • Modality-Shared Experts: These librarians find the common ground (e.g., "Both the text and photo suggest 'summer'").
  • The Rule: To prevent the librarians from all saying the same thing (which would be boring and useless), the system forces them to be different from each other. It's like telling them, "You must find a unique angle!" This ensures the final summary is rich and complete.

Stage 2: The "Behavior-Aware" Tuning (The Sales Floor)

Here is the clever part. Just because a book has a great summary doesn't mean it will sell.

  • The Gap: A book might be described as "Dark and Gritty" (Semantic), but users might actually buy it because they like "Fast Paced Action" (Behavior). The description doesn't always match what people do.
  • The Fix: MMQ doesn't stop at writing the summary. It goes to the "Sales Floor" (the recommendation system) and watches what people actually click on.
  • The Adjustment: It gently tweaks the summaries based on real user behavior. If users who click on "Summer Dresses" also tend to click on "Sandals," the system adjusts the "Summer Dress" summary to subtly include that connection. It bridges the gap between "what the item is" and "what people want."
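The adjustment step can be pictured as a small update that pulls an item's semantic embedding toward the items it is co-clicked with, without throwing away its content. This is a deliberately simplified sketch (the function name, the centroid target, and the step size are all assumptions), not MMQ's actual fine-tuning objective:

```python
import numpy as np

def behavior_tune(item_emb, co_clicked_embs, lr=0.1):
    """Nudge an item's semantic embedding toward the centroid of items
    that users clicked in the same sessions. A small learning rate keeps
    the content-based meaning mostly intact."""
    target = np.mean(co_clicked_embs, axis=0)
    return item_emb + lr * (target - item_emb)  # small step toward behavior

dress   = np.array([1.0, 0.0])                        # "Summer Dress" embedding
sandals = [np.array([0.8, 0.6]), np.array([0.7, 0.5])]  # co-clicked "Sandals"
tuned   = behavior_tune(dress, sandals)
```

After the update, `tuned` sits slightly closer to the sandals cluster than `dress` did, which is exactly the "bridge" between what the item is and what people actually do with it.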

Why This Matters (The Results)

  1. Better for New Items: Because the system understands the meaning (the summary) rather than just the barcode, it can instantly recommend new items that fit your taste, even if the system has never seen them before.
  2. Better for Rare Items: It solves the "long-tail" problem. Rare items get a good summary based on their content, so they don't get ignored just because they don't have many clicks yet.
  3. Real-World Success: The team tested this on a massive e-commerce platform. The result?
    • More money made from ads.
    • More people buying things (Conversion Rate went up by over 4%).
    • More orders placed.

The Big Picture Analogy

Think of the old system as a phone book. You have to know the exact name (ID) to find someone. If you don't know the name, you can't find them.

MMQ is like a smart personal assistant.

  • It doesn't just look at the name; it reads your resume (text) and looks at your photo (image).
  • It has a team of experts to understand your unique skills and your common traits.
  • It watches who you actually talk to (behavior) and adjusts its understanding of you.
  • Now, when you ask for a job, it doesn't just match your name; it matches your vibe and skills to the perfect opportunity, even if you've never applied there before.

In short, MMQ turns a chaotic library of barcodes into a smart, understanding system that knows exactly what you want, even before you do.
