Cross-Representation Knowledge Transfer for Improved Sequential Recommendations

Imagine you are trying to guess what your friend wants to eat for dinner tonight. To do this, you have two different ways of thinking about the problem:

The "Chronological Storyteller" (Sequential Model): You look at what they ate yesterday, the day before, and the day before that. You notice a pattern: "Oh, they had pizza on Tuesday, pasta on Wednesday, so maybe they want a salad today?" This method is great at spotting immediate habits and trends, but it treats every meal as a separate event in a line. It doesn't really know that "Pizza" and "Burgers" are both "fast food" or that "Salad" and "Soup" are both "light lunches." It misses the big picture connections between the food items themselves.
The "Social Networker" (Graph Model): You look at a giant map of everyone's eating habits. You see that people who like Pizza also tend to like Burgers, and people who like Sushi often like Sashimi. This method is amazing at understanding how items are related to each other (the "global" context). However, it's terrible at timing. It doesn't know if your friend ate Pizza before or after the Burger; it just sees they are connected. It might suggest a dessert when your friend just finished a heavy meal because it sees the link, ignoring the sequence.

The Problem

For a long time, recommendation systems (like those on Netflix, Amazon, or Spotify) had to choose one of these two approaches.

If they chose the Storyteller, they were good at predicting the next step but missed the deeper connections between items.
If they chose the Social Networker, they understood the relationships well but got confused about the order of events.

Existing attempts to combine them were like trying to glue two different languages together without a translator. The results were often clunky, and the system would get confused about which "voice" to listen to.

The Solution: CREATE (The "Super-Translator")

The authors of this paper created a new framework called CREATE, which transfers knowledge between the two kinds of representation. Think of it as hiring a Super-Translator who sits between the Storyteller and the Social Networker.

Here is how it works, using a simple analogy:

1. The Two Experts (The Encoders)

The system runs two experts simultaneously:

Expert A (Sequential): Watches the user's history like a movie, focusing on the plot (what happened first, second, third).
Expert B (Graph): Looks at a giant spiderweb of connections, focusing on how all the characters (items) relate to one another.

2. The "Warm-Up" (Training the Expert)

Before the two experts start talking to each other, the system gives Expert B (the Graph Network) a warm-up session.

Analogy: Imagine you are teaching a new employee (the Graph model) about the company culture. You let them study the employee handbook and meet everyone before they start working with the veteran employee (the Sequential model). This ensures the new employee doesn't give bad advice that confuses the veteran. This step is crucial because the Graph model needs to learn the "map" of connections before it tries to influence the "story."

3. The "Handshake" (Representation Alignment)

This is the magic part. Usually, when you combine two experts, they speak different "languages." One might describe a user as "someone who likes action movies," while the other says "a person who watches on Friday nights." They are talking about the same person but using different codes.

The authors use a technique called Barlow Twins (named after neuroscientist Horace Barlow, whose redundancy-reduction principle inspired it; the "twins" are the two networks that must agree on a secret code).

Analogy: Imagine the two experts are twins who need to agree on a secret handshake. The system forces them to align their "handshakes" (their internal math) so that when they describe the same user, they are essentially saying the exact same thing, just from different angles.
Crucially, this handshake isn't just about agreeing; it also involves redundancy reduction: each part of the shared code is pushed to carry its own piece of information rather than repeating what another part already says. Through this alignment, Expert A's sense of timing and Expert B's sense of relationships flow into one another, so each fills in the other's gaps.

Why is this better?

No "Folding In" Needed: In old systems, if a new user joined, the system had to stop and recalculate their entire profile from scratch (like re-writing a whole book). CREATE is smart enough to just look at the new items the user interacted with and instantly update the recommendation without a massive overhaul.
Better Accuracy: By combining the "what happened next" (Storyteller) with "what is related to what" (Social Networker), the system predicts what you want with much higher accuracy.
Real-World Ready: The authors tested this on massive real-world data (like Amazon products and Yandex Music) and found it consistently beat the best existing systems.

The Bottom Line

The CREATE framework is like hiring a team where one person is an expert on timing and another is an expert on connections, and then forcing them to hold hands and speak the same language. The result is a recommendation system that doesn't just guess what you'll click next; it understands why you'll click it, based on both your recent habits and the hidden web of relationships between the things you love.

1. Problem Statement

Sequential recommendation systems aim to predict the next item a user will interact with based on their historical interaction sequence.
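
As a minimal sketch of the task setup, the snippet below turns one interaction history into (context, target) pairs, where each target is the item that follows its context. The function name `next_item_examples` and the `max_len` truncation are illustrative assumptions, not details taken from the paper.

```python
def next_item_examples(sequence, max_len=3):
    """Turn one interaction history into (context, target) pairs.

    `max_len` (hypothetical parameter) caps how many recent items
    are kept in each context window.
    """
    examples = []
    for t in range(1, len(sequence)):
        # Items seen strictly before step t, truncated to the most recent ones.
        context = sequence[max(0, t - max_len):t]
        examples.append((context, sequence[t]))
    return examples


history = ["pizza", "pasta", "salad", "soup"]
pairs = next_item_examples(history)
for context, target in pairs:
    print(context, "->", target)
```

A sequential model is trained to assign a high score to each target given its context; at evaluation time only the final held-out item is usually scored.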

Limitations of Current Approaches:
- Sequential Models (e.g., SASRec, BERT4Rec): These utilize Transformer architectures to capture local, temporal dependencies within a user's interaction history. However, they often treat sequences in isolation, failing to explicitly model complex, global relationships between items and users (e.g., item-item similarity or transitive dependencies) that exist outside the specific sequence order.
- Graph Models (e.g., LightGCN, UltraGCN): These explicitly model global user-item interactions via graph structures, capturing high-order relationships. However, they often struggle to capture the temporal evolution of user preferences, leading to suboptimal performance in next-item prediction tasks compared to state-of-the-art (SOTA) sequential models.
The Gap: Existing hybrid approaches that combine these two paradigms often suffer from poor representation fusion, reliance on "folding-in" procedures to update user embeddings for new interactions (which is computationally expensive and unstable), and the use of contrastive losses that require negative sampling.

2. Methodology: The CREATE Framework

The authors propose CREATE (Cross-REpresentation Aligned Transfer Encoders), a framework that integrates Transformer-based sequential encoders with Graph Neural Networks (GNNs) through a novel alignment mechanism.

Core Architecture

Shared Embedding Layer: Users and items are mapped to a shared $d$-dimensional latent space. Positional encodings are added for the sequential component.
Dual Encoder System:
- Sequential Encoder (Local): Uses a Transformer backbone (SASRec or BERT4Rec) to model the user's immediate interaction history and temporal dynamics. It outputs a user state vector ($h_u$) used for prediction.
- Graph Encoder (Global): Uses a GNN backbone (LightGCN or UltraGCN) to model the global user-item interaction graph. It captures structural dependencies and item-item relationships.
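
To make the graph side concrete, here is a small NumPy sketch of LightGCN-style propagation: embeddings are repeatedly smoothed over the symmetrically normalized user-item graph and the layer outputs are averaged. This illustrates the general LightGCN idea under simplifying assumptions (dense matrices, uniform layer weights); it is not the CREATE implementation, and the function name is hypothetical.

```python
import numpy as np


def lightgcn_propagate(interactions, user_emb, item_emb, num_layers=2):
    """LightGCN-style propagation on a dense user-item interaction matrix.

    interactions: (num_users, num_items) binary matrix.
    Returns smoothed (user_embeddings, item_embeddings).
    """
    num_users, num_items = interactions.shape
    n = num_users + num_items

    # Bipartite adjacency: users connect to items and vice versa.
    adj = np.zeros((n, n))
    adj[:num_users, num_users:] = interactions
    adj[num_users:, :num_users] = interactions.T

    # Symmetric normalization D^{-1/2} A D^{-1/2}; isolated nodes keep 0.
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    norm_adj = inv_sqrt[:, None] * adj * inv_sqrt[None, :]

    # Propagate without feature transforms or nonlinearities.
    layer = np.vstack([user_emb, item_emb])
    layers = [layer]
    for _ in range(num_layers):
        layer = norm_adj @ layer
        layers.append(layer)

    # Final embedding: uniform average over all layers (including layer 0).
    final = np.mean(layers, axis=0)
    return final[:num_users], final[num_users:]
```

In practice the adjacency would be stored as a sparse matrix; the dense form is used here only to keep the sketch short.
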
Asymmetric Inference: A key design choice is that user embeddings are only used during training. During inference, the system relies solely on the sequential encoder enriched by the graph encoder's item embeddings. This eliminates the need for "folding-in" (updating user vectors for new interactions on the fly), making the system robust to out-of-training users and asymmetric graph structures.
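
The practical consequence of asymmetric inference can be sketched as follows: scoring needs only the item IDs in a user's history, never a stored user vector. In this sketch `item_table` stands for the item embeddings available after training, and `encode_sequence` is a deliberately crude placeholder (a recency-weighted average) for the Transformer encoder. Both names, the decay weighting, and the masking of already-seen items are assumptions made for illustration.

```python
import numpy as np


def encode_sequence(item_ids, item_table, decay=0.5):
    """Placeholder for the sequential encoder: recency-weighted mean.

    A real system would run a Transformer here; only the interface
    (item IDs in, one state vector h_u out) is the point of the sketch.
    """
    vectors = item_table[item_ids]
    # The most recent item gets weight 1, older items decay geometrically.
    weights = decay ** np.arange(len(item_ids) - 1, -1, -1)
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()


def recommend(item_ids, item_table, k=2):
    """Score every item against h_u; no user embedding is looked up."""
    h_u = encode_sequence(item_ids, item_table)
    scores = item_table @ h_u
    scores[item_ids] = -np.inf  # assumption: do not re-recommend seen items
    return np.argsort(-scores)[:k]


item_table = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(recommend([0], item_table, k=2))
```

Because the function takes only an interaction history, a user who never appeared during training can be served the same way as any other.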

Key Technical Innovations

Representation Alignment via Barlow Twins:
- Instead of using contrastive loss (which requires negative sampling and can be unstable), CREATE uses the Barlow Twins objective to align the sequential (local) and graph (global) representations.
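- Mechanism: It computes the cross-correlation matrix between the two views (sequential and graph) of the same user and drives that matrix toward the identity. The loss pushes the diagonal elements to 1 (invariance: the two views agree on each feature) and the off-diagonal elements to 0 (redundancy reduction: different features are decorrelated). This encourages the two encoders to learn consistent representations whose dimensions carry non-redundant information.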
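
The alignment objective described above can be sketched in a few lines of NumPy. This follows the standard Barlow Twins formulation (sum of squared deviations of the diagonal from 1, plus a weighted sum of squared off-diagonal entries). The argument names and the default `lambda_offdiag` are assumptions; the weighting used in CREATE is not stated here.

```python
import numpy as np


def barlow_twins_loss(z_seq, z_graph, lambda_offdiag=5e-3, eps=1e-8):
    """Barlow Twins loss between two (batch, d) views of the same users.

    z_seq:   embeddings from the sequential encoder.
    z_graph: embeddings from the graph encoder.
    """
    batch_size = z_seq.shape[0]

    # Standardize every feature dimension across the batch.
    a = (z_seq - z_seq.mean(axis=0)) / (z_seq.std(axis=0) + eps)
    b = (z_graph - z_graph.mean(axis=0)) / (z_graph.std(axis=0) + eps)

    # (d, d) cross-correlation matrix between the two views.
    c = a.T @ b / batch_size

    diag = np.diag(c)
    on_diag = ((diag - 1.0) ** 2).sum()          # invariance term
    off_diag = (c ** 2).sum() - (diag ** 2).sum()  # redundancy-reduction term
    return on_diag + lambda_offdiag * off_diag


# Two decorrelated features: identical views give (near) zero loss.
z = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
print(barlow_twins_loss(z, z), barlow_twins_loss(z, -z))
```

No negative samples appear anywhere in the computation, which is the property the summary contrasts with contrastive losses.
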
Two-Phase Training Strategy (Warm-up):
- Phase 1 (Warm-up): The graph encoder is pre-trained independently for $N_{warmup}$ epochs. This ensures the graph encoder produces high-quality item embeddings before they are introduced to the sequential model.
- Phase 2 (Joint Optimization): Both encoders are trained end-to-end. The total loss is a weighted sum:
  $L = L_{local} + w_{global}L_{global} + w_{BT}L_{BT}$
  where $L_{local}$ is the sequential cross-entropy, $L_{global}$ is the graph loss (e.g., BPR or UltraGCN loss), $L_{BT}$ is the Barlow Twins alignment loss, and $w_{global}$ and $w_{BT}$ are scalar weights.
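
The two-phase schedule can be summarized as a function of the epoch index. The sketch below assumes that only the graph loss is optimized during warm-up and that the weighted sum applies afterwards; the function name and the default values of `n_warmup`, `w_global`, and `w_bt` are placeholders, not values reported by the authors.

```python
def total_loss(epoch, l_local, l_global, l_bt,
               n_warmup=10, w_global=0.1, w_bt=0.01):
    """Combine per-batch loss values according to the training phase.

    Phase 1 (epoch < n_warmup): graph encoder alone, so only l_global.
    Phase 2: L = L_local + w_global * L_global + w_BT * L_BT.
    """
    if epoch < n_warmup:
        return l_global
    return l_local + w_global * l_global + w_bt * l_bt


for epoch in (0, 9, 10):
    print(epoch, total_loss(epoch, l_local=2.0, l_global=1.0, l_bt=4.0))
```

In an autodiff framework the same branching would decide which encoders receive gradients in each phase.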

3. Key Contributions

Novel Framework: Proposes a unified framework that effectively combines the temporal modeling of Transformers with the structural modeling of GNNs for next-item prediction.
Redundancy Reduction: Applies the Barlow Twins objective to representation alignment; in the reported experiments it outperforms traditional contrastive losses while avoiding negative sampling and reducing redundancy among embedding features.
Asymmetric Inference Design: Eliminates the need for user embeddings during inference, removing the computational bottleneck of "folding-in" procedures and improving robustness to new user interactions.
Training Protocol: Develops a warm-up strategy that stabilizes the training of multi-representational models, ensuring the graph component provides relevant knowledge without destabilizing the sequential backbone.

4. Experimental Results

The framework was evaluated on five datasets: MovieLens-1M, Amazon (Clothing, Sports, Beauty), and Yambda-50M (a large-scale music dataset).

Performance Gains:
- CREATE consistently outperformed pure sequential baselines (SASRec, BERT4Rec), pure graph baselines (LightGCN, UltraGCN), and recent multi-representational models (LOOM, MRGSRec, GSAU).
- Notable Improvement: On the Yambda-50M dataset, CREATE achieved a +38% improvement in NDCG@10 and +26% in NDCG@100 over the strong SASRec baseline.
- On Amazon Beauty, it showed a +15% improvement in NDCG@10.
Ablation Studies:
- Warm-up: Pre-training the graph encoder significantly improved final performance, with optimal warm-up epochs varying by dataset complexity (e.g., 50 epochs for Beauty, 10 for Sports).
- Alignment: The Barlow Twins alignment method yielded superior recommendation quality compared to contrastive loss or no alignment, though it slightly reduced coverage metrics (suggesting a focus on precision over breadth).
- Graph Size: The best-performing graphs were built from a portion of the historical interactions (e.g., 40-100%), balancing structural richness against noise.
Coverage vs. Accuracy: While some alignment methods reduced coverage, CREATE maintained a favorable accuracy-coverage trade-off, significantly outperforming random and popularity-based baselines in ranking quality.
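
For readers unfamiliar with the metrics quoted above, the following sketch shows how NDCG@k is commonly computed when each user has a single held-out target item, along with one common definition of catalog coverage. The exact evaluation protocol and coverage definition used in the paper are not given in this summary, so both functions should be read as assumptions.

```python
import math


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k with one relevant item.

    The ideal DCG is 1, so the score reduces to 1 / log2(rank + 1)
    when the target is in the top k, and 0 otherwise.
    """
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1  # 1-based position
    return 1.0 / math.log2(rank + 1)


def coverage_at_k(all_rankings, num_items, k=10):
    """Share of the catalog that appears in at least one top-k list."""
    recommended = set()
    for ranking in all_rankings:
        recommended.update(ranking[:k])
    return len(recommended) / num_items


print(ndcg_at_k([5, 3, 9], target=9))
print(coverage_at_k([[1, 2], [2, 3]], num_items=10, k=2))
```

A model can raise NDCG while lowering coverage if it concentrates its top-k lists on a smaller set of items, which is the trade-off the ablation notes.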

5. Significance

The paper addresses a critical bottleneck in modern recommender systems: the inability to simultaneously model temporal dynamics and global structural relationships effectively.

Industrial Applicability: By removing the need for user embeddings during inference, CREATE is highly suitable for real-world industrial deployment where user activity is continuous and dynamic.
Methodological Advancement: The successful application of Barlow Twins for representation alignment in recommendation tasks offers a new direction for multi-view learning, moving away from the limitations of contrastive learning.
State-of-the-Art Performance: The results demonstrate that fusing local and global signals through careful alignment and training protocols can significantly boost recommendation quality, setting a new benchmark for sequential recommendation tasks.

The code for the framework is publicly available, facilitating further research and adoption.
