Imagine the world of Recommender Systems (the algorithms behind Netflix, Amazon, and Spotify) as a massive, high-stakes cooking competition. Every year, chefs (researchers) present their new "secret sauce" recipes at a prestigious event called SIGIR. They claim their new dish is the most delicious, efficient, and revolutionary thing ever created.
This paper is essentially a group of food critics (the authors) who decided to go into the kitchen, taste the dishes, and check if the recipes actually work as described. They focused on the "Graph-based" recipes presented at the 2022 event.
Here is what they found, explained simply:
1. The "Recipe" vs. The "Real Dish" (Artifact Consistency)
When a chef submits a recipe, they also hand over their ingredients and the exact steps they took. The critics checked if the ingredients in the box matched the list on the card.
- The Problem: In many cases, the ingredients didn't match. Sometimes, the "training" ingredients (what the chef practiced with) accidentally included the "test" ingredients (the final exam).
- The Analogy: Imagine a student taking a math test but having the answer key taped to their desk. They get a perfect score, but they didn't actually learn the math. In the paper, this is called information leakage. The models looked brilliant because they were cheating, not because they were smart.
- The Result: About half of the papers had "broken" data splits. The results were like a house of cards; they looked good until you tried to build on them.
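To make "information leakage" concrete, here is a minimal sketch (not code from the paper) of the kind of sanity check the critics performed: a correct split should contain no (user, item) interaction that appears in both the training and test sets.

```python
# Hedged sketch: detect train/test leakage in a recommender dataset.
# The function name and toy data are illustrative, not from the paper.

def find_leaked_interactions(train, test):
    """Return (user, item) pairs present in both splits.

    `train` and `test` are iterables of (user_id, item_id) tuples;
    a correctly built split should yield an empty list.
    """
    train_pairs = set(train)
    return sorted(pair for pair in set(test) if pair in train_pairs)

train = [("u1", "i1"), ("u1", "i2"), ("u2", "i3")]
test = [("u1", "i2"), ("u2", "i4")]  # ("u1", "i2") leaked from train
print(find_leaked_interactions(train, test))  # → [('u1', 'i2')]
```

A model evaluated on the leaked pair gets it "right" for free, which is exactly the answer-key-taped-to-the-desk problem described above.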
2. The "Magic Wand" Illusion (Reproducibility)
The critics tried to cook the dishes themselves using the provided recipes to see if they could get the same taste.
- The Problem: They couldn't. In many cases, the dish tasted completely different, or the kitchen exploded (the code crashed).
- The Analogy: It's like buying a "Do-It-Yourself" furniture kit where the instructions are in a language you don't speak, the screws are missing, and the diagram shows a chair but the box contains a table. When the critics tried to build it, they ended up with a wobbly stool that didn't match the picture.
- The Result: They could only successfully reproduce about half of the results. For some papers, the results were impossible to replicate at all.
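One small, necessary (though far from sufficient) ingredient of a reproducible recipe is pinning every source of randomness, so two runs with the same seed produce the same numbers. A minimal sketch, assuming plain Python and NumPy randomness (the paper found deeper problems too, such as missing code and crashes):

```python
import random
import numpy as np

# Hedged sketch: fix all random seeds so repeated runs are identical.
# The helper name is illustrative, not from the paper.

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
assert np.array_equal(a, b)  # identical seeds, identical results
```

Seeding alone does not guarantee reproducibility (library versions, hardware, and undocumented preprocessing all matter), but its absence guarantees irreproducibility.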
3. The "Weak Opponent" Trap (Baselines)
In a cooking competition, you judge a new dish by comparing it to the old classics. If your new soup is better than the old one, you win.
- The Problem: The researchers often compared their new, complex "Graph Neural Network" soups against "baselines" (the old classics) that were undercooked, burnt, or made with the wrong ingredients.
- The Analogy: Imagine a new, fancy robot chef enters the competition. Instead of comparing it to a human master chef, they compare it to a toaster that can barely make toast. The robot looks like a genius because it beat the toaster. But when you compare the robot to a real human chef, the robot is actually terrible.
- The Result: On a popular dataset called "Amazon-Book," the fancy new graph models were actually worse than simple, old-school methods (like ItemKNN). The researchers claimed "State-of-the-Art!" but they were just beating a weak opponent.
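For intuition about how simple the "old-school" competitor is, here is a hedged sketch of an ItemKNN-style scorer (not the paper's implementation): it recommends items whose interaction patterns are cosine-similar to the items a user has already consumed.

```python
import numpy as np

# Hedged sketch of an ItemKNN-style baseline on a binary
# user x item interaction matrix; toy data is illustrative.

def itemknn_scores(interactions, user):
    """Score all items for `user` via item-item cosine similarity."""
    norms = np.linalg.norm(interactions, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for cold items
    sim = (interactions.T @ interactions) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)  # an item shouldn't vouch for itself
    # Aggregate similarity from the items the user has interacted with.
    return interactions[user] @ sim

R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
scores = itemknn_scores(R, user=0)
print(scores)  # items 2 and 3 outscore the user's own items
```

In practice one would also mask items the user has already seen and keep only the top-k neighbors per item, but even this bare version is the kind of well-tuned classic that, per the paper, the graph models failed to beat on Amazon-Book.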
4. The "Copy-Paste" Epidemic (Impact on Future Research)
The critics looked at what happened in 2023. Did the next generation of chefs learn from the mistakes?
- The Problem: Mostly, no. Many new papers used the old, broken recipes as a starting point. Because the original recipes were flawed, the new ones were built on shaky ground.
- The Analogy: It's like a game of "Telephone": the first person whispers a wrong instruction, and by the end everyone has heard something different. The new researchers ended up comparing apples to oranges, because everyone was using different measuring cups and different definitions of "delicious."
- The Result: It became almost impossible to compare the new papers with the old ones. The field was spinning its wheels, creating a lot of noise but not much progress.
The Big Takeaway
The paper concludes that the field of Recommender Systems is suffering from a Reproducibility Crisis.
- Too much hype: Researchers are rushing to publish complex models that look good on paper but fail in the real world.
- Bad habits: They are using "cheating" data splits and comparing themselves to weak opponents to make their work look better than it is.
- The Cost: This wastes time and money. Other scientists can't build on these results because the foundation is cracked.
The Solution? The authors suggest we need to stop chasing "fancy" metrics and start being honest. We need:
- Better Recipes: Clear, documented code and data that anyone can use.
- Honest Comparisons: Test new ideas against strong, well-tuned opponents, not weak ones.
- Admitting Failure: It's okay to say, "This recipe didn't work on this specific ingredient." That is real science.
In short: The field is full of "magic tricks" that don't actually work when you look behind the curtain. It's time to put the magic away and get back to solid, reliable cooking.