Abstract
Large Language Models (LLMs) have shown promising effectiveness in recommender systems. RecRanker, a recent LLM-based recommendation model, has demonstrated strong results on the top-$k$ recommendation task. However, the contribution of each of its core components, namely user sampling, initial ranking list generation, prompt construction, and instruction tuning, remains underexplored. In this work, we examine the reproducibility of RecRanker and study the impact of its various components on recommendation performance. We begin by reproducing RecRanker's pipeline through an implementation of all its key components. Our reproduction shows that the pairwise and listwise instruction tuning methods achieve performance comparable to that reported in the original paper. For the pointwise method, while we are also able to reproduce the original paper's results, further analysis shows that the abnormally high performance is due to data leakage from the inclusion of ground-truth information in the prompts. To enable a fair and comprehensive evaluation of LLM-based top-$k$ recommendation, we propose RecRankerEval, an extensible framework that covers five key dimensions: user sampling strategy, initial recommendation model, LLM backbone, dataset selection, and instruction tuning method. Using the RecRankerEval framework, we show that the original results of RecRanker can be reproduced on the ML-100K and ML-1M datasets, as well as on an additional Amazon-Music dataset, but not on BookCrossing, due to the lack of timestamp information in that dataset, an issue not addressed in the original RecRanker paper. Furthermore, we demonstrate that RecRanker's performance can be improved by employing alternative user sampling methods (e.g., DBSCAN), stronger initial recommenders (e.g., XSimGCL), and more capable LLMs (e.g., Llama3).