In this project, we'll use the Goodreads dataset collected by
Mengting Wan, Julian McAuley, "Item Recommendation on Monotonic Behavior Chains", RecSys 2018.
On Dumbo's HDFS, you will find the following files in hdfs:/user/bm106/pub/goodreads:
goodreads_interactions.csv
user_id_map.csv
book_id_map.csv
The first file contains tuples of user-book interactions. For example, the first five lines are
user_id,book_id,is_read,rating,is_reviewed
0,948,1,5,0
0,947,1,5,1
0,946,1,5,0
0,945,1,5,0
The other two files map the numerical user and book identifiers used in the interactions file to the alphanumeric strings used in supplementary data (see below). Overall, there are 876K users, 2.4M books, and 223M interactions.
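To make the interactions format concrete, here is a minimal sketch that parses the sample lines shown above into typed records using only the standard library. The `Interaction` type and `parse_interactions` helper are hypothetical names, and treating `is_read` and `is_reviewed` as 0/1 flags is an assumption based on the sample rows.

```python
import csv
import io
from typing import NamedTuple


class Interaction(NamedTuple):
    user_id: int
    book_id: int
    is_read: bool      # assumed to be a 0/1 flag
    rating: int
    is_reviewed: bool  # assumed to be a 0/1 flag


def parse_interactions(lines):
    """Yield typed Interaction records from CSV lines that begin with a header."""
    for row in csv.DictReader(lines):
        yield Interaction(
            user_id=int(row["user_id"]),
            book_id=int(row["book_id"]),
            is_read=row["is_read"] == "1",
            rating=int(row["rating"]),
            is_reviewed=row["is_reviewed"] == "1",
        )


# The five sample lines, used as an in-memory stand-in for the real file.
sample = io.StringIO(
    "user_id,book_id,is_read,rating,is_reviewed\n"
    "0,948,1,5,0\n"
    "0,947,1,5,1\n"
    "0,946,1,5,0\n"
    "0,945,1,5,0\n"
)
interactions = list(parse_interactions(sample))
```

At 223M interactions, the full file is better handled in Spark; a parser like this is mainly useful for inspecting small extracts locally.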
Your recommendation model should use Spark's alternating least squares (ALS) method to learn latent factor representations for users and items. Be sure to thoroughly read through the documentation on the pyspark.ml.recommendation module before getting started.
This model has some hyper-parameters that you should tune to optimize performance on the validation set, notably:
- the rank (dimension) of the latent factors, and
- the regularization parameter lambda.
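As a rough sketch of how these two hyper-parameters map onto Spark's API: `rank` and `regParam` are the corresponding arguments of `pyspark.ml.recommendation.ALS`. The candidate values, the helper names `param_grid` and `fit_als`, and the choice of `rating` as the rating column are all assumptions made for illustration, not recommended settings.

```python
from itertools import product

# Hypothetical candidate values; the right ranges depend on your data.
RANKS = [10, 50, 100]
REG_PARAMS = [0.01, 0.1, 1.0]


def param_grid(ranks=RANKS, reg_params=REG_PARAMS):
    """Return every (rank, regParam) pair to try."""
    return list(product(ranks, reg_params))


def fit_als(train_df, rank, reg_param, seed=42):
    """Fit one ALS model on a Spark DataFrame of interactions.

    Assumes train_df has the columns user_id, book_id, and rating.
    """
    # Imported lazily so that param_grid works without Spark installed.
    from pyspark.ml.recommendation import ALS

    als = ALS(
        rank=rank,
        regParam=reg_param,
        userCol="user_id",
        itemCol="book_id",
        ratingCol="rating",
        coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
        seed=seed,
    )
    return als.fit(train_df)
```

A tuning loop would call `fit_als` once per pair returned by `param_grid()` and keep the model with the best validation score.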
You will need to construct train, validation, and test splits of the data. It's a good idea to do this first (using a fixed random seed) and save the results, so that your validation scores are comparable across runs.
Data splitting for recommender system interactions (user-item ratings) can be a bit more delicate than the typical randomized partitioning that you might encounter in a standard regression or classification setup, and you will need to think through the process carefully. As a general recipe, we recommend the following:
- Select 60% of users (and all of their interactions) to form the training set.
- Select 20% of users to form the validation set. For each validation user, use half of their interactions for training and hold out the other half for validation. (Remember: you can't predict items for a user with no history at all!)
- Use the remaining 20% of users to form the test set, following the same process as for validation.
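The recipe above can be sketched in plain Python on a small in-memory list of interactions. This is an illustrative version only: the function name `split_by_user` is hypothetical, each interaction is assumed to be a tuple whose first element is the user id, and on the cluster the same logic would need to be expressed with Spark DataFrame operations.

```python
import random
from collections import defaultdict


def split_by_user(interactions, seed=0, train_frac=0.6, val_frac=0.2):
    """Split interaction tuples into train, validation, and test lists.

    Training users contribute all of their interactions to train. Each
    validation or test user contributes half to train and half to the
    corresponding held-out set.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible

    by_user = defaultdict(list)
    for row in interactions:
        by_user[row[0]].append(row)

    users = sorted(by_user)  # sort first so the shuffle is deterministic
    rng.shuffle(users)
    n_train = int(train_frac * len(users))
    n_val = int(val_frac * len(users))
    val_users = set(users[n_train:n_train + n_val])
    test_users = set(users[n_train + n_val:])

    train, val, test = [], [], []
    for user in users:
        rows = by_user[user][:]
        if user in val_users or user in test_users:
            rng.shuffle(rows)
            half = len(rows) // 2
            train.extend(rows[:half])  # observed history for this user
            (val if user in val_users else test).extend(rows[half:])
        else:
            train.extend(rows)
    return train, val, test
```

Saving the three outputs once, rather than re-splitting on every run, keeps validation scores comparable across experiments.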
As mentioned below, it's a good idea to downsample the data when prototyping your implementation.
Downsampling should follow similar logic to partitioning: don't downsample interactions directly.
Instead, sample a percentage of users, and take all of their interactions to make a miniature version of the data.
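A minimal sketch of user-level downsampling, again in plain Python for illustration. The name `downsample_users` is hypothetical, and interactions are assumed to be tuples whose first element is the user id.

```python
import random


def downsample_users(interactions, fraction, seed=0):
    """Keep every interaction for a random `fraction` of users; drop the rest."""
    users = sorted({row[0] for row in interactions})
    if not users:
        return []
    rng = random.Random(seed)
    # Always keep at least one user, and never more than exist.
    n_keep = min(len(users), max(1, round(fraction * len(users))))
    kept = set(rng.sample(users, n_keep))
    return [row for row in interactions if row[0] in kept]
```

Because whole users are kept or dropped, each sampled user's interaction history stays intact, which sampling rows directly would not guarantee.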
Any items not observed during training (i.e., items with no interactions in the training set or in the observed portion of the validation and test users' data) can be omitted unless you're implementing cold-start recommendation as an extension.
In general, users with few interactions (say, fewer than 10) may not provide sufficient data for evaluation, especially after partitioning their observations into train/test. You may discard these users from the experiment, but document your exact steps in the report.
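One possible way to apply such a filter is sketched below. The threshold of 10 mirrors the example in the text, `drop_sparse_users` is a hypothetical helper name, and interactions are assumed to be tuples whose first element is the user id.

```python
from collections import Counter


def drop_sparse_users(interactions, min_interactions=10):
    """Remove all interactions of users with fewer than `min_interactions`."""
    counts = Counter(row[0] for row in interactions)
    return [row for row in interactions if counts[row[0]] >= min_interactions]
```

Applying the filter before splitting means every retained user still has several interactions on each side of the train/held-out partition.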
Once your model is trained, you will need to evaluate its accuracy on the validation and test data. Scores for validation and test should both be reported in your final writeup. Evaluations should be based on the top 500 predicted items for each user.
The choice of evaluation criteria for hyper-parameter tuning is up to you, as is the range of hyper-parameters you consider, but be sure to document your choices in the final report. As a general rule, you should explore ranges of each hyper-parameter that are sufficiently large to produce observable differences in your evaluation score.
In addition to the root-mean-square error (RMSE) metric, Spark provides ranking-based evaluation metrics that you can use to evaluate your implementation. Refer to the ranking metrics section of the documentation for more details. If you like, you may also use additional software implementations of recommendation or ranking metric evaluations, but please cite any additional software you use in the project.
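For intuition about what ranking metrics measure, here is a small pure-Python sketch of precision at k and average precision. These follow common textbook definitions and may differ in normalization details from Spark's implementations, so treat them as a sanity check rather than a replacement. The function names are hypothetical, and each predicted list is assumed to contain no duplicate items.

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predicted items that appear in `relevant`."""
    relevant = set(relevant)
    hits = sum(1 for item in predicted[:k] if item in relevant)
    return hits / k


def average_precision(predicted, relevant):
    """Mean of precision values at each rank where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(predicted, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_over_users(metric, predictions, held_out, **kwargs):
    """Average `metric` over every user that has held-out items.

    `predictions` and `held_out` are dicts mapping user id to item lists.
    """
    scores = [
        metric(predictions.get(user, []), items, **kwargs)
        for user, items in held_out.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

For this project, `k` would be 500 and `held_out` would contain the held-out half of each validation or test user's interactions.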
Start small, and get the entire system working start-to-finish before investing time in hyper-parameter tuning! To avoid overloading the cluster, we recommend starting locally on your own machine and using one of the genre subsets rather than the full dataset.
You may also find it helpful to convert the raw CSV data to parquet format for more efficient access. We recommend doing these steps early on.
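A conversion along these lines might look like the sketch below, which takes an existing `SparkSession` as an argument. The helper name `csv_to_parquet` is hypothetical, and the column types in `INTERACTIONS_SCHEMA` are an assumption inferred from the sample rows rather than a documented schema.

```python
# Column types inferred from the sample rows; treat them as an assumption.
INTERACTIONS_SCHEMA = (
    "user_id INT, book_id INT, is_read INT, rating INT, is_reviewed INT"
)


def csv_to_parquet(spark, csv_path, parquet_path, schema=INTERACTIONS_SCHEMA):
    """Read a CSV file with a header row and rewrite it as parquet.

    `spark` is an existing SparkSession. Supplying an explicit schema
    avoids the extra pass over the data that schema inference requires.
    """
    df = spark.read.csv(csv_path, header=True, schema=schema)
    df.write.mode("overwrite").parquet(parquet_path)
    return df
```

The output path is up to you; once written, later runs can load the data with `spark.read.parquet` instead of re-parsing the CSV.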
You may consider downsampling the data to more rapidly prototype your model. If you do this, be careful that your downsampled data includes enough users from the validation set to test your model.