
Restricted Boltzmann Machine implementation in Tensorflow #390

Merged
merged 53 commits into staging on Jan 31, 2019
Conversation

@WessZumino (Contributor) commented Dec 28, 2018

Description

RBM implementation of collaborative filtering in TF. Although the code runs on a CPU in a reasonable time (a few minutes), it is advised to use a GPU-enabled DSVM. All time metrics reported in the notebooks have been obtained on a single-GPU DSVM.

Motivation and Context

It adds neural-network-based, probabilistic CF capabilities.

Checklist:

  • My code follows the code style of this project, as detailed in our contribution guidelines.
  • I have added tests.
  • I have updated the documentation accordingly.

@yueguoguo (Collaborator) left a comment

This is awesome! A few minor comments for you to consider.

Review comments (now resolved) were left on:

  • reco_utils/dataset/rbm_splitters.py
  • reco_utils/recommender/rbm/Mrbm_tensorflow.py
  • notebooks/00_quick_start/rbm_movielens.ipynb
  • notebooks/02_model/rbm_deep_dive.ipynb
@anargyri (Collaborator) commented Dec 28, 2018

Why are there separate RBM splitter functions instead of putting these in python_splitters.py? The splitting functionality is more general than RBM, isn't it? What is not covered in the existing splitting methods?
Could you sync up with Le about the approach we have followed?

@anargyri (Collaborator) commented Dec 28, 2018

I see a test for splitters but is there a test for the core RBM functionality? Could you add smoke and integration tests too?

@WessZumino (Contributor, author) commented Dec 29, 2018

@anargyri yes, you are right, the splitting functionality is partially independent of the algorithm one uses afterwards. Nevertheless, when using the RBM it is essential to have the same user/item matrix structure in both the train and the test set. Using the random splitter does not work: indeed, I got errors and/or inconsistent results when using it. The stratified splitter, on the other hand, does the job, but my understanding is that it splits by time and is therefore deterministic, while a random splitter usually improves the generalizability of the results.

I found it more useful to split the data in a different way, with a kind of hybrid stratified/random splitter. The advantages are:

  1. It returns the same user/item matrix structure in both the train and the test set.
  2. It preserves the shape of the rating distribution in the train/test sets by applying a local (per-user) relative splitting ratio, besides the global one.
  3. It randomizes the sample, adding extra stability.
  4. Splitting the numpy arrays directly is faster than using pandas DataFrames in this case.

Due to the last point, I preferred to keep it separate from python_splitters.py, where all operations are performed on DataFrames.
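The per-user hybrid split described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual rbm_splitters.py implementation: the function name is mine, and a dense matrix is used instead of a sparse one for simplicity.

```python
import numpy as np

def split_affinity_matrix(X, ratio=0.75, seed=42):
    """Illustrative per-user split of a user/item rating matrix X.

    For each user, a fixed fraction of their observed ratings is moved
    at random into the test matrix, so train and test share the same
    user/item dimensions and the per-user rating distribution.
    Zeros denote unrated items.
    """
    rng = np.random.default_rng(seed)
    Xtr = X.copy()
    Xtst = np.zeros_like(X)
    for u in range(X.shape[0]):
        rated = np.flatnonzero(X[u])             # indices of rated items
        n_test = int(len(rated) * (1 - ratio))   # per-user test size
        test_items = rng.choice(rated, size=n_test, replace=False)
        Xtst[u, test_items] = X[u, test_items]   # copy held-out ratings
        Xtr[u, test_items] = 0                   # mask them in train
    return Xtr, Xtst

X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4]], dtype=float)
Xtr, Xtst = split_affinity_matrix(X, ratio=0.5)
```

By construction the two matrices have the same shape as X, and their element-wise sum recovers X, which is what guarantees the "same user/item matrix structure" property mentioned above.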

@WessZumino (Contributor, author) commented

@anargyri yes, you are right; I am working on the tests for the core algorithm. This is more involved due to the complexity of the algorithm, but I will add them as soon as possible. Apart from that, all functions, methods and results have been extensively tested.

@miguelgfierro (Collaborator) commented

Agree with Andreas: splitters should be general functions, not tied to a specific algorithm.
In the past we discussed using DataFrames vs. numpy arrays; we had a lot of code based on DataFrames so we left it like that, but as you said numpy should be faster.

I think a middle-ground approach would be to have both numpy and DataFrame splitters, keeping each one explicit and with a single responsibility.

@anargyri (Collaborator) commented

> @anargyri yes, you are right, the splitting functionality is partially independent of the algorithm one uses afterwards. Nevertheless, when using the RBM it is essential to have the same user/item matrix structure in both the train and the test set. Using the random splitter does not work: indeed, I got errors and/or inconsistent results when using it. The stratified splitter, on the other hand, does the job, but my understanding is that it splits by time and is therefore deterministic, while a random splitter usually improves the generalizability of the results.

I suspect the issue you refer to is more general than RBM and applies to collaborative filtering, ALS etc. too. I don't see why stratified vs. random etc. is specific to RBM. In any case, we have had plans to add more splitters that cover all cases in the diagram. Feel free to discuss with Le and Miguel about how to add any methods that you need.

Splitting is tricky with all algorithms, but there are reasons to keep it separate from the algorithms. Apart from software best practices like modularity, a more important reason is that splitting relates to how you evaluate an algorithm offline. Hence it has to be suited to the use case and the data scientist needs to have flexibility to choose different splitting methods i.e. we should not force them to use a specific splitter when doing RBM. There are constraints of course and issues like cold users / items, but the data scientist should have flexibility to choose among splitters the ones that are applicable and preferable for their use case. Moreover, in order to be able to compare different algorithms in a fair way, the splitters should be independent of the algorithms.

> Due to the last point, I preferred to keep it separate from python_splitters.py, where all operations are performed on DataFrames.

This is a valid point but it could be addressed in another way. You could add a numpy_splitters.py similar to what we did with python_splitters.py and spark_splitters.py.

@WessZumino (Contributor, author) commented

> I suspect the issue you refer to is more general than RBM and applies to collaborative filtering, ALS etc. too. I don't see why stratified vs. random etc. is specific to RBM. In any case, we have had plans to add more splitters that cover all cases in the diagram. Feel free to discuss with Le and Miguel about how to add any methods that you need.

Indeed, it is more general than RBM and applies to all CF algorithms. If I understood the issue correctly, I can just rename the splitter to numpy_splitters.py as suggested by you and @miguelgfierro.

> Moreover, in order to be able to compare different algorithms in a fair way, the splitters should be independent of the algorithms.

The splitter is independent of the RBM, and it can be used with ALS and SAR as well. The only constraint, as I said before, is that the train/test sets should contain the same number of users/items (to be more precise, the only constraint is that the number of items is conserved). For this reason, the output of the splitter is somewhat redundant, as it generates the train/test sets both as numpy sparse matrices and as pandas DataFrames; the latter can be used for ALS and SAR. It is on my to-do list to check how the performance of these two algorithms is affected by the splitter.

I think my problem was as follows:

The splitter works directly on the user/item affinity matrix X. This means that the workflow should be:

  1. first generate X from the DataFrame, then apply the splitter to obtain Xtr and Xtst (both numpy matrices)
  2. feed the matrices to the algorithm for training and prediction

However, the existing algorithms in the repo follow the inverse workflow:

  1. first split the data from the DataFrame to obtain tr_df and test_df (both DataFrames)
  2. use the DataFrames directly in the algorithm, where X_tr and X_tst are obtained.

This is the reason why I decided to keep my splitter separate from the available ones. Still, as the metrics are evaluated on a DataFrame, I need to convert everything back at the end, which takes additional time.
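The matrix-first workflow, including the conversion back to a DataFrame for the metrics, might look roughly like this. The column names (userID, itemID, rating) and the pivot/stack round trip are illustrative assumptions, not the repo's actual API.

```python
import numpy as np
import pandas as pd

# A tiny illustrative ratings DataFrame (column names are assumptions).
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "itemID": [10, 20, 10, 30, 20],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0],
})

# 1. DataFrame -> user/item affinity matrix X (unrated pairs become 0).
X = ratings.pivot(index="userID", columns="itemID", values="rating").fillna(0)

# 2. Split X here into Xtr / Xtst (e.g. with a per-user splitter), then
#    feed the matrices to the model for training and prediction.

# 3. Matrix -> DataFrame again, dropping the zero (unrated) entries, so
#    the standard DataFrame-based evaluation metrics can be applied.
back = (X.stack().rename("rating").reset_index()
          .query("rating > 0"))
```

The round trip recovers exactly the original (user, item, rating) triples, which is the extra conversion step (and extra time) mentioned above.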

@anargyri (Collaborator) commented

These notebooks look very nice! Great job!

@WessZumino (Contributor, author) commented

Thanks @anargyri for carefully reading the notebooks and for the suggested fixes/improvements!

@WessZumino (Contributor, author) commented

I have added the unit tests for the RBM. Thank you @miguelgfierro, @anargyri and @yueguoguo for all the great suggestions, fixes and improvements!

@miguelgfierro miguelgfierro merged commit f05e3c5 into staging Jan 31, 2019
@miguelgfierro miguelgfierro deleted the Mrbm branch January 31, 2019 16:34
4 participants