
Restricted Boltzmann Machine implementation in Tensorflow #390

Merged
merged 53 commits into staging on Jan 31, 2019
Conversation

@WessZumino (Contributor) commented Dec 28, 2018

Description

RBM implementation of collaborative filtering in TF. Although the code runs on a CPU in a reasonable time (a few minutes), it is advised to use a GPU-enabled DSVM. All time metrics reported in the notebooks have been obtained on a single-GPU DSVM.

Motivation and Context

It adds neural-network-based, probabilistic CF capabilities.

Checklist:

  • My code follows the code style of this project, as detailed in our contribution guidelines.
  • I have added tests.
  • I have updated the documentation accordingly.

@yueguoguo (Collaborator) left a comment

This is awesome! A few minor comments for you to consider.

Review comments (now resolved) were left on:

  • reco_utils/dataset/rbm_splitters.py
  • reco_utils/recommender/rbm/Mrbm_tensorflow.py
  • notebooks/00_quick_start/rbm_movielens.ipynb
  • notebooks/02_model/rbm_deep_dive.ipynb
@anargyri (Collaborator) commented Dec 28, 2018

Why are there separate RBM splitter functions instead of putting these in python_splitters.py? The splitting functionality is more general than RBM, isn't it? What is not covered in the existing splitting methods?
Could you sync up with Le about the approach we have followed?

@anargyri (Collaborator) commented Dec 28, 2018

I see a test for splitters but is there a test for the core RBM functionality? Could you add smoke and integration tests too?

@WessZumino (Contributor, author) commented Dec 29, 2018

@anargyri yes, you are right, the splitting functionality is partially independent of the algorithm one uses afterwards. Nevertheless, when using the RBM it is essential to have the same user/item matrix structure in both the train and the test set. Using the random splitter does not work: indeed, I got errors and/or inconsistent results when using it. The stratified splitter, on the other hand, does the job, but my understanding is that it splits by time and is therefore deterministic, while a random splitter usually improves the generalizability of the results.

I found it more useful to split the data in a different way, with a kind of hybrid stratified/random splitter. The advantages are:

  1. It returns the same user/item matrix structure in both the train and the test set.
  2. It preserves the shape of the rating distribution in the train/test sets by applying a local (per-user) relative splitting ratio, besides the global one.
  3. It randomizes the sample, adding extra stability.
  4. Splitting the numpy arrays directly is faster than using pandas DataFrames in this case.

Due to the last point, I preferred to keep it separate from python_splitters.py, where all operations are performed on DataFrames.
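The per-user hybrid split described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual rbm_splitters.py implementation: the function name is mine, and a dense matrix is used instead of a sparse one for simplicity.

```python
import numpy as np

def split_affinity_matrix(X, ratio=0.75, seed=42):
    """Illustrative per-user split of a user/item rating matrix X.

    For each user, a fixed fraction of their observed ratings is moved
    at random into the test matrix, so train and test share the same
    user/item dimensions and the per-user rating distribution.
    Zeros denote unrated items.
    """
    rng = np.random.default_rng(seed)
    Xtr = X.copy()
    Xtst = np.zeros_like(X)
    for u in range(X.shape[0]):
        rated = np.flatnonzero(X[u])             # indices of rated items
        n_test = int(len(rated) * (1 - ratio))   # per-user test size
        test_items = rng.choice(rated, size=n_test, replace=False)
        Xtst[u, test_items] = X[u, test_items]   # copy held-out ratings
        Xtr[u, test_items] = 0                   # mask them in train
    return Xtr, Xtst

X = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4]], dtype=float)
Xtr, Xtst = split_affinity_matrix(X, ratio=0.5)
```

By construction the two matrices have the same shape as X, and their element-wise sum recovers X, which is what guarantees the "same user/item matrix structure" property mentioned above.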

@WessZumino (Contributor, author) commented

@anargyri yes, you are right; I am working on the tests for the core algorithm. This is more involved due to the complexity of the algorithm, but I will add them as soon as possible. Apart from that, all functions, methods and results have been extensively tested.

@miguelgfierro (Collaborator) commented

Agree with Andreas: splitters should be general functions, not tied to a specific algorithm.
In the past we discussed using DataFrames vs. numpy arrays; we had a lot of code based on DataFrames so we left it like that, but as you said numpy should be faster.

I think a middle-ground approach would be to have both numpy and DataFrame splitters, keeping each one explicit and with a single responsibility.

@anargyri (Collaborator) commented

> @anargyri yes, you are right, the splitting functionality is partially independent of the algorithm one uses afterwards. Nevertheless, when using the RBM it is essential to have the same user/item matrix structure in both the train and the test set. Using the random splitter does not work: indeed, I got errors and/or inconsistent results when using it. The stratified splitter, on the other hand, does the job, but my understanding is that it splits by time and is therefore deterministic, while a random splitter usually improves the generalizability of the results.

I suspect the issue you refer to is more general than RBM and applies to collaborative filtering, ALS etc. too. I don't see why stratified vs. random etc. is specific to RBM. In any case, we have had plans to add more splitters that cover all cases in the diagram. Feel free to discuss with Le and Miguel about how to add any methods that you need.

Splitting is tricky with all algorithms, but there are reasons to keep it separate from the algorithms. Apart from software best practices like modularity, a more important reason is that splitting relates to how you evaluate an algorithm offline. Hence it has to be suited to the use case and the data scientist needs to have flexibility to choose different splitting methods i.e. we should not force them to use a specific splitter when doing RBM. There are constraints of course and issues like cold users / items, but the data scientist should have flexibility to choose among splitters the ones that are applicable and preferable for their use case. Moreover, in order to be able to compare different algorithms in a fair way, the splitters should be independent of the algorithms.

> Due to the last point, I preferred to keep it separate from python_splitters.py, where all operations are performed on DataFrames.

This is a valid point but it could be addressed in another way. You could add a numpy_splitters.py similar to what we did with python_splitters.py and spark_splitters.py.

@WessZumino (Contributor, author) commented

> I suspect the issue you refer to is more general than RBM and applies to collaborative filtering, ALS etc. too. I don't see why stratified vs. random etc. is specific to RBM. In any case, we have had plans to add more splitters that cover all cases in the diagram. Feel free to discuss with Le and Miguel about how to add any methods that you need.

Indeed, it is more general than RBM and applies to all CF algorithms. If I understood the issue correctly, I can just rename the splitter to numpy_splitters.py as suggested by you and @miguelgfierro.

> Moreover, in order to be able to compare different algorithms in a fair way, the splitters should be independent of the algorithms.

The splitter is independent of the RBM, and it can be used with ALS and SAR as well. The only constraint, as I said before, is that the train/test sets should contain the same number of users/items (to be more precise, the only constraint is that the number of items is conserved). For this reason, the output of the splitter is somewhat redundant, as it generates the train/test sets both as numpy sparse matrices and as pandas DataFrames; the latter can be used for ALS and SAR. It is on my to-do list to check how the performance of these two algorithms is affected by the splitter.

I think my problem was as follows:

The splitter works directly on the user/item affinity matrix X. This means that the workflow should be:

  1. first generate X from the DataFrame, then apply the splitter to obtain Xtr and Xtst (both numpy matrices)
  2. feed the matrices to the algorithm for training and prediction

However, the existing algorithms in the repo follow the inverse workflow:

  1. first split the data from the DataFrame to obtain tr_df and test_df (both DataFrames)
  2. use the DataFrames directly in the algorithm, where X_tr and X_tst are obtained.

This is the reason why I decided to keep my splitter separate from the available ones. Still, as the metrics are evaluated on a DataFrame, I need to convert everything back at the end, which takes additional time.
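The matrix-first workflow, including the conversion back to a DataFrame for the metrics, might look roughly like this. The column names (userID, itemID, rating) and the pivot/stack round trip are illustrative assumptions, not the repo's actual API.

```python
import numpy as np
import pandas as pd

# A tiny illustrative ratings DataFrame (column names are assumptions).
ratings = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "itemID": [10, 20, 10, 30, 20],
    "rating": [5.0, 3.0, 4.0, 2.0, 1.0],
})

# 1. DataFrame -> user/item affinity matrix X (unrated pairs become 0).
X = ratings.pivot(index="userID", columns="itemID", values="rating").fillna(0)

# 2. Split X here into Xtr / Xtst (e.g. with a per-user splitter), then
#    feed the matrices to the model for training and prediction.

# 3. Matrix -> DataFrame again, dropping the zero (unrated) entries, so
#    the standard DataFrame-based evaluation metrics can be applied.
back = (X.stack().rename("rating").reset_index()
          .query("rating > 0"))
```

The round trip recovers exactly the original (user, item, rating) triples, which is the extra conversion step (and extra time) mentioned above.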

@anargyri (Collaborator) commented

These notebooks look very nice! Great job!

@WessZumino (Contributor, author) commented

Thanks @anargyri for carefully reading the notebooks and for the suggested fixes/improvements!

@WessZumino (Contributor, author) commented

I have added the unit tests for the RBM. Thank you @miguelgfierro, @anargyri and @yueguoguo for all the great suggestions, fixes and improvements!

@miguelgfierro miguelgfierro merged commit f05e3c5 into staging Jan 31, 2019
@miguelgfierro miguelgfierro deleted the Mrbm branch January 31, 2019 16:34
4 participants