Restricted Boltzmann Machine implementation in Tensorflow #390
Conversation
This is awesome! A few minor comments for you to consider.
Why are there separate RBM splitter functions instead of putting these in …?
I see a test for the splitters, but is there a test for the core RBM functionality? Could you add smoke and integration tests too?
@anargyri yes, you are right: the splitting functionality is partially independent of the algorithm one uses afterwards. Nevertheless, it is essential to have the same user/item matrix structure in both the train and the test set when using the RBM. Using the random splitter does not work, and indeed I got errors and/or inconsistent results when using it. The stratified splitter, on the other hand, does the job, but my understanding is that it splits by time and is therefore deterministic, while a random splitter usually improves the generalizability of the results. I found it more useful to split the data in a different way, a kind of hybrid stratified/random splitter. The advantages are:
Due to the last point, I preferred to keep it separate from the python_splitter, where all operations are performed on DataFrames.
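To make the idea concrete, here is a minimal sketch of what such a hybrid stratified/random split could look like; the function name and signature are illustrative assumptions, not the implementation in this PR. For every user, a random fraction of their observed ratings is masked out into the test matrix, so train and test keep exactly the same user/item dimensions:

```python
import numpy as np
from scipy.sparse import csr_matrix

def hybrid_stratified_random_split(X, test_ratio=0.25, seed=42):
    """Illustrative sketch (not this PR's code): split a sparse user/item
    affinity matrix X so that train and test share the same shape."""
    rng = np.random.default_rng(seed)
    X = csr_matrix(X)
    train = X.copy().tolil()
    test = csr_matrix(X.shape, dtype=X.dtype).tolil()

    for u in range(X.shape[0]):
        items = X[u].indices                       # items rated by user u
        n_test = int(np.floor(len(items) * test_ratio))
        if n_test == 0:
            continue                               # users with few ratings stay train-only
        held_out = rng.choice(items, size=n_test, replace=False)
        test[u, held_out] = X[u, held_out].toarray().ravel()
        train[u, held_out] = 0

    return train.tocsr(), test.tocsr()
```

Because the sampling is done per user, the split is stratified by user but random within each user's history, which is roughly the hybrid behaviour described above.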
@anargyri yes, you are right, I am working on the tests for the core algorithm. This is more involved due to the complexity of the algorithm, but I will add them asap. Apart from that, all functions, methods and results have been tested extensively.
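For reference, a smoke test for the core algorithm might look roughly like the sketch below; the module path, class name, constructor arguments and methods are hypothetical placeholders and would need to be adapted to the actual API in this PR.

```python
import numpy as np
import pytest

@pytest.mark.smoke
def test_rbm_fit_and_predict_shapes():
    # Hypothetical import path and class name -- adjust to the real ones.
    from reco_utils.recommender.rbm.rbm import RBM

    rng = np.random.default_rng(0)
    # Tiny synthetic user/item rating matrix with integer ratings 0-5.
    X = rng.integers(0, 6, size=(20, 15)).astype(np.float32)

    # Hypothetical parameters: a small model trained for a couple of epochs.
    model = RBM(hidden_units=10, epochs=2, seed=0)
    model.fit(X)
    scores = model.predict(X)

    # A smoke test only checks that training runs and the output is sane.
    assert scores.shape == X.shape
    assert np.isfinite(scores).all()
```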
I agree with Andreas: splitters should be general functions, not tied to a specific algorithm. I think a middle-ground approach would be to have numpy and DataFrame splitters, following the single responsibility principle and keeping things explicit.
I suspect the issue you refer to is more general than RBM and applies to collaborative filtering, ALS etc. too. I don't see why stratified vs. random splitting is specific to RBM. In any case, we have had plans to add more splitters that cover all the cases in the diagram. Feel free to discuss with Le and Miguel how to add any methods you need. Splitting is tricky with all algorithms, but there are reasons to keep it separate from the algorithms. Apart from software best practices like modularity, a more important reason is that splitting relates to how you evaluate an algorithm offline. Hence it has to be suited to the use case, and the data scientist needs the flexibility to choose different splitting methods, i.e. we should not force them to use a specific splitter when doing RBM. There are constraints of course, and issues like cold users/items, but the data scientist should be free to choose among the splitters the ones that are applicable and preferable for their use case. Moreover, in order to compare different algorithms fairly, the splitters should be independent of the algorithms.
This is a valid point, but it could be addressed in another way. You could add a …
Indeed, it is more general than RBM and it applies to all CF algorithms. If I understood the issue correctly, I can just rename the splitter as …
The splitter is independent of the RBM, and it can be used with ALS and SAR as well. The only constraint, as I said before, is that the train/test sets should contain the same number of users/items (to be more precise, the only constraint is that the number of items is conserved). For this reason, the output of the splitter is somewhat redundant, as it generates the train/test sets both as numpy sparse matrices and as pandas DataFrames; the latter can be used for ALS and SAR. It is on my to-do list to check how the performance of these two algorithms is affected by the splitter. I think my problem was as follows: the splitter works directly on the user/item affinity matrix X. This means that the workflow should be:
However, the existing algos in the repo follow the inverse workflow:
This is the reason why I decided to keep this splitter separate from the available ones. Still, as the metrics are evaluated from a DataFrame, I need to convert everything back at the end, which takes additional time.
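A small sketch of the conversion the two workflows imply (the userID/itemID/rating column names are assumptions for illustration, not necessarily the repo's exact schema):

```python
import pandas as pd
from scipy.sparse import coo_matrix

def df_to_affinity(df, user_col="userID", item_col="itemID", rating_col="rating"):
    """DataFrame -> sparse user/item affinity matrix (the splitter's input)."""
    users = {u: i for i, u in enumerate(df[user_col].unique())}
    items = {m: j for j, m in enumerate(df[item_col].unique())}
    X = coo_matrix(
        (df[rating_col].values,
         (df[user_col].map(users).values, df[item_col].map(items).values)),
        shape=(len(users), len(items)),
    ).tocsr()
    return X, users, items

def affinity_to_df(X, users, items, user_col="userID", item_col="itemID", rating_col="rating"):
    """Sparse affinity matrix -> DataFrame (needed so the existing metrics apply)."""
    coo = X.tocoo()
    inv_u = {i: u for u, i in users.items()}
    inv_i = {j: m for m, j in items.items()}
    return pd.DataFrame({
        user_col: [inv_u[i] for i in coo.row],
        item_col: [inv_i[j] for j in coo.col],
        rating_col: coo.data,
    })
```

The extra back-and-forth between DataFrame and matrix is exactly the additional conversion cost mentioned above.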
These notebooks look very nice! Great job!
Thanks @anargyri for carefully reading the notebooks and for the suggested fixes/improvements!
I have added the unit tests for the RBM. Thank you @miguelgfierro, @anargyri and @yueguoguo for all the great suggestions, fixes and improvements!
Description
RBM implementation of collaborative filtering in TF. Although the code runs on a CPU in a reasonable time (a few minutes), it is advised to use a GPU-enabled DSVM. All time metrics reported in the notebooks have been obtained on a single-GPU DSVM.
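For readers running the notebooks, a quick way to check that TensorFlow actually sees a GPU on the DSVM (otherwise training silently falls back to CPU and the reported timings will not match):

```python
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; an empty list means CPU-only execution.
gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
print("GPUs visible to TensorFlow:", gpus or "none")
```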
Motivation and Context
It adds neural-network-based, probabilistic CF capabilities.
Checklist: