Factorization machines #172

Open: martincousi wants to merge 69 commits into base branch user_item_features

martincousi

Here is a basic factorization machine (FM) algorithm that takes into account only the user and item ids. With degree=2, it is equivalent to SVD. For testing purposes, I implemented it with both the tffm library and the polylearn library. I found tffm preferable because of the options it offers. To be used with GridSearchCV and RandomizedSearchCV, however, it requires a special value for the session_config argument (see doc).

It is still unclear to me what default values would work well in most settings. Currently, both algorithms appear to be slow, although I would have thought that using tensorflow would be fast...
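
To make the equivalence concrete, here is a minimal pure-Python sketch; it is not the code in this PR. It assumes the usual degree-2 FM equation and one-hot encoded user and item ids. The names `fm_predict`, `one_hot_pair`, and `biased_mf_predict` are hypothetical and chosen only for illustration. With exactly two active features, the pairwise term collapses to a single dot product, so the prediction has the same form as the biased matrix factorization behind SVD.

```python
import random


def fm_predict(x, w0, w, V):
    """Degree-2 FM prediction for a sparse input x given as {column: value}.

    Uses the standard identity
    sum_{a<b} <V_a, V_b> x_a x_b = 0.5 * sum_f ((sum_a V_af x_a)^2 - sum_a (V_af x_a)^2).
    """
    k = len(V[0])
    linear = sum(w[j] * xj for j, xj in x.items())
    pairwise = 0.0
    for f in range(k):
        s = sum(V[j][f] * xj for j, xj in x.items())
        s_sq = sum((V[j][f] * xj) ** 2 for j, xj in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


def one_hot_pair(u, i, n_users):
    """Sparse row with one active user column and one active item column."""
    return {u: 1.0, n_users + i: 1.0}


# Arbitrary illustrative parameters (assumption: not trained on any data).
random.seed(0)
n_users, n_items, k = 4, 5, 3
w0 = 3.5
w = [random.gauss(0, 0.1) for _ in range(n_users + n_items)]
V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users + n_items)]


def biased_mf_predict(u, i):
    """Biased matrix factorization: global mean + user bias + item bias + dot product."""
    p_u, q_i = V[u], V[n_users + i]
    return w0 + w[u] + w[n_users + i] + sum(a * b for a, b in zip(p_u, q_i))


x = one_hot_pair(1, 2, n_users)
print(fm_predict(x, w0, w, V), biased_mf_predict(1, 2))
```

The two printed values should agree up to floating-point rounding, since the global bias plays the role of the mean rating and the linear weights play the role of the user and item biases.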

This PR also contains tests for the feature option to Dataset, Trainset, etc.

I am planning to construct more elaborate factorization machine algorithms. The tests for the factorization machine algorithms will follow.

dumping is now done with pickle 'highest protocol'
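
As a rough illustration of what this commit message describes, pickling with the highest available protocol looks like the sketch below. The helper names `dump_object` and `load_object` and the payload layout are assumptions made for the example, not the library's actual dump code.

```python
import io
import pickle


def dump_object(fileobj, obj):
    """Serialize obj with the highest pickle protocol this interpreter supports."""
    pickle.dump(obj, fileobj, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(fileobj):
    """Load an object written by dump_object; the protocol is detected automatically."""
    return pickle.load(fileobj)


# Hypothetical payload: a list of prediction tuples and a placeholder for the algorithm.
payload = {"predictions": [("u1", "i1", 4.0, 3.7)], "algo": None}

buffer = io.BytesIO()  # stands in for a file opened in binary mode
dump_object(buffer, payload)
buffer.seek(0)
restored = load_object(buffer)
```

One trade-off to keep in mind: a file written with the highest protocol may not be readable by older Python versions that do not support that protocol.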
@martincousi
Author

martincousi commented Apr 25, 2018

I have added three new factorization machine algos. Many more are possible, but most of them can be obtained by using the features. Additional ones could also be conceived once the library supports context (user-item pair features such as timestamp, location, etc.).

I would like these algos to be modular, so that you can turn implicit information, features, etc. on and off. I guess the best way would be to create the sparse lists in FMAlgo and turn the different components on or off in the children. What do you think? Also, should there be many FM objects or only one with multiple options?
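
One possible shape for the "single object with options" variant is sketched below in pure Python. Everything here is hypothetical: `SparseRowBuilder`, its `use_implicit` flag, and the 1/|N(u)| weighting of implicit columns are assumptions for illustration, not the design in this PR. The idea is that one place builds the sparse row and each optional component only appends its own block of columns.

```python
class SparseRowBuilder:
    """Hypothetical helper that builds one sparse input row as {column: value}."""

    def __init__(self, n_users, n_items, user_rated=None, use_implicit=False):
        self.n_users = n_users
        self.n_items = n_items
        # {user id: set of item ids the user has rated}; only used for implicit info
        self.user_rated = user_rated or {}
        self.use_implicit = use_implicit

    @property
    def n_columns(self):
        """Total width of the design matrix for the enabled components."""
        n = self.n_users + self.n_items
        if self.use_implicit:
            n += self.n_items
        return n

    def row(self, u, i):
        """Sparse row for the (user u, item i) pair."""
        # The user and item one-hot blocks are always present.
        x = {u: 1.0, self.n_users + i: 1.0}
        if self.use_implicit:
            # Optional block: items rated by u, weighted so the block sums to 1.
            rated = self.user_rated.get(u, set())
            offset = self.n_users + self.n_items
            for j in rated:
                x[offset + j] = 1.0 / len(rated)
        return x


basic = SparseRowBuilder(n_users=2, n_items=3)
with_implicit = SparseRowBuilder(2, 3, user_rated={0: {1, 2}}, use_implicit=True)
print(basic.row(0, 1))
print(with_implicit.row(0, 1))
```

The alternative with many FM objects would move each optional block into a subclass that overrides `row`; the flag-based version keeps a single interface at the cost of more constructor parameters.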

By the way, the special value for session_config is not needed to do parallelization, at least not on my system.

@NicolasHug
Owner

Thanks a lot,

Once again I really appreciate the efforts with the docs and the tests.

I'm definitely interested in adding FM into surprise! This is a lot of code for me to digest though ^^ and I don't have tons of free time ATM (should be easier in the following months), so I just wanted to make sure you know that the review process may take long.

> should there be many FM objects or only one with multiple options?

I personally like it when there's a single uniform interface to deal with, but it should still be easy to use. For example, if a single class ends up with lots of mutually incompatible parameters, it may be best to split it into different classes. I'll leave it to your judgment to decide what's best here.

Are you actually using the FM algos you implemented? If so, with what dataset? I'd like to play around with them to get a feel of how to use them, that would make the understanding of all the code (especially the feature part) a lot easier for me.

Thanks!

2 participants