Support sparse matrices for X #240
Codecov Report
@@ Coverage Diff @@
## master #240 +/- ##
=======================================
Coverage 98.27% 98.28%
=======================================
Files 7 7
Lines 755 759 +4
=======================================
+ Hits 742 746 +4
Misses 13 13
📝 Docs preview for commit 38feeca at: https://www.adriangb.com/scikeras/refs/pull/240/merge/
Todo: update docs, maybe an example notebook?
Tutorial notebook looks good, explaining why and how to use sparse matrices. See my last comment for some minor suggestions.
The dataset we'll be using is designed to demonstrate a worst-case/best-case scenario for dense and sparse input features respectively.
It consists of a single categorical feature with as many categories as rows.
This means the one-hot encoded representation will require as many columns as it does rows, making it very inefficient to store as a dense matrix but very efficient to store as a sparse matrix.
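To make the storage argument concrete, here is a minimal sketch that builds such a one-hot matrix by hand and compares its CSR footprint with the size a dense array would need. The names (`n_rows`, `categories`, `csr_nbytes`) and the row count are illustrative assumptions, not code from the notebook, and the matrix is built directly with SciPy rather than with the pipeline's encoder.

```python
import numpy as np
from scipy import sparse

n_rows = 1_000  # illustrative size, not the notebook's
rng = np.random.default_rng(0)

# One categorical feature where every row has its own category:
# category ids are a permutation of 0..n_rows-1.
categories = rng.permutation(n_rows)

# One-hot encoding: row i has a single 1 in column categories[i].
rows = np.arange(n_rows)
data = np.ones(n_rows, dtype=np.float64)
one_hot = sparse.csr_matrix((data, (rows, categories)), shape=(n_rows, n_rows))


def csr_nbytes(m):
    """Bytes held by the three arrays that make up a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


sparse_bytes = csr_nbytes(one_hot)
# Size of the equivalent dense array, computed without allocating it.
dense_bytes = n_rows * n_rows * data.itemsize

print(f"sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```

The dense size grows with the square of the number of rows, while the CSR arrays grow linearly, which is why this dataset is the extreme case for both representations.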
I think this is a perfect example, as data with many categorical features and/or categories per feature seems like the main reason to make use of a sparse matrix. Taking that to the extreme highlights the benefit of sparse matrices well.
```python
%memit sparse_pipeline.fit(X, y)
```
Happy to see that you did know how to properly monitor memory usage.
```python
%memit sparse_pipline_uint8.fit(X, y)
```
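The pipeline name suggests a one-hot encoding stored as `uint8`. Assuming that is the intent, the sketch below shows why a narrower dtype helps: only the CSR `data` array shrinks, while the index arrays stay the same. The matrix and names here are hypothetical stand-ins, not the notebook's pipeline.

```python
import numpy as np
from scipy import sparse

n = 1_000  # illustrative size

# Stand-in for a one-hot encoded matrix: exactly one nonzero per row.
X_float64 = sparse.identity(n, dtype=np.float64, format="csr")

# One-hot values are only 0 or 1, so they fit in a single byte.
X_uint8 = X_float64.astype(np.uint8)

print("float64 data bytes:", X_float64.data.nbytes)
print("uint8 data bytes:  ", X_uint8.data.nbytes)
print("index bytes (unchanged):", X_uint8.indices.nbytes + X_uint8.indptr.nbytes)
```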
Two more things you might want to add:
- Monitor and mention the computation time for the sparse variant. While reduced memory usage is an obvious advantage of using sparse matrices, it came with increased computation time (in my experiments at least). I'd be curious to see whether you measure the same in this somewhat better experimental setup.
- Maybe mention the use of tf.data.Dataset along with a generator as an alternative solution to memory issues. I think this will work in most use cases. It just won't work when the scikeras wrapper is used inside another wrapper or pipeline that doesn't support tf.data.Dataset, which was the reason I needed to use a sparse matrix.
Not sure how relevant these two suggestions are, so I'll leave it up to you whether you want to include that information.
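As a rough illustration of the generator idea, the sketch below yields dense mini-batches from a sparse matrix so that only one slice is densified at a time. The function name `dense_batches` and the toy data are assumptions for illustration. A generator like this could in principle be handed to `tf.data.Dataset.from_generator`, but that step is left out here so the example depends only on NumPy and SciPy.

```python
import numpy as np
from scipy import sparse


def dense_batches(X, y, batch_size):
    """Yield (features, targets) batches, densifying one slice of X at a time."""
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        # Row slicing is efficient on CSR; only this slice becomes dense.
        yield X[start:stop].toarray(), y[start:stop]


# Toy data: 10 rows, one nonzero per row.
X = sparse.identity(10, dtype=np.uint8, format="csr")
y = np.arange(10)

batches = list(dense_batches(X, y, batch_size=4))
for features, targets in batches:
    print(features.shape, targets)
```

Peak memory then scales with the batch size rather than with the full dense matrix, at the cost of repeated slicing and conversion.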
Great points. I added a run time measurement and mentioned Datasets. Feel free to suggest changes to those sections.
Those additions look good to me.
Heya! Just tested this and it doesn't work for me - converting it to lil makes the container blow up when I try to train, I need the csr matrix to stay a CSR matrix (TF can do much more efficient math on those!). The thing that worked for me was just passing |
Ouch. I forget why I had to put the conversion in there (I mean, there's a comment, but I'm sure some test failed).
Closes #239