
Support sparse matrices for X #240

Merged
merged 11 commits into master from support-sparse-matrix-X
Jul 22, 2022

Conversation

adriangb (Owner)

Closes #239

codecov-commenter commented Aug 11, 2021

Codecov Report

Merging #240 (38feeca) into master (2c8e9e0) will increase coverage by 0.01%.
The diff coverage is 100.00%.

```
@@           Coverage Diff           @@
##           master     #240   +/-   ##
=======================================
  Coverage   98.27%   98.28%
=======================================
  Files           7        7
  Lines         755      759    +4
=======================================
+ Hits          742      746    +4
  Misses         13       13
```

| Impacted Files | Coverage Δ |
| --- | --- |
| scikeras/wrappers.py | 97.53% <100.00%> (+0.02%) ⬆️ |


github-actions bot commented Aug 11, 2021

📝 Docs preview for commit 38feeca at: https://www.adriangb.com/scikeras/refs/pull/240/merge/

adriangb mentioned this pull request Aug 11, 2021

adriangb (Owner, Author) commented

Todo: update docs, maybe an example notebook?

carlogeertse left a comment

Tutorial notebook looks good, explaining why and how to use sparse matrices. See my last comment for some minor suggestions.


The dataset we'll be using is designed to demonstrate a worst-case scenario for dense input features and a best-case scenario for sparse ones.
It consists of a single categorical feature with as many categories as there are rows.
This means the one-hot encoded representation will require as many columns as it has rows, making it very inefficient to store as a dense matrix but very efficient to store as a sparse matrix.
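The dataset the notebook describes can be sketched directly with scipy (a minimal illustration, not the notebook's actual code; the row count here is arbitrary):

```python
# Sketch: one categorical feature with as many categories as rows.
# Its one-hot encoding is an n x n matrix with a single 1 per row --
# enormous as a dense array, tiny as CSR.
import numpy as np
from scipy import sparse

n_rows = 10_000
categories = np.arange(n_rows)  # every row gets its own category

# Build the one-hot encoding by hand: row i has a 1 in column categories[i]
X_sparse = sparse.csr_matrix(
    (np.ones(n_rows), (np.arange(n_rows), categories)),
    shape=(n_rows, n_rows),
)

dense_bytes = n_rows * n_rows * 8  # what a float64 dense array would need
sparse_bytes = (
    X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
)
print(f"dense/sparse storage ratio: {dense_bytes // sparse_bytes}x")
```

The CSR form stores only the nonzero values plus index arrays, so its footprint grows with `n_rows` rather than `n_rows**2`.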

I think this is a perfect example, as data with many categorical features and/or categories per feature seems like the main reason to make use of a sparse matrix. Taking that to the extreme highlights the benefit of sparse matrices well.

```python
%memit sparse_pipeline.fit(X, y)
```

Happy to see that you knew how to properly monitor memory usage.
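For readers following along outside IPython: `%memit` comes from the `memory_profiler` extension. A rough stand-in using only the standard library's `tracemalloc` (which tracks Python-level allocations, including NumPy buffers on modern NumPy, rather than process RSS) might look like:

```python
import tracemalloc
import numpy as np

def peak_mib(fn):
    """Return peak traced memory (MiB) allocated while fn() runs."""
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 2**20

# Stand-in for pipeline.fit: allocating a dense 1000x1000 float64
# array (~7.6 MiB) is the kind of spike %memit would report.
peak = peak_mib(lambda: np.ones((1000, 1000)))
print(f"peak: {peak:.1f} MiB")
```

This is only an approximation of `%memit`'s measurement, but it makes the dense-vs-sparse comparison reproducible in a plain script.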


```python
%memit sparse_pipeline_uint8.fit(X, y)
```

Two more things you might want to add:

  • Monitor and mention the computation time for the sparse variant. While reduced memory usage is an obvious advantage of sparse matrices, in my experiments it came with increased computation time. I'd be curious whether you measure the same in this better-designed setup.
  • Maybe mention using tf.data.Dataset with a generator as an alternative solution to memory issues. I think that works in most cases; it just fails when the SciKeras wrapper is used inside another wrapper or pipeline that doesn't support tf.data.Dataset, which was the reason I needed a sparse matrix.

Not sure how relevant these two suggestions are, so I'll leave it up to you whether to include that information.
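The generator alternative can be sketched without TensorFlow: the idea is to keep the full matrix sparse and densify only one batch at a time, which is exactly the kind of generator `tf.data.Dataset.from_generator` can wrap (helper name and sizes below are hypothetical):

```python
import numpy as np
from scipy import sparse

def dense_batches(X_csr, y, batch_size=256):
    """Yield (dense_batch, y_batch) pairs; only one batch is dense at a time."""
    for start in range(0, X_csr.shape[0], batch_size):
        stop = start + batch_size
        yield X_csr[start:stop].toarray(), y[start:stop]

X = sparse.random(1000, 50, density=0.01, format="csr", random_state=0)
y = np.zeros(1000)

batches = list(dense_batches(X, y))
print(len(batches), batches[-1][0].shape)  # 4 batches; the last has 232 rows
```

As the comment notes, this works only where the consumer accepts a Dataset or generator; a wrapper or pipeline step that validates its input as an array would force the full matrix dense anyway.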

adriangb (Owner, Author)

Great points. I added a run-time measurement and mentioned Datasets; feel free to suggest changes to those sections.

carlogeertse

Those additions look good to me.

adriangb merged commit 8d5e1a9 into master Jul 22, 2022
mattalhonte-srm commented

Heya! Just tested this and it doesn't work for me: converting to LIL makes the container blow up when I try to train. I need the CSR matrix to stay a CSR matrix (TF can do much more efficient math on those!). What worked for me was just passing accept_sparse=True, with no other code changes. Thanks!
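For context, `accept_sparse` is a parameter of scikit-learn's input validation (`sklearn.utils.check_array`); with `accept_sparse=True` a CSR matrix passes through without any format conversion. A minimal sketch of the behavior the commenter relied on:

```python
from scipy import sparse
from sklearn.utils import check_array

X = sparse.random(100, 20, density=0.1, format="csr", random_state=0)

# accept_sparse=True (or an explicit list like ["csr"]) leaves the CSR
# format intact, avoiding a costly conversion such as X.tolil()
checked = check_array(X, accept_sparse=True)
print(checked.format)
```

Passing a list of formats instead of `True` is the stricter option when downstream code can only handle specific sparse layouts.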

adriangb (Owner, Author)

Ouch. I forget why I had to put the conversion in there (I mean, there's a comment, but I'm sure some test failed).

adriangb deleted the support-sparse-matrix-X branch July 23, 2022 03:20
adriangb restored the support-sparse-matrix-X branch July 23, 2022 03:20
Development

Successfully merging this pull request may close these issues.

Use of sparse matrices
4 participants