Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse get dummies perf #21997

Merged
merged 4 commits into from
Jul 20, 2018

Conversation

TomAugspurger
Copy link
Contributor

Previously, we did a scalar elem == -1 for every element in the ndarray.

This replaces that check with a vectorized array == -1.

Running the ASV now. In the meantime, here's a simple timeit on the same problem

# HEAD
In [3]: %timeit pd.get_dummies(s, sparse=True)
561 ms ± 4.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Master
In [3]: %timeit pd.get_dummies(s, sparse=True)
        2.18 s ± 273 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@TomAugspurger TomAugspurger added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type labels Jul 20, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Jul 20, 2018
@pep8speaks
Copy link

pep8speaks commented Jul 20, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on July 20, 2018 at 15:48 Hours UTC

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jul 20, 2018

Here's the ASV (only a 3x speedup).

[100.00%] ··· Running reshape.GetDummies.time_get_dummies_1d_sparse                                                                                    1.79s       before           after         ratio
     [272bbdc7]       [bc658b03]
-           1.79s       1.05±0.02s     0.59  reshape.GetDummies.time_get_dummies_1d_sparse

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

dtype=pd.api.types.CategoricalDtype(categories))
self.s = s

def time_get_dummies_1d(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nit: you can param over sparce=False/True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a slight preference for leaving them separate, since they're such distinct code paths and it's a tad easier to run just sparse with this layout. Happy to change if you feel strongly about this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, no strong preference to use params then.

@jreback
Copy link
Contributor

jreback commented Jul 20, 2018

lgtm.

@jreback jreback merged commit 322dbf4 into pandas-dev:master Jul 20, 2018
@jreback
Copy link
Contributor

jreback commented Jul 20, 2018

thanks!

Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants