PERF: faster indexing for non-fastpath groupby ops #34214

Merged: 2 commits, May 17, 2020

jbrockmendel
Member

Per discussions about removing libreduction code, this is part of an effort to make the non-libreduction path more performant.

Performance comparisons were made by disabling fast_apply entirely and profiling the two most-affected asv benchmarks:

import numpy as np
from pandas import DataFrame

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)

%prun -s cumtime df.groupby(["key", "key2"]).apply(lambda x: 1)
PR -> 0.263 s
No optimization -> 0.308 s
master -> 0.039 s

%prun -s cumtime df.groupby("key").apply(lambda x: 1)
PR -> 0.083 s
No optimization -> 0.127 s
master -> 0.012 s
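
For anyone who wants to rerun the two calls outside IPython, here is a rough sketch using the standard-library timeit module instead of %prun. It only times whichever apply path the installed pandas takes; it does not disable fast_apply, so the numbers will not line up with the "PR" or "No optimization" rows. The fixed seed, the helper names run_single and run_multi, and the warning filter are my additions, not part of the PR.

```python
import timeit
import warnings

import numpy as np
from pandas import DataFrame

N = 10 ** 4
rng = np.random.default_rng(0)  # fixed seed (an addition, for repeatable data)
df = DataFrame(
    {
        "key": rng.integers(0, 2000, size=N),
        "key2": rng.integers(0, 3, size=N),
        "value1": rng.standard_normal(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)


def run_single():
    # Single-key case from the PR description.
    return df.groupby("key").apply(lambda x: 1)


def run_multi():
    # Two-key case from the PR description.
    return df.groupby(["key", "key2"]).apply(lambda x: 1)


with warnings.catch_warnings():
    # Newer pandas versions may warn about grouping columns in apply.
    warnings.simplefilter("ignore")
    single_s = min(timeit.repeat(run_single, number=1, repeat=3))
    multi_s = min(timeit.repeat(run_multi, number=1, repeat=3))
    single_result = run_single()
    multi_result = run_multi()

print(f"groupby('key').apply: best of 3 = {single_s:.3f} s")
print(f"groupby(['key', 'key2']).apply: best of 3 = {multi_s:.3f} s")
```

Taking the minimum of a few repeats reduces noise from other processes; %prun remains the better tool for seeing where the time goes.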

jbrockmendel added the Groupby and Performance (memory or execution speed performance) labels on May 17, 2020
jreback added this to the 1.1 milestone on May 17, 2020
jreback merged commit 6f065b6 into pandas-dev:master on May 17, 2020
jbrockmendel deleted the slow-apply branch on May 17, 2020 21:35
Japanuspus added a commit to Japanuspus/pandas that referenced this pull request Aug 12, 2020
This bug is a regression in v1.1.0 and was introduced by the fix for GH-34214 in commit [6f065b].

The underlying cause is that the `*Splitter` classes do not use the `._constructor` property and do not call `__finalize__`.

Please note that the method name used for the `__finalize__` calls was my best guess, since documentation for the value has been hard to find.

[6f065b]: pandas-dev@6f065b6
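
To illustrate what the commit message means by `._constructor` and `__finalize__`, here is a minimal sketch of how a DataFrame subclass keeps its type and metadata when a derived frame is built. It does not reproduce the `*Splitter` code or the regression itself. TaggedFrame, its `tag` attribute, and the `method="groupby"` label are hypothetical names chosen for the example.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass carrying one piece of metadata."""

    # Attribute names listed here are copied between objects by __finalize__.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        # Tells pandas which class to use when building derived frames.
        return TaggedFrame


df = TaggedFrame({"key": [1, 1, 2], "value": [1.0, 2.0, 3.0]})
df.tag = "demo"

rows = {"key": [1, 1], "value": [1.0, 2.0]}

# Building a piece with the base class drops the subclass and its metadata.
plain_piece = pd.DataFrame(rows)

# Going through _constructor and __finalize__ keeps both.
# The method string is only an illustrative placeholder label.
kept_piece = df._constructor(rows).__finalize__(df, method="groupby")

print(type(plain_piece).__name__, hasattr(plain_piece, "tag"))
print(type(kept_piece).__name__, kept_piece.tag)
```

Code that constructs sub-frames with `pd.DataFrame(...)` directly behaves like `plain_piece` here, which is the kind of loss the commit message describes.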