PERF: faster indexing for non-fastpath groupby ops #34214

jbrockmendel · 2020-05-16T18:29:31Z

Per discussions about removing libreduction code, this is part of an effort to make the non-libreduction path more performant.

Performance comparisons are done by disabling fast_apply entirely and running the two most-affected asv benchmarks:

import numpy as np
from pandas import DataFrame

N = 10 ** 4
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)

%prun -s cumtime df.groupby(["key", "key2"]).apply(lambda x: 1)
PR -> 0.263 s
No optimization -> 0.308 s
master -> 0.039 s

%prun -s cumtime df.groupby("key").apply(lambda x: 1)
PR -> 0.083 s
No optimization -> 0.127 s
master -> 0.012 s
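For readers without IPython, the two measurements above can be approximated with the standard-library `timeit` module. This is only a sketch: the fixed seed, the `time_apply` helper, and the repeat count are assumptions added for illustration, and wall-clock timings will not match the `%prun` cumulative times quoted above because no profiler is attached.

```python
import timeit
import warnings

import numpy as np
from pandas import DataFrame

np.random.seed(0)  # assumption: fixed seed for repeatability, not in the original

N = 10 ** 4
df = DataFrame(
    {
        "key": np.random.randint(0, 2000, size=N),
        "key2": np.random.randint(0, 3, size=N),
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)


def time_apply(keys, repeat=3):
    """Hypothetical helper: best wall-clock time in seconds over `repeat` runs."""
    with warnings.catch_warnings():
        # Newer pandas versions may warn about apply() operating on the
        # grouping columns; silence that so the output stays readable.
        warnings.simplefilter("ignore")
        return min(
            timeit.repeat(
                lambda: df.groupby(keys).apply(lambda x: 1),
                number=1,
                repeat=repeat,
            )
        )


multi_key_time = time_apply(["key", "key2"])
single_key_time = time_apply("key")
print(f"two keys: {multi_key_time:.3f} s, one key: {single_key_time:.3f} s")
```

Taking the minimum over a few repeats reduces noise from other processes; absolute numbers depend on the pandas version and the machine.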

This bug is a regression in v1.1.0 and was introduced by the fix for GH-34214 in commit [6f065b]. The underlying cause is that the `*Splitter` classes do not use the `._constructor` property and do not call `__finalize__`. Please note that the method name passed to the `__finalize__` calls was my best guess, since documentation for that value has been hard to find.

[6f065b]: pandas-dev@6f065b6
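To illustrate why both `._constructor` and `__finalize__` matter for subclasses, here is a minimal sketch. The `TaggedFrame` class and its `tag` attribute are hypothetical, and the snippet does not reproduce the `*Splitter` internals; it only shows the general pattern of rebuilding a piece with the parent's constructor and then copying `_metadata` across.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass carrying one piece of metadata."""

    _metadata = ["tag"]

    @property
    def _constructor(self):
        # Tells pandas which class to use when building new objects
        # derived from this one.
        return TaggedFrame


df = TaggedFrame({"key": [1, 1, 2], "value": [10, 20, 30]})
df.tag = "sensor-a"

# A plain DataFrame standing in for one group's rows.
plain_piece = pd.DataFrame({"key": [1, 1], "value": [10, 20]})

# Step 1: go through _constructor so the piece has the subclass type.
piece = df._constructor(plain_piece)
tag_before = getattr(piece, "tag", None)  # metadata not copied yet

# Step 2: __finalize__ copies every name listed in _metadata from `df`.
piece = piece.__finalize__(df)

print(type(piece).__name__, tag_before, piece.tag)
```

Skipping step 1 yields a plain `DataFrame`, and skipping step 2 yields a subclass instance without its metadata, which matches the two omissions described above.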

jbrockmendel added 2 commits May 15, 2020 19:56

PERF: speedup non-fastpath in groupby ops

f97361b

PERF: make_block_same_class

f143ed2

jbrockmendel added the Groupby and Performance (Memory or execution speed performance) labels May 17, 2020

jreback added this to the 1.1 milestone May 17, 2020

jreback merged commit 6f065b6 into pandas-dev:master May 17, 2020

jbrockmendel deleted the slow-apply branch May 17, 2020 21:35

Japanuspus mentioned this pull request Aug 11, 2020

DataFrame.groupby doesn't preserve _metadata #29442

Closed

Japanuspus mentioned this pull request Aug 12, 2020

Fix GH-29442 DataFrame.groupby doesn't preserve _metadata #35688

Merged

5 tasks

simonjayhawkins mentioned this pull request Oct 29, 2020

BUG: groupby __iter__ on pandas 1.1.x not propagating _metadata on DataFrame subclasses #37343

Closed

3 tasks
