ENH: Allow numba aggregations to return non-float64 results #53444

lithomas1 · 2023-05-29T21:53:33Z

xref ENH: Allow numba aggregations to return non-float64 results #44952
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

asv_bench/benchmarks/groupby.py

pandas/core/_numba/executor.py

mroeschke · 2023-06-07T00:49:25Z

pandas/core/_numba/executor.py

+    return column_looper
+
+
+default_dtype_mapping: dict[np.dtype, Any] = {


Curious, could we not just define signatures for numba.jit to use when running the function?

We allocate arrays inside the function and need to pass a dtype there as well.

Not sure how to access the signature from inside the func.
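
To illustrate the reply above, here is a minimal sketch of why the result dtype has to be available inside the kernel and not only in a jit signature. The mapping and the names `RESULT_DTYPES` and `make_group_sum` are hypothetical and chosen for illustration; they are not the contents of the PR's `default_dtype_mapping`, and plain NumPy stands in for the numba-compiled kernel.

```python
import numpy as np

# Hypothetical input-dtype -> result-dtype mapping (illustration only,
# not the mapping defined in pandas/core/_numba/executor.py).
RESULT_DTYPES = {
    np.dtype("int32"): np.int64,
    np.dtype("int64"): np.int64,
    np.dtype("float32"): np.float64,
    np.dtype("float64"): np.float64,
}


def make_group_sum(result_dtype):
    """Build a sum kernel that allocates its output with ``result_dtype``."""

    def group_sum(values, starts, ends):
        # The output array is allocated inside the kernel, so the dtype
        # must be known here; a signature on the caller side is not enough.
        out = np.empty(len(starts), dtype=result_dtype)
        for i in range(len(starts)):
            out[i] = values[starts[i]:ends[i]].sum()
        return out

    return group_sum


values = np.array([1, 2, 3, 4], dtype=np.int32)
starts = np.array([0, 2])
ends = np.array([2, 4])

# Look up the result dtype from the input dtype, then build the kernel.
kernel = make_group_sum(RESULT_DTYPES[values.dtype])
result = kernel(values, starts, ends)
```

In this sketch the int32 input produces an int64 result instead of being forced to float64, which is the behavior the PR title describes.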

rhshadrach

Looks good; just some thoughts/suggestions

pandas/core/_numba/executor.py

rhshadrach · 2023-06-08T01:02:11Z

pandas/core/groupby/groupby.py

-        result = aggregator(sorted_data, starts, ends, 0, *aggregator_args)
+        result = sorted_df._mgr.apply(
+            aggregator, start=starts, end=ends, **aggregator_kwargs
+        )


This changes *aggregator_args -> **aggregator_kwargs, but within aggregator it is then used as *aggregator_kwargs. This is only used internally, right? I'm just wondering if we can make this less fragile somehow (changing the order of the kwargs might produce a bug, right?), but I'm not seeing a way.

Yeah, the current method is really sketchy, but it should be OK since UDFs take another path.
(The only args/kwargs that go through here are things like ddof for std/var.)

It's like this because BlockManager.apply only takes kwargs.
Is it fine to change that?
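
A small sketch of the ordering concern raised above, assuming the extras are forwarded positionally from a kwargs dict. `kernel` and `apply_kwargs_only` are hypothetical stand-ins, not pandas functions.

```python
def kernel(values, start, end, min_periods, ddof):
    """Stand-in kernel that only accepts its extras positionally."""
    return {"min_periods": min_periods, "ddof": ddof}


def apply_kwargs_only(func, values, start, end, **kwargs):
    """Hypothetical kwargs-only dispatcher that forwards extras by position."""
    # Dicts preserve insertion order, so the positional order seen by the
    # kernel is whatever keyword order the caller happened to use.
    return func(values, start, end, *kwargs.values())


# Keyword order matches the kernel's positional order: values land correctly.
good = apply_kwargs_only(kernel, [1.0, 2.0], [0], [2], min_periods=0, ddof=1)

# Same keywords in a different order: the values are silently swapped.
swapped = apply_kwargs_only(kernel, [1.0, 2.0], [0], [2], ddof=1, min_periods=0)
```

No error is raised in the second call, which is why reordering the kwargs at the call site could introduce a bug without any test failing loudly.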

pandas/core/_numba/executor.py

pandas/core/groupby/groupby.py

rhshadrach · 2023-06-08T01:32:45Z

pandas/core/window/rolling.py

@@ -646,10 +646,27 @@ def _numba_apply(
            step=self.step,
        )
        self._check_window_bounds(start, end, len(values))
+        # For now, map everything to float to match the Cython impl
+        # even though it is wrong
+        # TODO: Could preserve correct dtypes in future


Is there an issue for this?

#53214, I'll add it to the comment.

rhshadrach

lgtm

mroeschke · 2023-06-12T20:58:17Z

Could you also add a whatsnew note for 2.1?

lithomas1 · 2023-06-13T19:06:03Z

There seems to be some flakiness with the benchmarks I added.
(Wasn't able to get an error message, unfortunately.)

I'll let this sit for a couple of days then, but other than that it should be good to go.

EDIT: Root-caused; it was a timeout in the benchmarks.

lithomas1 · 2023-06-14T23:04:04Z

asv_bench/benchmarks/groupby.py

+            # because it re-uses the Window min/max kernel
+            # so it will time out ASVs
+            # "min",
+            # "max",


Disabled min/max because it's reaaaallllly sloooooow.
It takes 20s (as opposed to milliseconds for the other kernels) to run, and can sometimes time out the ASVs (causing flakiness).

Best guess is that the list operations are slowing it down. Snakeviz tells me most of the time (99%) is spent in the numba kernel, and I can't profile into there.

I'm planning on splitting groupby stuff from the Window numba kernels in the future, so hopefully this doesn't stay commented out for long.

…ev#53444)
* ENH: non float64 result support in numba groupby
* refactor & simplify
* fix CI
* maybe green?
* skip unsupported ops in other bench as well
* updates from code review
* remove commented code
* update whatsnew
* debug benchmarks
* Skip min/max benchmarks

lithomas1 added 2 commits May 23, 2023 06:53

ENH: non float64 result support in numba groupby

bcd93e0

refactor & simplify

e22d783

lithomas1 added the Groupby, Dtype Conversions (unexpected or buggy dtype conversions), Window (rolling, ewma, expanding), and numba (numba-accelerated operations) labels May 29, 2023

lithomas1 requested review from rhshadrach and mroeschke May 30, 2023 14:13

Merge branch 'main' into numba-overloads

5be4d9e

lithomas1 changed the title ~~WIP: ENH: Allow numba aggregations to return non-float64 results~~ ENH: Allow numba aggregations to return non-float64 results May 31, 2023

lithomas1 marked this pull request as ready for review May 31, 2023 22:57

lithomas1 added 4 commits June 5, 2023 13:58

fix CI

9f2f70d

Merge branch 'main' of https://github.com/pandas-dev/pandas into numb…

6f12756

…a-overloads

maybe green?

00ce652

skip unsupported ops in other bench as well

64ecaec

mroeschke reviewed Jun 7, 2023

asv_bench/benchmarks/groupby.py (outdated)

mroeschke reviewed Jun 7, 2023

pandas/core/_numba/executor.py

mroeschke reviewed Jun 7, 2023

rhshadrach reviewed Jun 8, 2023

lithomas1 added 4 commits June 9, 2023 11:51

updates from code review

4d58a47

Merge branch 'main' into numba-overloads

405a71c

remove commented code

c6d4ffe

Merge branch 'numba-overloads' of github.com:lithomas1/pandas into nu…

8f076e7

…mba-overloads

lithomas1 requested review from mroeschke and rhshadrach June 12, 2023 17:14

rhshadrach approved these changes Jun 12, 2023

mroeschke approved these changes Jun 12, 2023

mroeschke added this to the 2.1 milestone Jun 12, 2023

update whatsnew

d05ebdf

lithomas1 added 2 commits June 13, 2023 07:10

Merge branch 'main' into numba-overloads

5b4f7fc

debug benchmarks

e67bbeb

lithomas1 marked this pull request as draft June 13, 2023 16:52

Merge branch 'main' into numba-overloads

6f103ab

lithomas1 added 2 commits June 14, 2023 13:55

Merge branch 'main' into numba-overloads

6d75ce4

Skip min/max benchmarks

b0d22db

lithomas1 marked this pull request as ready for review June 14, 2023 23:00

lithomas1 commented Jun 14, 2023

lithomas1 merged commit 870a504 into pandas-dev:main Jun 15, 2023

lithomas1 deleted the numba-overloads branch June 15, 2023 02:08
