ENH: Allow numba aggregations to return non-float64 results #53444

Merged
merged 17 commits into from
Jun 15, 2023

Conversation

Member

@lithomas1 lithomas1 commented May 29, 2023

@lithomas1 lithomas1 added the Groupby, Dtype Conversions (unexpected or buggy dtype conversions), Window (rolling, ewma, expanding), and numba (numba-accelerated operations) labels May 29, 2023
@lithomas1 lithomas1 requested review from rhshadrach and mroeschke May 30, 2023 14:13
@lithomas1 lithomas1 changed the title WIP: ENH: Allow numba aggregations to return non-float64 results ENH: Allow numba aggregations to return non-float64 results May 31, 2023
@lithomas1 lithomas1 marked this pull request as ready for review May 31, 2023 22:57
return column_looper


default_dtype_mapping: dict[np.dtype, Any] = {
Member

Curious, could we not just define signatures for numba.jit to use when running the function?

Member Author

We allocate arrays inside the function and need to pass a dtype there as well.

Not sure how to access the signature from inside the func.
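The point about needing a dtype inside the function can be illustrated with a small sketch. This is not the code from the PR: the mapping contents, the name `RESULT_DTYPE`, and the `grouped_sum` kernel are hypothetical, and plain NumPy stands in for a numba-jitted kernel. It only shows why a lookup table keyed on the input dtype is convenient when the kernel allocates its own output buffer.

```python
import numpy as np

# Hypothetical input-dtype -> result-dtype table (illustrative values only,
# not the mapping defined in pandas/core/_numba/executor.py).
RESULT_DTYPE = {
    np.dtype("int32"): np.dtype("int64"),
    np.dtype("int64"): np.dtype("int64"),
    np.dtype("float32"): np.dtype("float64"),
    np.dtype("float64"): np.dtype("float64"),
}


def grouped_sum(values, starts, ends, result_dtype):
    """Sum each [start, end) slice; stands in for a jitted kernel."""
    # The output buffer is created inside the kernel, so the result dtype
    # has to be passed in explicitly.
    out = np.empty(len(starts), dtype=result_dtype)
    for i in range(len(starts)):
        out[i] = values[starts[i]:ends[i]].sum()
    return out


values = np.array([1, 2, 3, 4], dtype="int32")
starts = np.array([0, 2], dtype="int64")
ends = np.array([2, 4], dtype="int64")

# Look up the result dtype from the input dtype, then hand it to the kernel.
result = grouped_sum(values, starts, ends, RESULT_DTYPE[values.dtype])
```

Here `result` keeps an integer dtype instead of being forced to float64, which is the behaviour the PR title describes.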

Member

@rhshadrach rhshadrach left a comment

Looks good; just some thoughts/suggestions

pandas/core/_numba/executor.py (outdated)
Comment on lines 1370 to 1381
-        result = aggregator(sorted_data, starts, ends, 0, *aggregator_args)
+        result = sorted_df._mgr.apply(
+            aggregator, start=starts, end=ends, **aggregator_kwargs
+        )
Member

This changes *aggregator_args to **aggregator_kwargs, but inside aggregator the values are then unpacked positionally as *aggregator_kwargs. This is only used internally, right? I'm wondering whether we can make this less fragile somehow (changing the order of the kwargs could produce a bug, right?), but I'm not seeing a way.

Member Author

Yeah, the current method is really sketchy, but it should be OK since UDFs take another path
(the only args/kwargs that go through here are things like ddof for std/var).

It's like this because BlockManager.apply only takes kwargs.
Is it fine to change that?
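The fragility discussed in this thread can be shown with a minimal pure-Python sketch. Nothing here is pandas code: `var_kernel`, `apply_positional`, and `apply_by_name` are hypothetical stand-ins, assumed only to illustrate what happens when keyword arguments are collected as a dict and then forwarded positionally.

```python
def var_kernel(values, min_periods, ddof):
    """Toy variance kernel taking its options positionally."""
    n = len(values)
    if n < min_periods:
        return None
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - ddof)


def apply_positional(kernel, values, **kwargs):
    # Forwards the kwargs by position: correctness depends on the caller
    # passing them in the same order as the kernel's parameters.
    return kernel(values, *kwargs.values())


def apply_by_name(kernel, values, **kwargs):
    # Forwards the kwargs by name: order no longer matters.
    return kernel(values, **kwargs)


data = [1.0, 2.0, 3.0, 4.0]

expected = apply_positional(var_kernel, data, min_periods=2, ddof=1)
# Same keywords, different order: min_periods and ddof silently swap.
swapped = apply_positional(var_kernel, data, ddof=1, min_periods=2)
by_name = apply_by_name(var_kernel, data, ddof=1, min_periods=2)
```

With the swapped order the kernel divides by `n - 2` instead of `n - 1` and raises no error, which is the kind of silent bug the reviewer is worried about. Forwarding by name avoids it, but only if the kernel can accept keyword arguments.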

pandas/core/_numba/executor.py (resolved)
pandas/core/groupby/groupby.py (outdated)
pandas/core/groupby/groupby.py (outdated)
@@ -646,10 +646,27 @@ def _numba_apply(
step=self.step,
)
self._check_window_bounds(start, end, len(values))
# For now, map everything to float to match the Cython impl
# even though it is wrong
# TODO: Could preserve correct dtypes in future
Member

Is there an issue for this?

Member Author

#53214, I'll add it to the comment.
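The code comment in this thread calls the cast to float "wrong" without saying why. One possible consequence, offered here as an illustration and not as something stated in the thread, is that int64 values above 2**53 cannot be represented exactly in float64. A minimal NumPy sketch:

```python
import numpy as np

big = 2**53 + 1  # smallest positive integer that float64 cannot represent
values = np.array([big, big], dtype="int64")

# Casting to float64 first (as the window path is described as doing)
# rounds the value before any aggregation runs.
as_float = values.astype("float64")

exact_max = values.max()    # computed on the original int64 data
float_max = as_float.max()  # computed on the float64 copy
```

Here `exact_max` equals `big`, while `float_max` has been rounded to a neighbouring representable value.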

Member

@rhshadrach rhshadrach left a comment

lgtm

@mroeschke mroeschke added this to the 2.1 milestone Jun 12, 2023
@mroeschke
Member

Could you also add a whatsnew note for 2.1?

@lithomas1 lithomas1 marked this pull request as draft June 13, 2023 16:52
@lithomas1
Member Author

lithomas1 commented Jun 13, 2023

There seems to be some flakiness with the benchmarks I added.
(Wasn't able to get an error message unfortunately).

I'll let this sit for a couple of days then, but other than that it should be good to go.

EDIT: Root-caused; it was a timeout in the benchmarks.

@lithomas1 lithomas1 marked this pull request as ready for review June 14, 2023 23:00
# because it re-uses the Window min/max kernel
# so it will time out ASVs
# "min",
# "max",
Member Author

Disabled min/max because it's really slow.
It takes 20s to run (as opposed to milliseconds for the other kernels) and can sometimes time out the ASVs, causing flakiness.

My best guess is that the list operations are slowing it down. Snakeviz tells me that most of the time (99%) is spent in the numba kernel, and I can't profile into there.

I'm planning on splitting the groupby stuff from the Window numba kernels in the future, so hopefully this doesn't stay commented out for long.
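For readers unfamiliar with how a method gets left out of a parametrized ASV benchmark, here is a rough sketch. It is not the benchmark from this PR: the class name, the data, and the use of `sum` and `statistics` functions in place of numba kernels are all assumptions. Only the `params`, `param_names`, and `timeout` attributes and the `setup` / `time_*` method convention come from ASV itself.

```python
import statistics


class Aggregation:
    # Hypothetical ASV-style benchmark. "min" and "max" are left out of the
    # parameter list, mirroring the commented-out entries in the diff above.
    params = ["sum", "mean", "var"]
    param_names = ["method"]
    # ASV gives each benchmark this many seconds before treating it as failed.
    timeout = 120

    def setup(self, method):
        self.data = list(range(1_000))
        # Stand-ins for the real kernels; pure Python to keep this runnable.
        self.func = {
            "sum": sum,
            "mean": statistics.fmean,
            "var": statistics.variance,
        }[method]

    def time_aggregation(self, method):
        self.func(self.data)


# ASV would drive the class itself; this loop runs it by hand to check that
# every remaining parameter sets up and executes.
bench = Aggregation()
ran = []
for method in Aggregation.params:
    bench.setup(method)
    bench.time_aggregation(method)
    ran.append(method)
```

Dropping a slow parameter from `params` keeps the rest of the suite stable, at the cost of losing coverage for that method until it is re-enabled.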

@lithomas1 lithomas1 merged commit 870a504 into pandas-dev:main Jun 15, 2023
@lithomas1 lithomas1 deleted the numba-overloads branch June 15, 2023 02:08
canthonyscott pushed a commit to canthonyscott/pandas-anthony that referenced this pull request Jun 23, 2023
…ev#53444)

* ENH: non float64 result support in numba groupby

* refactor & simplify

* fix CI

* maybe green?

* skip unsupported ops in other bench as well

* updates from code review

* remove commented code

* update whatsnew

* debug benchmarks

* Skip min/max benchmarks
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023