Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix some cases of groupby(...).transform with dropna=True #45953

Merged
merged 19 commits into from
Feb 27, 2022

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Feb 12, 2022

Part of #17093

Plan to expand the whatsnew note as more parts of #45839 are merged.

Ran groupby ASVs; groupby.Nth.time_frame_nth showed a performance decrease, then increase, then twice not significantly changed. No other ASVs had significant changes.

@rhshadrach rhshadrach added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Apply Apply, Aggregate, Transform, Map labels Feb 12, 2022
…ndas into transform_dropna_1

� Conflicts:
�	pandas/core/groupby/groupby.py
�	pandas/core/groupby/ops.py
@rhshadrach rhshadrach marked this pull request as draft February 13, 2022 16:28
@rhshadrach rhshadrach marked this pull request as ready for review February 13, 2022 18:04
Get the original integer locations of result_index in the input.
"""
# Original indices are where group_index would go via sorting.
# But when dropna is true, we need to remove null values whilst accounting for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"whilst" may be tough for non-native speakers

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - changed to while

@@ -193,10 +191,6 @@ class TestPrinting(base.BasePrintingTests):

class TestGroupBy(base.BaseGroupbyTests):
def test_groupby_extension_transform(self, data_for_grouping, request):
if data_for_grouping.dtype.storage == "pyarrow" and pa_version_under2p0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh surprised that this solves it. neat.

@@ -57,6 +57,39 @@ Styler

- Fixed bug in :class:`CSSToExcelConverter` leading to ``TypeError`` when border color provided without border style for ``xlsxwriter`` engine (:issue:`42276`)

.. _whatsnew_150.notable_bug_fixes.groupby_transform_dropna:

Using ``dropna=True`` with ``groupby`` transformers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we use "transformer" elsewhere? Could use "transform"/"transforms"? Is there a risk of confusion for ML people for whom "transformer" means something else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do mention transformer once in the docs, but agreed with sticking to transform here.


A transformer is an operation whose result has the same size as its input. When the
result is a :class:`DataFrame` or :class:`Series`, it is also required that the
index of the result is equal to that of the input. In pandas 1.4, using
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"is equal" -> "matches"?

I think there was a recent issue with something like df.groupby(foo).transform(DataFrame.sort_index) where there was some question as to whether the index should be retained. Did that get sorted out? (Or is it not relevant here?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It didn't yet get sorted out. In fact, I went to improve the docs because of that issue, and found that dropna was not well supported for transforms. Wanted to first improve this before returning to the docs.

I agree matches is better than equal here - and plan to clarify that groupby(...).transform will align when possible.

@rhshadrach
Copy link
Member Author

Thanks @jbrockmendel - changes made and green (so far, at least)

@jreback jreback added this to the 1.5 milestone Feb 27, 2022
@jreback jreback merged commit 1efa4fb into pandas-dev:main Feb 27, 2022
@jreback
Copy link
Contributor

jreback commented Feb 27, 2022

thanks @rhshadrach very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants