Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: User-control of result keys in GroupBy.apply #34998

Merged
merged 110 commits into from
Mar 30, 2022

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Jun 25, 2020

Currently we determine whether to include the group keys as a level of the result's MultiIndex by looking at whether the result's index matches the original DataFrame. This provides a keyword to control that behavior so that users can consistently get the group keys or not, regardless of whether the udf happens to be a transform or not. The default behavior remains the same.

I called the keyword result_group_keys, but would welcome alternatives.

Closes #34809
Closes #31612
Closes #14927
Closes #13056
Closes #27212
Closes #9704

cc @WillAyd and @jorisvandenbossche from the issue.

@TomAugspurger TomAugspurger added this to the 1.1 milestone Jun 25, 2020
@WillAyd
Copy link
Member

WillAyd commented Jun 25, 2020

Is it possible to leverage the existing as_index keyword instead of introducing a new one? I've never been quite clear on what the former should do, so maybe it should cover at least part of this

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Copy link
Contributor

jreback commented Jun 25, 2020

adding more keywords is a huge -1 here. Ok if you deprecate and remove one or more, but adding more to an already really complex operation is sure to cause way more bugs.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jun 26, 2020

The documentation of group_keys isn't clear. Is that adding the groups to the inputs before the apply or the outputs after the apply? If it's intended for the outputs I can resuse it, but currently it seemingly has no effect for this usecase:

In [10]: def f(x):
    ...:     return x.copy()  # same index

In [11]: def g(x):
    ...:     return x.copy().rename(lambda x: x + 1)  # different index

In [12]: df = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})


In [4]: df.groupby("A", group_keys=True).apply(f)
Out[4]:
   A  B
0  a  1
1  b  2

In [5]: df.groupby("A", group_keys=True).apply(g)
Out[5]:
     A  B
A
a 1  a  1
b 2  b  2

In [6]: df.groupby("A", group_keys=False).apply(f)
Out[6]:
   A  B
0  a  1
1  b  2

In [7]: df.groupby("A", group_keys=False).apply(g)
Out[7]:
   A  B
1  a  1
2  b  2

@TomAugspurger
Copy link
Contributor Author

group_keys does kick in here, when the applied function returns a subset of the dataframe..

In [8]: df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})
   ...: df.groupby("key", group_keys=True).apply(lambda x: x[:2])
   ...:
   ...:
   ...:
Out[8]:
       key  value
key
1   0    1      0
    1    1      1
2   3    2      3
    4    2      4
3   6    3      6
    7    3      7

In [9]: df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})
   ...: df.groupby("key", group_keys=False).apply(lambda x: x[:2])
   ...:
   ...:
   ...:
   ...:
Out[9]:
   key  value
0    1      0
1    1      1
3    2      3
4    2      4
6    3      6
7    3      7

So it seems the intent is similar to what we want. However I don't know about the defaults. group_keys defaults to true. We'll need to see whether that's appropriate for this case.

@jreback
Copy link
Contributor

jreback commented Jun 26, 2020

might be ok to add result_type keyword similar to what we use in apply but must depreciate group_keys then

@TomAugspurger
Copy link
Contributor Author

Well, 50 failures in tests/groupby when I just treat group_keys like you're suggesting (adding the group keys to the output). I'll see what I can do.

@TomAugspurger
Copy link
Contributor Author

It does seem like the intent of group_keys is to control this sort of behavior. We just didn't apply it consistently for a few reasons.

  1. It just wasn't used in _GroupBy._python_apply_general.
  2. It's dropped when slicing a DataFrameGroupBy object
In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})

In [3]: gb = df.groupby("A", group_keys=False)

In [4]: gb.group_keys
Out[4]: False

In [5]: gb[['B']].group_keys
Out[5]: True

In [6]: gb['B'].group_keys
Out[6]: True

I'll push a commit later with some updates to always honor group keys, but my initial reaction is that this is too large of a change. I'm sure there are people relying on the buggy behavior, so we'll need to do this with a deprecation.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jun 29, 2020

OK, tests should hopefully pass, but this isn't mergeable yet because of the behavior change. Namely honoring group_keys changes the output shape of "transform-like" .apply:

# 1.0.4
In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})

In [3]: df.groupby("A").apply(lambda x: x)
Out[3]:
   A  B
0  0  1
1  0  2
2  1  3
# PR
In [3]: df.groupby("A").apply(lambda x: x)
Out[3]:
     A  B
A
0 0  0  1
  1  0  2
1 2  1  3

My next commit will have compatibility layer. We'll change the default group_keys to None and warn if we get to a case where group_keys=None and we've detected a transform-like apply.

Note that this doesn't help people who were already doing .groupby(..., group_keys=True).apply(lambda x: x), but I think that's unavoidable. I think we just call that buggy behavior that's been fixed.

@TomAugspurger
Copy link
Contributor Author

Here's the warning

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({"A": [0, 0, 1], "B": [1, 2, 3]})

In [3]: df.groupby("A").apply(lambda x: x)
/Users/taugspurger/.virtualenvs/pandas-dev/bin/ipython:1: FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
To preserve the previous behavior, use

        >>> .groupby(..., group_keys=False)

To adopt the future behavior and silence this warning, use

        >>> .groupby(..., group_keys=True)
  #!/Users/taugspurger/Envs/pandas-dev/bin/python
Out[3]:
   A  B
0  0  1
1  0  2
2  1  3

@WillAyd
Copy link
Member

WillAyd commented Jun 29, 2020

On board with the direction here; I was initially thinking through if the default should be True or False going forward but I think the way you have it is nice if can get green

@jreback
Copy link
Contributor

jreback commented Feb 20, 2022

will look again soon

@rhshadrach
Copy link
Member

@jreback ping

@rhshadrach
Copy link
Member

@jreback - ping

@jreback jreback added this to the 1.5 milestone Mar 30, 2022
@jreback jreback merged commit efb262f into pandas-dev:main Mar 30, 2022
@jreback
Copy link
Contributor

jreback commented Mar 30, 2022

thanks @rhshadrach and @TomAugspurger nice to have the consistency!

and of course let the bug reports roll in......

@TomAugspurger
Copy link
Contributor Author

Really glad to see this in. Great work @rhshadrach!

@TomAugspurger TomAugspurger deleted the 34809-result-type branch March 30, 2022 15:45
@rhshadrach rhshadrach mentioned this pull request Jul 12, 2022
9 tasks
yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022
@jbrockmendel
Copy link
Member

@TomAugspurger this is causing warnings in test_multiindex_group_all_columns_when_empty, can you update the affected tests to avoid/catch these warnings?

@rhshadrach
Copy link
Member

Thanks for finding these @jbrockmendel - this will be addressed in #48159

rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Sep 8, 2022
- [x] This PR adds support for `group_keys` in `groupby`. Starting pandas 1.5.0, issues around `group_keys` have been resolved:

pandas-dev/pandas#34998
pandas-dev/pandas#47185

- [x] This PR defaults `group_keys` to `False` which is the same as what pandas is going to be defaulting to in the future version.
- [x] Required to unblock `pandas-1.5.0` upgrade in cudf: #11617

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Ashwin Srinath (https://github.com/shwina)

URL: #11659
@jbrockmendel
Copy link
Member

@TomAugspurger want to make a PR enforcing this deprecation?

@rhshadrach
Copy link
Member

@TomAugspurger - more than happy if you want to take this, otherwise I can take it.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 29, 2022 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment