BUG: inconsistent groupby.apply behaviour depending on column dtypes #43206
Comments
Thanks @od-crypto for the report.
That was the result on 1.2.5. On master, the second case now has the same index as the other two cases, a regular index. Not sure without investigating further which is correct or whether the changes were intentional. Will label as a regression for now, pending further investigation. |
For the 1st case, the first bad commit is [a3bb751] REF: back DatetimeBlock, TimedeltaBlock by DTA/TDA (#40456), so the change was probably not intentional. @jbrockmendel is #42921 (comment) relevant here? |
Looks like this is driven by a difference in the cython vs python code paths, likely involving how |
Porting the mutation check from the now-deleted apply_frame_axis_0 into BaseGrouper.apply seems to fix this. Need to run the rest of the tests. |
Looks like that broke 7 tests. @od-crypto any interest in trying to track this down? |
OK, some progress here. within each group, each of the columns is already sorted, so the sort_values being applied doesn't do any reindexing, so it evaluates @od-crypto what does the [0, 1] index level in |
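To illustrate the point about already-sorted groups, here is a minimal sketch. The DataFrame is hypothetical (the original MWE is not reproduced in this thread) and `is_like_indexed` is a helper name made up for the illustration. It only checks whether sort_values leaves each group's index unchanged, which is the condition under which the applied function looks transform-like.

```python
import pandas as pd

# Hypothetical data: 'str_val' is already ascending within each 'uid' group.
df = pd.DataFrame({"uid": [1, 2, 1, 2], "str_val": ["a", "a", "b", "b"]})


def is_like_indexed(group: pd.DataFrame) -> bool:
    """Return True if sorting the group leaves its index unchanged."""
    return group.sort_values(by="str_val").index.equals(group.index)


# Sorted input: sort_values is a no-op on every group.
sorted_input = {uid: is_like_indexed(g) for uid, g in df.groupby("uid")}

# Reversed input: 'str_val' is now descending within each group,
# so sort_values reorders the rows and the index no longer matches.
reversed_df = df.iloc[::-1]
unsorted_input = {uid: is_like_indexed(g) for uid, g in reversed_df.groupby("uid")}

print(sorted_input)
print(unsorted_input)
```

This would be consistent with the later observation in the thread that an unsorted input gives the df2 behaviour in all three cases.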
@jbrockmendel which behaviour to choose, df1/df3 or df2, is a secondary question; what matters is that it is consistent in the end.
|
@jbrockmendel answering your question above: the [0, 1] index level in df2 seems to correspond to the index of the 'uid' group, sorted by 'uid'. |
I agree. On master it is currently consistent, but with the result that you said in the OP seems strange. I'm trying to understand what the first level in the MultiIndex result corresponds to. Can you help me understand this? |
But to be clear, on released pandas it is not. So if we decide that the new behavior is correct and is a bugfix, not an API-breaking change, we may want to consider backporting #42992 |
@jbrockmendel, by "df1/df3 seem strange" I meant strange within the same pandas release, 1.3.2: when one feeds the above reported code with an unsorted input df, one obtains the df2 behaviour for all three cases. The first level in the MultiIndex of df2 seems to "correspond to the index of the 'uid' group, sorted by 'uid'." However, if choosing between df1/df3 and df2 for the next release, one might consider voting for df1/df3, as backward compatibility is always a weighty argument... |
The first level in df2.index in the OP is |
changing milestone to 1.3.5 |
This is on master, so it does appear that this invalidates the consistency argument for master, so we can probably rule out backporting #42992 #43206 (comment). (This actually fixed another regression, #41999, but was not backported.) So it looks like we effectively now have two regressions.
|
I've added the blocker for rc label to increase priority if not fixed for 1.3.5 |
changing milestone |
we cannot block on this w/o a PR for the rc. |
@rhshadrach @jbrockmendel can you summarise the issue & what we need to decide here |
I agree with previous comments that in the OP, the result for The core issue here is A potential secondary issue is the resulting index containing the values 0 and 1 (ref: #43206 (comment)). This looked odd at first glance to me as well, but I believe it is correct behavior. With |
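A small sketch of why the 0 and 1 in the first index level are plausible. The data is hypothetical and deliberately uses uid values 10 and 20, so that group positions cannot be confused with group keys. This assumes a pandas version where an explicit group_keys=True is honoured for transform-like functions (1.5 or later, as far as I can tell).

```python
import pandas as pd

# Hypothetical data: uid values 10 and 20 differ from the positions 0 and 1.
df = pd.DataFrame({"uid": [10, 20, 10, 20], "str_val": ["a", "a", "b", "b"]})

result = (
    df.groupby("uid", as_index=False, group_keys=True)[["str_val"]]
    .apply(lambda g: g.sort_values(by="str_val"))
)

# With as_index=False the group keys are not used as index labels,
# so the first level enumerates the groups and the second level
# carries the original row labels.
first_level = result.index.get_level_values(0).tolist()
second_level = result.index.get_level_values(1).tolist()

print(result)
```

If this holds on your version, the first level is the position of each group in sorted key order rather than the uid itself, which matches the MultiIndex outputs printed further down the thread.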
This is a hard problem, my only real opinion here is that reinstating libreduction.apply_frame_axis0 (which git blame suggests would fix this) would cause more problems than it solves. |
moving to 1.5 we cannot wait on this. |
@jbrockmendel Are you thinking #34998 wouldn't fix? Agreed on not reintroducing apply_frame_axis0. |
Haven't looked at it closely. It's plausible. |
Currently, running
df1 = df.groupby('uid', as_index=False)[['uid', 'str_val', 'date_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
df2 = df.groupby('uid', as_index=False)[['uid', 'str_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
df3 = df.groupby('uid', as_index=False)[['uid', 'date_val']].apply(lambda x: x.sort_values(by='date_val', ascending=True))
produces
FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.
and results in:
>>> print(df1)
   uid              str_val   date_val
0    1  2017-01-01 00:00:00 2017-01-01
1    2  2017-01-01 00:00:00 2017-01-01
2    1  2017-02-01 00:00:00 2017-02-01
3    2  2017-02-01 00:00:00 2017-02-01
>>> print(df2)
   uid              str_val
0    1  2017-01-01 00:00:00
1    2  2017-01-01 00:00:00
2    1  2017-02-01 00:00:00
3    2  2017-02-01 00:00:00
>>> print(df3)
   uid   date_val
0    1 2017-01-01
1    2 2017-01-01
2    1 2017-02-01
3    2 2017-02-01
So the behaviour is now consistent, and the inconsistency was the bug. I think this issue can be closed now. |
However, the default behaviour will change in the future:
>>> df1 = df.groupby('uid', as_index=False, group_keys=True)[['uid', 'str_val', 'date_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
>>> df2 = df.groupby('uid', as_index=False, group_keys=True)[['uid', 'str_val']].apply(lambda x: x.sort_values(by='str_val', ascending=True))
>>> df3 = df.groupby('uid', as_index=False, group_keys=True)[['uid', 'date_val']].apply(lambda x: x.sort_values(by='date_val', ascending=True))
>>> print(df1)
     uid              str_val   date_val
0 0    1  2017-01-01 00:00:00 2017-01-01
  2    1  2017-02-01 00:00:00 2017-02-01
1 1    2  2017-01-01 00:00:00 2017-01-01
  3    2  2017-02-01 00:00:00 2017-02-01
>>> print(df2)
     uid              str_val
0 0    1  2017-01-01 00:00:00
  2    1  2017-02-01 00:00:00
1 1    2  2017-01-01 00:00:00
  3    2  2017-02-01 00:00:00
>>> print(df3)
     uid   date_val
0 0    1 2017-01-01
  2    1 2017-02-01
1 1    2 2017-01-01
  3    2 2017-02-01
|
Please check @rhshadrach |
Thanks @NumberPiOso - agreed. This behavior can now be controlled by the user, is consistent, and will have the expected output in pandas 2.0. Closing. |
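For readers landing here later, a minimal sketch of controlling the index shape explicitly with group_keys. The DataFrame is hypothetical (the OP's MWE is not shown in this thread), and the behaviour described assumes pandas 1.5 or later, where an explicit group_keys is respected for transform-like functions.

```python
import pandas as pd

# Hypothetical data with a string column and a Timestamp column,
# mirroring the column names used in this issue.
df = pd.DataFrame(
    {
        "uid": [1, 2, 1, 2],
        "str_val": ["a", "a", "b", "b"],
        "date_val": pd.to_datetime(
            ["2017-01-01", "2017-01-01", "2017-02-01", "2017-02-01"]
        ),
    }
)


def sort_group(group: pd.DataFrame) -> pd.DataFrame:
    """Sort one group by 'str_val' (a no-op here, the data is pre-sorted)."""
    return group.sort_values(by="str_val")


cols = ["str_val", "date_val"]

# group_keys=False: the result keeps the original flat index.
flat = df.groupby("uid", group_keys=False)[cols].apply(sort_group)

# group_keys=True: the group key is prepended as an outer index level.
keyed = df.groupby("uid", group_keys=True)[cols].apply(sort_group)

print(flat)
print(keyed)
```

Passing group_keys explicitly, rather than relying on the default, should give the same index shape regardless of column dtypes or of whether the input happens to be sorted.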
I see strange groupby.apply behaviour: the resulting index depends on the existence of the Timestamp dtype column.
MWE:
Output:
I expect the index to be like in the second output (df2). Indexes in df1 and df3 seem strange.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 5f648bf
python : 3.8.0.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
Version : Darwin Kernel Version 18.7.0: Mon Apr 27 20:09:39 PDT 2020; root:xnu-4903.278.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.3.2
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.2
setuptools : 52.0.0.post20210125
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None