Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrameGroupBy.__getitem__ fails to propagate dropna #35078

Merged

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented Jul 1, 2020

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add test(s)?

pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
@WillAyd WillAyd added the Groupby label Jul 1, 2020
@arw2019
Copy link
Member Author

arw2019 commented Jul 7, 2020

A note - in addition to the missing values handling, his PR also fixes a version of #14466 in the SeriesGroupBy version of transform

I solved it similarly to #12559

@pep8speaks
Copy link

pep8speaks commented Jul 9, 2020

Hello @arw2019! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-07 19:26:30 UTC

@arw2019 arw2019 requested a review from WillAyd July 9, 2020 19:42
@arw2019
Copy link
Member Author

arw2019 commented Jul 9, 2020

Ready to go modulo any comments

@arw2019
Copy link
Member Author

arw2019 commented Jul 16, 2020

@TomAugspurger what do you think? This aims to resolve the SeriesGroupBy.__getitem__ bug you pointed out a few weeks ago

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to make the simplest change possible here - so it means make it at a lower level

@@ -548,8 +548,10 @@ def _transform_general(
# we will only try to coerce the result type if
# we have a numeric dtype, as these are *always* user-defined funcs
# the cython take a different path (and casting)
# make sure we don't accidentally upcast (GH35014)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this related?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this change equivalent results for SeriesGroupBy andDataFrameGroupBy are cast differently

In [2]: df = pd.DataFrame({"A": [0, 0, 1, None], "B": [1, 2, 3, None]})
In [3]: gb = df.groupby("A", dropna=False)
In [4]: gb[['B']].transform(len)
Out[4]: 
   B
0  2
1  2
2  1
3  1
In[5]: gb['B'].transform(len)
Out[5]: 
0    2.0
1    2.0
2    1.0
3    1.0
Name: B, dtype: float64

I tracked this down to SeriesGroupBy._selected_obj which for some reason upcasts:

In [9]: gb['B']._selected_obj                                                                                         
Out[9]: 
0    1.0
1    2.0
2    3.0
3    NaN
Name: B, dtype: float64

@@ -624,7 +625,10 @@ def _get_index(self, name):
"""
Safe get index, translate keys for datelike to underlying repr.
"""
return self._get_indices([name])[0]
if isna(name):
return self._get_indices([pd.NaT])[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we would want _get_indices to handle a null rather than this way

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok! moved this

@jreback
Copy link
Contributor

jreback commented Aug 3, 2020

I think #35444 is a more general soln here.

@rhshadrach
Copy link
Member

@jreback: Unfortunately my PR is not sufficient here. The root issue lies with the use of a dictionary for self.indices within the _GroupBy class. Trying to make a key be None (or np.nan or pd.NaT) causes issues. AFAICT, dropna is never used once the GroupBy object is created.

I think this isn't an issue with propagating dropna, but rather with the SeriesGroupBy transform function itself. Somehow DataFrameGroupBy avoids these issues, although I need to spend more time looking through the code to understand how.

pandas/core/groupby/grouper.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Aug 5, 2020

you are making a lot of changes here, pls try to simplify.

@arw2019
Copy link
Member Author

arw2019 commented Aug 5, 2020

you are making a lot of changes here, pls try to simplify.

@jreback Ok!

I redid the solution by copying the logic in DataFrameGroupBy._transform_general which is very similar & avoids this problem (thanks @rhshadrach for the suggestion)

@arw2019 arw2019 force-pushed the groupby-getitem-dropna-doesnt-propagate branch 2 times, most recently from 516d474 to fa2d90a Compare August 7, 2020 01:30
@arw2019 arw2019 force-pushed the groupby-getitem-dropna-doesnt-propagate branch from fa2d90a to 9b536dd Compare August 7, 2020 01:32
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Show resolved Hide resolved
pandas/tests/groupby/test_groupby_dropna.py Outdated Show resolved Hide resolved
@jreback jreback added this to the 1.2 milestone Aug 7, 2020
@jreback jreback merged commit f194094 into pandas-dev:master Aug 7, 2020
@jreback
Copy link
Contributor

jreback commented Aug 7, 2020

thanks @arw2019

@arw2019
Copy link
Member Author

arw2019 commented Aug 7, 2020

thanks @jreback for reviewing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrameGroupBy.__getitem__ fails to propagate dropna
5 participants