BUG: DataFrameGroupBy.__getitem__ fails to propagate dropna #35078
Can you add test(s)?
Ready to go modulo any comments
@TomAugspurger what do you think? This aims to resolve the |
we want to make the simplest change possible here - so it means make it at a lower level
pandas/core/groupby/generic.py
Outdated
@@ -548,8 +548,10 @@ def _transform_general(
# we will only try to coerce the result type if
# we have a numeric dtype, as these are *always* user-defined funcs
# the cython take a different path (and casting)
# make sure we don't accidentally upcast (GH35014)
how is this related?
without this change equivalent results for SeriesGroupBy and DataFrameGroupBy are cast differently
In [2]: df = pd.DataFrame({"A": [0, 0, 1, None], "B": [1, 2, 3, None]})
In [3]: gb = df.groupby("A", dropna=False)
In [4]: gb[['B']].transform(len)
Out[4]:
B
0 2
1 2
2 1
3 1
In [5]: gb['B'].transform(len)
Out[5]:
0 2.0
1 2.0
2 1.0
3 1.0
Name: B, dtype: float64
I tracked this down to SeriesGroupBy._selected_obj, which for some reason upcasts:
In [9]: gb['B']._selected_obj
Out[9]:
0 1.0
1 2.0
2 3.0
3 NaN
Name: B, dtype: float64
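For reference, the propagation named in the PR title can be checked on the same frame. This is only a minimal sketch, not the test added in this PR: `group_count` is a hypothetical helper introduced here, and the check assumes a pandas version where `groupby(..., dropna=False)` is available and the fix is in place.

```python
import pandas as pd

# Same frame as in the session above: one row has a null grouping key.
df = pd.DataFrame({"A": [0, 0, 1, None], "B": [1, 2, 3, None]})
gb = df.groupby("A", dropna=False)


def group_count(grouped):
    """Hypothetical helper: number of groups reported by .size()."""
    return len(grouped.size())


# If dropna=False propagates through __getitem__, all three selections
# should agree on the number of groups, including the null-key group.
counts = {
    "frame": group_count(gb),
    "column_list": group_count(gb[["B"]]),
    "single_column": group_count(gb["B"]),
}
print(counts)
```

If either column selection reports fewer groups than the parent groupby, the null-key group was dropped during selection.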
pandas/core/groupby/groupby.py
Outdated
@@ -624,7 +625,10 @@ def _get_index(self, name):
"""
Safe get index, translate keys for datelike to underlying repr.
"""
return self._get_indices([name])[0]
if isna(name):
return self._get_indices([pd.NaT])[0]
we would want _get_indices to handle a null rather than this way
ok! moved this
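One reason a null group key needs explicit handling in a dictionary-backed lookup can be sketched in plain Python. This is an illustration only and does not reproduce pandas' `_get_indices`; `indices` and `get_index` are hypothetical names made up for the example.

```python
import math

# Hypothetical mapping from group key to row positions, with a NaN key.
indices = {0.0: [0, 1], 1.0: [2], float("nan"): [3]}

# A freshly created NaN is a different object and NaN != NaN,
# so an ordinary dict lookup does not find the NaN key.
lookup_key = float("nan")
direct_hit = lookup_key in indices


def get_index(name, indices):
    """Hypothetical helper: look up a key, matching any NaN to the NaN key."""
    if isinstance(name, float) and math.isnan(name):
        for key, rows in indices.items():
            if isinstance(key, float) and math.isnan(key):
                return rows
        raise KeyError(name)
    return indices[name]


nan_rows = get_index(float("nan"), indices)
one_rows = get_index(1.0, indices)
print(direct_hit, nan_rows, one_rows)
```

Handling the null inside the lookup itself, rather than at each call site, keeps the special case in one place, which is the direction the review comment asks for.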
I think #35444 is a more general soln here.
@jreback: Unfortunately my PR is not sufficient here. The root issue lies with the use of a dictionary for I think this isn't an issue with propagating |
you are making a lot of changes here, pls try to simplify.
@jreback Ok! I redid the solution by copying the logic in |
516d474 to fa2d90a
fa2d90a to 9b536dd
…ropna-doesnt-propagate
thanks @arw2019
thanks @jreback for reviewing
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff