-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add dropna in groupby to allow NaN in keys #30584
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you show some examples of using this (in the top of the PR). (ultimately these would need to become doc-examples).
doc/source/whatsnew/v1.0.0.rst
Outdated
@@ -549,6 +549,7 @@ Other API changes | |||
Supplying anything else than ``how`` to ``**kwargs`` raised a ``TypeError`` previously (:issue:`29388`) | |||
- When testing pandas, the new minimum required version of pytest is 5.0.1 (:issue:`29664`) | |||
- :meth:`Series.str.__iter__` was deprecated and will be removed in future releases (:issue:`28277`). | |||
- :meth:`DataFrame.groupby` and :meth:`Series.groupby` have gained ``dropna`` argument in order to allow ``NaN`` values in group keys (:issue:`3729`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would need a subsection, this is a major new feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And i also updated on top in description.
pandas/core/algorithms.py
Outdated
sort: bool = False, | ||
na_sentinel: int = -1, | ||
size_hint: Optional[int] = None, | ||
dropna: Optional[bool] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be just bool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to bool
pandas/core/algorithms.py
Outdated
@@ -630,6 +634,9 @@ def factorize( | |||
uniques, codes = safe_sort( | |||
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False | |||
) | |||
if dropna is False and (codes == na_sentinel).any(): | |||
uniques = np.append(uniques, [np.nan]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally we push this down to cython, but ok here
pandas/core/algorithms.py
Outdated
@@ -630,6 +634,9 @@ def factorize( | |||
uniques, codes = safe_sort( | |||
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False | |||
) | |||
if dropna is False and (codes == na_sentinel).any(): | |||
uniques = np.append(uniques, [np.nan]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this should be a dtype appropriate for the dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think dtype
is defined below in _construct_data
? also, if added, for categorical values, will get ValueError: Categorial categories cannot be null
error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right so instead of adding a null to the categories like you are doing, you just add the appropriate -1 entries in the codes which automatically handle the nullness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emm, this has to be a None
or NA
value in there. otherwise, output of uniques does not have na value, and also, my python will crash with FatalError
somehow
pandas/tests/groupby/test_groupby.py
Outdated
@@ -2025,3 +2025,146 @@ def test_groupby_crash_on_nunique(axis): | |||
expected = expected.T | |||
|
|||
tm.assert_frame_equal(result, expected) | |||
|
|||
|
|||
@pytest.mark.parametrize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make a new file, test_groupby_dropna
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, moved to test_groupby_dropna
pandas/tests/groupby/test_groupby.py
Outdated
), | ||
], | ||
) | ||
def test_groupby_dropna_multi_index_dataframe_agg(dropna, tuples, outputs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add an example that uses NaT (datetime & timedelta)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added! see in test_groupby_dropna
pandas/tests/groupby/test_groupby.py
Outdated
["A", None, 12.3, 233.0, 12], | ||
["B", "A", 123.23, 123, 1], | ||
["A", "B", 1, 1, 1.0], | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we have NaN in 2 groups? does this work (e.g. different 1st level, but NaN) in 2nd. Also how is nan in first level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a lot more different scenarios!
doc/source/whatsnew/v1.0.0.rst
Outdated
We've added a ``dropna`` keyword to :meth:`DataFrame.groupby` and :meth:`Series.groupby` in order to | ||
allow ``NaN`` values in group keys. Users can define ``dropna`` to ``False`` if they want to include | ||
``NaN`` values in groupby keys. The default is set to ``True`` for ``dropna`` to keep backwards | ||
compatibility (:issue:`3729`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add an example (like you have in the doc-strings)
also add examples in groupby.rst (and point from here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in both!
pandas/core/algorithms.py
Outdated
@@ -630,6 +634,9 @@ def factorize( | |||
uniques, codes = safe_sort( | |||
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False | |||
) | |||
if dropna is False and (codes == na_sentinel).any(): | |||
uniques = np.append(uniques, [np.nan]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right so instead of adding a null to the categories like you are doing, you just add the appropriate -1 entries in the codes which automatically handle the nullness
pandas/core/frame.py
Outdated
@@ -5648,6 +5648,41 @@ def update( | |||
Type | |||
Captive 210.0 | |||
Wild 185.0 | |||
|
|||
We can also choose to include NaN in group keys or not by defining |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
defining -> setting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rephrased
pandas/core/generic.py
Outdated
@@ -7346,6 +7346,12 @@ def clip( | |||
If False: show all values for categorical groupers. | |||
|
|||
.. versionadded:: 0.23.0 | |||
dropna : bool, default True | |||
If True, and if group keys contain NaN values, NaN values together |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't say NaN, say NA values (e.g. can also be NaT or the new NA scalar)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed!
pandas/core/algorithms.py
Outdated
values, | ||
sort: bool = False, | ||
na_sentinel: int = -1, | ||
size_hint: Optional[int] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some independent tests for factorize with dropna
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
emm, I think i added two test cases already in test_algos
. Is it what you mean? or you want to have different test cases?
ping @jreback while waiting for other reviews p.s. somehow, there are less-than-usual CI checks after a rebase (i expect at least 13 or so, but see only 6, seems a lot of checks on other OS or python version are not triggered), not sure how/why, and pls let me know if another rebase is needed to see if the change is green on all. |
hmm, try rebasing again, this happened once before then resolved itself. |
emm, weird, indeed seems it resolves itself. @jreback |
ping for reviews ^^ |
Don't wait around on me.
…On Tue, Apr 28, 2020 at 5:21 AM Kaiqi Dong ***@***.***> wrote:
ping for reviews ^^
@jreback <https://github.com/jreback> @jbrockmendel
<https://github.com/jbrockmendel> @TomAugspurger
<https://github.com/TomAugspurger> @jorisvandenbossche
<https://github.com/jorisvandenbossche>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#30584 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIQFPAQ3UJ52HM43DWLRO2U4PANCNFSM4KBWP3MQ>
.
|
maybe @WillAyd @jbrockmendel could take a look for some feedbacks? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a nice change. @jreback
@charlesdong1991 can you merge master and ping on green. |
thanks @charlesdong1991 very nice! this has been a long time requested feature! |
stupid question: how does that feature propagate into the standard pandas branches? any timeframe? |
My understanding is this is slated for inclusion in pandas 1.1.0, but I'm not sure when that is planned for. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Note that this PR will NOT fix the issue for
pivot_table
for now, the reason is that there is already an argument calleddropna
inpivot_table
and it has slightly different meaning, currently it means:Do not include columns whose entries are all NaN
.I would propose a change in the follow-up PR for this since this is an api change: change the name of current
dropna
todrop_all_na
maybe? and then adddropna
to it and it is aligned with thedropna
in groupby.Summary:
After this PR, it will optional to inlcude NaN in group keys, e.g. below, and i also add example in docstring as well:
will get
with
dropna=False
,For Series, it is the same: