ENH: Add dropna in groupby to allow NaN in keys #30584

Merged (59 commits) on May 9, 2020
Changes from 17 commits

Commits
7e461a1
remove \n from docstring
charlesdong1991 Dec 3, 2018
1314059
fix conflicts
charlesdong1991 Jan 19, 2019
8bcb313
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Jul 30, 2019
13b03a8
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Dec 31, 2019
98f6127
fix issue 3729
charlesdong1991 Dec 31, 2019
d5fd74c
fix conflicts
charlesdong1991 Dec 31, 2019
eb717ec
not check type
charlesdong1991 Dec 31, 2019
de2ee5d
Add groupby test for Series
charlesdong1991 Dec 31, 2019
def05cc
Add whatsnew note
charlesdong1991 Dec 31, 2019
2888807
Code change based on JR review
charlesdong1991 Jan 1, 2020
b357659
fix conflicts
charlesdong1991 Jan 1, 2020
dc4fef1
add forgotten commits
charlesdong1991 Jan 1, 2020
25482ec
add forgotten commit
charlesdong1991 Jan 1, 2020
015336d
Add dropna for series
charlesdong1991 Jan 1, 2020
ac2a79f
add doc example for Series
charlesdong1991 Jan 1, 2020
eb9a6f7
Add level example for series groupby
charlesdong1991 Jan 1, 2020
ffb70f8
Add doc example for frame groupby
charlesdong1991 Jan 1, 2020
b0e3cce
Code change based on JR reviews
charlesdong1991 Jan 2, 2020
a1d5510
add doc
charlesdong1991 Jan 2, 2020
11ef56a
move doc
charlesdong1991 Jan 2, 2020
b247a8b
NaN to NA
charlesdong1991 Jan 2, 2020
7cb027c
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Jan 2, 2020
d730c4a
Fix linting
charlesdong1991 Jan 2, 2020
42c4934
fix rst issue
charlesdong1991 Jan 2, 2020
2ba79b9
fix rst issue
charlesdong1991 Jan 2, 2020
8b79b6c
refactor based on WA review
charlesdong1991 Jan 3, 2020
a4fdf2d
merge master and resolve conflicts
charlesdong1991 Feb 10, 2020
4ac15e3
remove blank
charlesdong1991 Feb 10, 2020
4ebbad3
code change on reviews
charlesdong1991 Feb 10, 2020
f141b80
use pd.testing
charlesdong1991 Feb 10, 2020
23ad19b
linting
charlesdong1991 Feb 10, 2020
bafc4a5
fixup
charlesdong1991 Feb 10, 2020
c98bafe
fixup
charlesdong1991 Feb 10, 2020
86a5958
doc
charlesdong1991 Feb 10, 2020
6cf31d7
validation
charlesdong1991 Feb 10, 2020
2b77f37
xfail windows
charlesdong1991 Feb 10, 2020
451ec97
rebase and resolve conflict
charlesdong1991 Feb 19, 2020
1089b18
fixup based on WA review
charlesdong1991 Feb 22, 2020
63da563
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Feb 22, 2020
1b3f22a
fix conflicts
charlesdong1991 Apr 7, 2020
3f360a9
reduce tests
charlesdong1991 Apr 7, 2020
5cabe4b
fix pep8
charlesdong1991 Apr 7, 2020
76ffb9f
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 11, 2020
6c126c7
rebase and docs fixes
charlesdong1991 Apr 11, 2020
6d61d6a
fixup doc
charlesdong1991 Apr 11, 2020
3630e8b
remove inferred type
charlesdong1991 Apr 11, 2020
1cec7f1
better comment
charlesdong1991 Apr 11, 2020
1a1bb49
remove xfail
charlesdong1991 Apr 11, 2020
7ea2e79
use fixture
charlesdong1991 Apr 11, 2020
13b1e9a
coelse type for windows build
charlesdong1991 Apr 11, 2020
92a7eed
fixup
charlesdong1991 Apr 11, 2020
1315a9d
fixup
charlesdong1991 Apr 11, 2020
a7959d5
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 15, 2020
9fec9a8
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 27, 2020
ffbae76
Doc fixup
charlesdong1991 Apr 27, 2020
ef90d7c
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 27, 2020
e219748
rebase and resolve conflict
charlesdong1991 Apr 27, 2020
2940908
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 May 7, 2020
4ea6aa0
try merge master again
charlesdong1991 May 7, 2020
10 changes: 10 additions & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -191,6 +191,16 @@ method on a :func:`pandas.api.indexers.BaseIndexer` subclass that will generate
indices used for each window during the rolling aggregation. For more details and example usage, see
the :ref:`custom window rolling documentation <stats.custom_rolling_window>`

.. _whatsnew_1000.groupby_key:
Contributor: move to 1.1

Member Author: moved


Allow NaN in groupby key
^^^^^^^^^^^^^^^^^^^^^^^^

We've added a ``dropna`` keyword to :meth:`DataFrame.groupby` and :meth:`Series.groupby` to
allow ``NaN`` values in group keys. Users can set ``dropna`` to ``False`` to include ``NaN``
values in group keys. The default of ``True`` preserves backwards compatibility (:issue:`3729`)
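A minimal sketch of the keyword described above (assuming pandas >= 1.1, where this feature shipped; the frame here is illustrative, not from the PR):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", np.nan, "a"], "val": [1, 2, 3, 4]})

# Default behaviour is unchanged: rows with a NaN key are dropped.
print(df.groupby("key")["val"].sum())  # a -> 5, b -> 2

# dropna=False keeps NaN as its own group (sorted last).
print(df.groupby("key", dropna=False)["val"].sum())  # a -> 5, b -> 2, NaN -> 3
```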
Contributor: add an example (like you have in the doc-strings); also add examples in groupby.rst (and point from here)

Member Author: added in both!


.. _whatsnew_1000.enhancements.other:

Other enhancements
9 changes: 8 additions & 1 deletion pandas/core/algorithms.py
@@ -596,7 +596,11 @@ def _factorize_array(
)
@Appender(_shared_docs["factorize"])
def factorize(
values, sort: bool = False, na_sentinel: int = -1, size_hint: Optional[int] = None
values,
sort: bool = False,
na_sentinel: int = -1,
size_hint: Optional[int] = None,
Contributor: can you add some independent tests for factorize with dropna

Member Author: emm, I think I added two test cases already in test_algos. Is that what you mean, or do you want different test cases?

dropna: bool = True,
) -> Tuple[np.ndarray, Union[np.ndarray, ABCIndex]]:
# Implementation notes: This method is responsible for 3 things
# 1.) coercing data to array-like (ndarray, Index, extension array)
@@ -630,6 +634,9 @@ def factorize(
uniques, codes = safe_sort(
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
)
if not dropna and (codes == na_sentinel).any():
Member: Can you assign codes == na_sentinel to a mask variable prior to this? Slightly preferable to re-evaluating this in the branch

Member Author: done, thanks!

uniques = np.append(uniques, [np.nan])
Contributor: ideally we push this down to cython, but ok here

Contributor: hmm, this should be a dtype appropriate for the dtype

Member Author: emm, I think dtype is defined below in _reconstruct_data? Also, if added, for categorical values, we get a ValueError: Categorial categories cannot be null.

Contributor: right, so instead of adding a null to the categories like you are doing, you just add the appropriate -1 entries in the codes, which automatically handles the nullness

Member Author: emm, this has to be a None or NA value in there. Otherwise, the output of uniques does not have an NA value, and also, my python will crash with a FatalError somehow

codes = np.where(codes == na_sentinel, len(uniques) - 1, codes)

uniques = _reconstruct_data(uniques, dtype, original)
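The NA-keeping step in this hunk, including the reviewer's mask suggestion, can be sketched with plain numpy; the values below are illustrative stand-ins, not pandas internals:

```python
import numpy as np

na_sentinel = -1
# codes/uniques as they might look after factorizing [1.0, NaN, 2.0, 1.0]:
codes = np.array([0, na_sentinel, 1, 0])
uniques = np.array([1.0, 2.0])

mask = codes == na_sentinel  # computed once, per the review suggestion
if mask.any():               # the dropna=False branch
    # append one NaN "unique" and point the sentinel codes at it
    uniques = np.append(uniques, [np.nan])
    codes = np.where(mask, len(uniques) - 1, codes)

print(codes)    # [0 2 1 0]
print(uniques)  # [ 1.  2. nan]
```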

39 changes: 38 additions & 1 deletion pandas/core/frame.py
@@ -1988,7 +1988,7 @@ def to_feather(self, path):
@Substitution(klass="DataFrame")
@Appender(_shared_docs["to_markdown"])
def to_markdown(
self, buf: Optional[IO[str]] = None, mode: Optional[str] = None, **kwargs,
self, buf: Optional[IO[str]] = None, mode: Optional[str] = None, **kwargs
) -> Optional[str]:
kwargs.setdefault("headers", "keys")
kwargs.setdefault("tablefmt", "pipe")
@@ -5648,6 +5648,41 @@ def update(
Type
Captive 210.0
Wild 185.0

We can also choose to include NaN in group keys or not by setting the
`dropna` parameter; the default is `True`:

Contributor: defining -> setting

Member Author: rephrased

Contributor: mention the default


>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])

>>> df.groupby(by=["b"]).sum()
a c
b
1.0 2 3
2.0 2 5

>>> df.groupby(by=["b"], dropna=False).sum()
a c
b
1.0 2 3
2.0 2 5
NaN 1 4

>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])

>>> df.groupby(by="a").sum()
b c
a
a 13.0 13.0
b 12.3 123.0

>>> df.groupby(by="a", dropna=False).sum()
b c
a
a 13.0 13.0
b 12.3 123.0
NaN 12.3 33.0
"""
)
@Appender(_shared_docs["groupby"] % _shared_doc_kwargs)
@@ -5661,6 +5696,7 @@ def groupby(
group_keys: bool = True,
squeeze: bool = False,
observed: bool = False,
dropna: bool = True,
) -> "groupby_generic.DataFrameGroupBy":

if level is None and by is None:
@@ -5677,6 +5713,7 @@
group_keys=group_keys,
squeeze=squeeze,
observed=observed,
dropna=dropna,
)

_shared_docs[
6 changes: 6 additions & 0 deletions pandas/core/generic.py
@@ -7346,6 +7346,12 @@ def clip(
If False: show all values for categorical groupers.

.. versionadded:: 0.23.0
dropna : bool, default True
If True, and if group keys contain NaN values, the NaN values together
with the row/column will be dropped.
If False, NaN values will also be treated as keys in groups.

Contributor: don't say NaN, say NA values (e.g. can also be NaT or the new NA scalar)

Member Author: changed!

.. versionadded:: 1.0.0
Member: Suggested change:
- .. versionadded:: 1.0.0
+ .. versionadded:: 1.1.0


Returns
-------
6 changes: 6 additions & 0 deletions pandas/core/groupby/groupby.py
@@ -373,6 +373,7 @@ def __init__(
squeeze: bool = False,
observed: bool = False,
mutated: bool = False,
dropna: bool = True,
):

self._selection = selection
@@ -396,6 +397,8 @@ def __init__(
self.observed = observed
self.mutated = mutated

self.dropna = dropna
jreback marked this conversation as resolved.

if grouper is None:
from pandas.core.groupby.grouper import get_grouper

@@ -407,6 +410,7 @@
sort=sort,
observed=observed,
mutated=self.mutated,
dropna=self.dropna,
)

self.obj = obj
@@ -2543,6 +2547,7 @@ def get_groupby(
squeeze: bool = False,
observed: bool = False,
mutated: bool = False,
dropna: bool = True,
) -> GroupBy:

klass: Type[GroupBy]
@@ -2571,4 +2576,5 @@
squeeze=squeeze,
observed=observed,
mutated=mutated,
dropna=dropna,
)
15 changes: 13 additions & 2 deletions pandas/core/groupby/grouper.py
@@ -100,7 +100,9 @@ def __new__(cls, *args, **kwargs):
cls = TimeGrouper
return super().__new__(cls)

def __init__(self, key=None, level=None, freq=None, axis=0, sort=False):
def __init__(
self, key=None, level=None, freq=None, axis=0, sort=False, dropna=True
jreback marked this conversation as resolved.
):
self.key = key
self.level = level
self.freq = freq
@@ -112,6 +114,7 @@ def __init__(self, key=None, level=None, freq=None, axis=0, sort=False):
self.indexer = None
self.binner = None
self._grouper = None
self.dropna = dropna

@property
def ax(self):
@@ -138,6 +141,7 @@ def _get_grouper(self, obj, validate: bool = True):
level=self.level,
sort=self.sort,
validate=validate,
dropna=self.dropna,
)
return self.binner, self.grouper, self.obj
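Since Grouper's _get_grouper now forwards dropna, the keyword should also reach groupby through pd.Grouper; a hedged sketch (assuming pandas >= 1.1, and that Grouper honors dropna for a plain column key — the frame is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"b": [1.0, np.nan, 1.0], "c": [1, 2, 3]})

# Grouping through pd.Grouper with dropna=False should keep a NaN group.
res = df.groupby(pd.Grouper(key="b", dropna=False))["c"].sum()
print(res)
```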

@@ -250,6 +254,7 @@ def __init__(
sort: bool = True,
observed: bool = False,
in_axis: bool = False,
dropna: bool = True,
):
self.name = name
self.level = level
@@ -261,6 +266,8 @@ def __init__(
self.observed = observed
self.in_axis = in_axis

self.dropna = dropna
jreback marked this conversation as resolved.

# right place for this?
if isinstance(grouper, (Series, Index)) and name is None:
self.name = grouper.name
@@ -413,7 +420,9 @@ def _make_codes(self) -> None:
codes = self.grouper.codes_info
uniques = self.grouper.result_index
else:
codes, uniques = algorithms.factorize(self.grouper, sort=self.sort)
codes, uniques = algorithms.factorize(
self.grouper, sort=self.sort, dropna=self.dropna
)
uniques = Index(uniques, name=self.name)
self._codes = codes
self._group_index = uniques
@@ -432,6 +441,7 @@ def get_grouper(
observed: bool = False,
mutated: bool = False,
validate: bool = True,
dropna: bool = True,
) -> "Tuple[ops.BaseGrouper, List[Hashable], FrameOrSeries]":
"""
Create and return a BaseGrouper, which is an internal
@@ -621,6 +631,7 @@ def is_in_obj(gpr) -> bool:
sort=sort,
observed=observed,
in_axis=in_axis,
dropna=dropna,
)
if not isinstance(gpr, Grouping)
else gpr
36 changes: 32 additions & 4 deletions pandas/core/series.py
@@ -1433,7 +1433,7 @@ def to_string(
@Substitution(klass="Series")
@Appender(generic._shared_docs["to_markdown"])
def to_markdown(
self, buf: Optional[IO[str]] = None, mode: Optional[str] = None, **kwargs,
self, buf: Optional[IO[str]] = None, mode: Optional[str] = None, **kwargs
) -> Optional[str]:
return self.to_frame().to_markdown(buf, mode, **kwargs)

@@ -1620,6 +1620,34 @@ def _set_name(self, name, inplace=False):
Captive 210.0
Wild 185.0
Name: Max Speed, dtype: float64

We can also choose to include NaN in group keys or not by setting the
`dropna` parameter; the default is `True`:

Contributor: mention the default value

Member Author: mentioned

>>> ser = pd.Series([1, 2, 3, 3], index=["a", 'a', 'b', np.nan])
>>> ser.groupby(level=0).sum()
a 3
b 3
dtype: int64

>>> ser.groupby(level=0, dropna=False).sum()
a 3
b 3
NaN 3
dtype: int64

>>> arrays = ['Falcon', 'Falcon', 'Parrot', 'Parrot']
>>> ser = pd.Series([390., 350., 30., 20.], index=arrays, name="Max Speed")
>>> ser.groupby(["a", "b", "a", np.nan]).mean()
a 210.0
b 350.0
Name: Max Speed, dtype: float64

>>> ser.groupby(["a", "b", "a", np.nan], dropna=False).mean()
a 210.0
b 350.0
NaN 20.0
Name: Max Speed, dtype: float64
"""
)
@Appender(generic._shared_docs["groupby"] % _shared_doc_kwargs)
@@ -1633,6 +1661,7 @@ def groupby(
group_keys: bool = True,
squeeze: bool = False,
observed: bool = False,
dropna: bool = True,
) -> "groupby_generic.SeriesGroupBy":

if level is None and by is None:
@@ -1649,6 +1678,7 @@
group_keys=group_keys,
squeeze=squeeze,
observed=observed,
dropna=dropna,
)

# ----------------------------------------------------------------------
@@ -4478,9 +4508,7 @@ def to_period(self, freq=None, copy=True):
hist = pandas.plotting.hist_series


Series._setup_axes(
["index"], docs={"index": "The index (axis labels) of the Series."},
)
Series._setup_axes(["index"], docs={"index": "The index (axis labels) of the Series."})
Series._add_numeric_operations()
Series._add_series_or_dataframe_operations()
