ENH: Add dropna in groupby to allow NaN in keys #30584

Merged: 59 commits, merged on May 9, 2020
Changes from 9 commits

Commits (59)
7e461a1
remove \n from docstring
charlesdong1991 Dec 3, 2018
1314059
fix conflicts
charlesdong1991 Jan 19, 2019
8bcb313
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Jul 30, 2019
13b03a8
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Dec 31, 2019
98f6127
fix issue 3729
charlesdong1991 Dec 31, 2019
d5fd74c
fix conflicts
charlesdong1991 Dec 31, 2019
eb717ec
not check type
charlesdong1991 Dec 31, 2019
de2ee5d
Add groupby test for Series
charlesdong1991 Dec 31, 2019
def05cc
Add whatsnew note
charlesdong1991 Dec 31, 2019
2888807
Code change based on JR review
charlesdong1991 Jan 1, 2020
b357659
fix conflicts
charlesdong1991 Jan 1, 2020
dc4fef1
add forgotten commits
charlesdong1991 Jan 1, 2020
25482ec
add forgotten commit
charlesdong1991 Jan 1, 2020
015336d
Add dropna for series
charlesdong1991 Jan 1, 2020
ac2a79f
add doc example for Series
charlesdong1991 Jan 1, 2020
eb9a6f7
Add level example for series groupby
charlesdong1991 Jan 1, 2020
ffb70f8
Add doc example for frame groupby
charlesdong1991 Jan 1, 2020
b0e3cce
Code change based on JR reviews
charlesdong1991 Jan 2, 2020
a1d5510
add doc
charlesdong1991 Jan 2, 2020
11ef56a
move doc
charlesdong1991 Jan 2, 2020
b247a8b
NaN to NA
charlesdong1991 Jan 2, 2020
7cb027c
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Jan 2, 2020
d730c4a
Fix linting
charlesdong1991 Jan 2, 2020
42c4934
fix rst issue
charlesdong1991 Jan 2, 2020
2ba79b9
fix rst issue
charlesdong1991 Jan 2, 2020
8b79b6c
refactor based on WA review
charlesdong1991 Jan 3, 2020
a4fdf2d
merge master and resolve conflicts
charlesdong1991 Feb 10, 2020
4ac15e3
remove blank
charlesdong1991 Feb 10, 2020
4ebbad3
code change on reviews
charlesdong1991 Feb 10, 2020
f141b80
use pd.testing
charlesdong1991 Feb 10, 2020
23ad19b
linting
charlesdong1991 Feb 10, 2020
bafc4a5
fixup
charlesdong1991 Feb 10, 2020
c98bafe
fixup
charlesdong1991 Feb 10, 2020
86a5958
doc
charlesdong1991 Feb 10, 2020
6cf31d7
validation
charlesdong1991 Feb 10, 2020
2b77f37
xfail windows
charlesdong1991 Feb 10, 2020
451ec97
rebase and resolve conflict
charlesdong1991 Feb 19, 2020
1089b18
fixup based on WA review
charlesdong1991 Feb 22, 2020
63da563
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Feb 22, 2020
1b3f22a
fix conflicts
charlesdong1991 Apr 7, 2020
3f360a9
reduce tests
charlesdong1991 Apr 7, 2020
5cabe4b
fix pep8
charlesdong1991 Apr 7, 2020
76ffb9f
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 11, 2020
6c126c7
rebase and docs fixes
charlesdong1991 Apr 11, 2020
6d61d6a
fixup doc
charlesdong1991 Apr 11, 2020
3630e8b
remove inferred type
charlesdong1991 Apr 11, 2020
1cec7f1
better comment
charlesdong1991 Apr 11, 2020
1a1bb49
remove xfail
charlesdong1991 Apr 11, 2020
7ea2e79
use fixture
charlesdong1991 Apr 11, 2020
13b1e9a
coelse type for windows build
charlesdong1991 Apr 11, 2020
92a7eed
fixup
charlesdong1991 Apr 11, 2020
1315a9d
fixup
charlesdong1991 Apr 11, 2020
a7959d5
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 15, 2020
9fec9a8
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 27, 2020
ffbae76
Doc fixup
charlesdong1991 Apr 27, 2020
ef90d7c
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 Apr 27, 2020
e219748
rebase and resolve conflict
charlesdong1991 Apr 27, 2020
2940908
Merge remote-tracking branch 'upstream/master' into fix_issue_3729
charlesdong1991 May 7, 2020
4ea6aa0
try merge master again
charlesdong1991 May 7, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.0.rst
@@ -549,6 +549,7 @@ Other API changes
Supplying anything else than ``how`` to ``**kwargs`` raised a ``TypeError`` previously (:issue:`29388`)
- When testing pandas, the new minimum required version of pytest is 5.0.1 (:issue:`29664`)
- :meth:`Series.str.__iter__` was deprecated and will be removed in future releases (:issue:`28277`).
- :meth:`DataFrame.groupby` and :meth:`Series.groupby` have gained ``dropna`` argument in order to allow ``NaN`` values in group keys (:issue:`3729`)
Contributor:
would need a subsection, this is a major new feature

Member Author:
added!

Member Author:
And I also updated the description at the top.



.. _whatsnew_1000.api.documentation:
9 changes: 8 additions & 1 deletion pandas/core/algorithms.py
@@ -596,7 +596,11 @@ def _factorize_array(
)
@Appender(_shared_docs["factorize"])
def factorize(
values, sort: bool = False, na_sentinel: int = -1, size_hint: Optional[int] = None
values,
sort: bool = False,
na_sentinel: int = -1,
size_hint: Optional[int] = None,
Contributor:
can you add some independent tests for factorize with dropna

Member Author:
Hmm, I think I already added two test cases in test_algos. Is that what you mean, or do you want different test cases?

dropna: Optional[bool] = None,
Contributor:
shouldn't this be just bool?

Member Author:
changed to bool

) -> Tuple[np.ndarray, Union[np.ndarray, ABCIndex]]:
# Implementation notes: This method is responsible for 3 things
# 1.) coercing data to array-like (ndarray, Index, extension array)
@@ -630,6 +634,9 @@ def factorize(
uniques, codes = safe_sort(
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
)
if dropna is False and (codes == na_sentinel).any():
uniques = np.append(uniques, [np.nan])
Contributor:
ideally we push this down to cython, but ok here

Contributor:
Hmm, this should be a missing value appropriate for the dtype.

Member Author:
I think the dtype is handled below in _reconstruct_data? Also, if added, categorical values will raise a ValueError: Categorical categories cannot be null.

Contributor:
Right, so instead of adding a null to the categories like you are doing, you just add the appropriate -1 entries in the codes, which automatically handles the nullness.

Member Author:
Hmm, there has to be a None or NA value in there; otherwise the output of uniques does not contain the NA value, and my Python also crashes with a FatalError somehow.
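For reference on the reviewer's point about -1 codes, a minimal illustration (not part of this diff) of how a Categorical treats a -1 code as missing without NaN ever entering the categories:

```python
import pandas as pd

# A code of -1 marks a missing value; the categories themselves stay non-null.
cat = pd.Categorical.from_codes(codes=[0, -1, 1], categories=["a", "b"])
print(cat)
# ['a', NaN, 'b']
# Categories (2, object): ['a', 'b']
```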

codes = np.where(codes == na_sentinel, len(uniques) - 1, codes)

uniques = _reconstruct_data(uniques, dtype, original)

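To make the two added lines above concrete, here is a small sketch that reproduces their effect by hand, starting from the default pd.factorize output (the dropna keyword itself is what this PR adds, so the sketch does not rely on it):

```python
import numpy as np
import pandas as pd

values = np.array(["a", None, "b", "a"], dtype=object)

# Default behaviour: missing values get the na_sentinel (-1) and are
# absent from `uniques`.
codes, uniques = pd.factorize(values)
# codes   -> array([ 0, -1,  1,  0])
# uniques -> array(['a', 'b'], dtype=object)

# What the added branch does for dropna=False: append NaN to `uniques`
# and point the sentinel codes at that new slot.
na_sentinel = -1
if (codes == na_sentinel).any():
    uniques = np.append(uniques, [np.nan])
    codes = np.where(codes == na_sentinel, len(uniques) - 1, codes)
# codes   -> array([0, 2, 1, 0])
# uniques -> array(['a', 'b', nan], dtype=object)
```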
8 changes: 8 additions & 0 deletions pandas/core/generic.py
@@ -7311,6 +7311,7 @@ def groupby(
group_keys: bool_t = True,
squeeze: bool_t = False,
observed: bool_t = False,
dropna: Optional[bool_t] = None,
):
"""
Group DataFrame or Series using a mapper or by a Series of columns.
@@ -7355,6 +7356,12 @@ def groupby(
If False: show all values for categorical groupers.

.. versionadded:: 0.23.0
dropna : bool or None, default None
If None or True, and if group keys contain NaN values, the NaN values together
with the corresponding row/column will be dropped.
If False, NaN values will also be treated as keys in groups.

.. versionadded:: 1.0.0
Member:
Suggested change: .. versionadded:: 1.0.0 -> .. versionadded:: 1.1.0


Returns
-------
@@ -7433,6 +7440,7 @@ def groupby(
group_keys=group_keys,
squeeze=squeeze,
observed=observed,
dropna=dropna,
)

def asfreq(
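A short usage sketch of the keyword documented above, matching the behaviour as merged (the dropna keyword shipped in pandas 1.1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", np.nan, "x"], "b": [1, 2, 3]})

df.groupby("a").sum()                # default dropna=True: the NaN key is dropped
#    b
# a
# x  4

df.groupby("a", dropna=False).sum()  # NaN is kept as its own group
#      b
# a
# x    4
# NaN  2
```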
8 changes: 8 additions & 0 deletions pandas/core/groupby/groupby.py
@@ -373,6 +373,7 @@ def __init__(
squeeze: bool = False,
observed: bool = False,
mutated: bool = False,
dropna: Optional[bool] = None,
):

self._selection = selection
@@ -396,6 +397,8 @@ def __init__(
self.observed = observed
self.mutated = mutated

self.dropna = dropna if dropna is not None else True

if grouper is None:
from pandas.core.groupby.grouper import get_grouper

@@ -407,6 +410,7 @@ def __init__(
sort=sort,
observed=observed,
mutated=self.mutated,
dropna=self.dropna,
)

self.obj = obj
@@ -2543,6 +2547,7 @@ def get_groupby(
squeeze: bool = False,
observed: bool = False,
mutated: bool = False,
dropna: Optional[bool] = None,
) -> GroupBy:

klass: Type[GroupBy]
@@ -2557,6 +2562,8 @@ def get_groupby(
else:
raise TypeError(f"invalid type: {obj}")

dropna = dropna if dropna is not None else True

return klass(
obj=obj,
keys=by,
@@ -2571,4 +2578,5 @@ def get_groupby(
squeeze=squeeze,
observed=observed,
mutated=mutated,
dropna=dropna,
)
15 changes: 13 additions & 2 deletions pandas/core/groupby/grouper.py
@@ -100,7 +100,9 @@ def __new__(cls, *args, **kwargs):
cls = TimeGrouper
return super().__new__(cls)

def __init__(self, key=None, level=None, freq=None, axis=0, sort=False):
def __init__(
self, key=None, level=None, freq=None, axis=0, sort=False, dropna=None
):
self.key = key
self.level = level
self.freq = freq
@@ -112,6 +114,7 @@ def __init__(self, key=None, level=None, freq=None, axis=0, sort=False):
self.indexer = None
self.binner = None
self._grouper = None
self.dropna = dropna if dropna is not None else True

@property
def ax(self):
@@ -138,6 +141,7 @@ def _get_grouper(self, obj, validate: bool = True):
level=self.level,
sort=self.sort,
validate=validate,
dropna=self.dropna,
)
return self.binner, self.grouper, self.obj

@@ -250,6 +254,7 @@ def __init__(
sort: bool = True,
observed: bool = False,
in_axis: bool = False,
dropna: Optional[bool] = None,
):
self.name = name
self.level = level
@@ -261,6 +266,8 @@ def __init__(
self.observed = observed
self.in_axis = in_axis

self.dropna = dropna if dropna is not None else True

# right place for this?
if isinstance(grouper, (Series, Index)) and name is None:
self.name = grouper.name
@@ -413,7 +420,9 @@ def _make_codes(self) -> None:
codes = self.grouper.codes_info
uniques = self.grouper.result_index
else:
codes, uniques = algorithms.factorize(self.grouper, sort=self.sort)
codes, uniques = algorithms.factorize(
self.grouper, sort=self.sort, dropna=self.dropna
)
uniques = Index(uniques, name=self.name)
self._codes = codes
self._group_index = uniques
@@ -432,6 +441,7 @@ def get_grouper(
observed: bool = False,
mutated: bool = False,
validate: bool = True,
dropna: Optional[bool] = None,
) -> "Tuple[ops.BaseGrouper, List[Hashable], FrameOrSeries]":
"""
Create and return a BaseGrouper, which is an internal
@@ -621,6 +631,7 @@ def is_in_obj(gpr) -> bool:
sort=sort,
observed=observed,
in_axis=in_axis,
dropna=dropna,
)
if not isinstance(gpr, Grouping)
else gpr
134 changes: 134 additions & 0 deletions pandas/tests/groupby/test_groupby.py
@@ -2025,3 +2025,137 @@ def test_groupby_crash_on_nunique(axis):
expected = expected.T

tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
Contributor:
can you make a new file, test_groupby_dropna

Member Author:
yes, moved to test_groupby_dropna

"dropna, tuples, outputs",
[
(
None,
[["A", "B"], ["B", "A"]],
{"c": [13.0, 123.23], "d": [13.0, 123.0], "e": [13.0, 1.0]},
),
(
True,
[["A", "B"], ["B", "A"]],
{"c": [13.0, 123.23], "d": [13.0, 123.0], "e": [13.0, 1.0]},
),
(
False,
[["A", "B"], ["A", np.nan], ["B", "A"]],
{
"c": [13.0, 12.3, 123.23],
"d": [13.0, 233.0, 123.0],
"e": [13.0, 12.0, 1.0],
},
),
],
)
def test_groupby_dropna_multi_index_dataframe(dropna, tuples, outputs):
# GH 3729
df_list = [
["A", "B", 12, 12, 12],
["A", None, 12.3, 233.0, 12],
["B", "A", 123.23, 123, 1],
["A", "B", 1, 1, 1.0],
]
df = pd.DataFrame(df_list, columns=["a", "b", "c", "d", "e"])
grouped = df.groupby(["a", "b"], dropna=dropna).sum()

mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
expected = pd.DataFrame(outputs, index=mi)

tm.assert_frame_equal(grouped, expected, check_index_type=False)


@pytest.mark.parametrize(
"dropna, idx, outputs",
[
(None, ["A", "B"], {"b": [123.23, 13.0], "c": [123.0, 13.0], "d": [1.0, 13.0]}),
(True, ["A", "B"], {"b": [123.23, 13.0], "c": [123.0, 13.0], "d": [1.0, 13.0]}),
(
False,
["A", "B", np.nan],
{
"b": [123.23, 13.0, 12.3],
"c": [123.0, 13.0, 233.0],
"d": [1.0, 13.0, 12.0],
},
),
],
)
def test_groupby_dropna_normal_index_dataframe(dropna, idx, outputs):
# GH 3729
df_list = [
["B", 12, 12, 12],
[None, 12.3, 233.0, 12],
["A", 123.23, 123, 1],
["B", 1, 1, 1.0],
]
df = pd.DataFrame(df_list, columns=["a", "b", "c", "d"])
grouped = df.groupby("a", dropna=dropna).sum()

expected = pd.DataFrame(outputs, index=pd.Index(idx, dtype="object", name="a"))

tm.assert_frame_equal(grouped, expected, check_index_type=False)


@pytest.mark.parametrize(
"dropna, idx, expected",
[
(None, ["a", "a", "b", np.nan], pd.Series([3, 3], index=["a", "b"])),
(True, ["a", "a", "b", np.nan], pd.Series([3, 3], index=["a", "b"])),
(
False,
["a", "a", "b", np.nan],
pd.Series([3, 3, 3], index=["a", "b", np.nan]),
),
],
)
def test_groupby_dropna_series(dropna, idx, expected):
ser = pd.Series([1, 2, 3, 3], index=idx)

result = ser.groupby(level=0, dropna=dropna).sum()
tm.assert_series_equal(result, expected)


@pytest.mark.parametrize(
"dropna, tuples, outputs",
[
(
None,
[["A", "B"], ["B", "A"]],
{"c": [13.0, 123.23], "d": [12.0, 123.0], "e": [1.0, 1.0]},
),
(
True,
[["A", "B"], ["B", "A"]],
{"c": [13.0, 123.23], "d": [12.0, 123.0], "e": [1.0, 1.0]},
),
(
False,
[["A", "B"], ["A", np.nan], ["B", "A"]],
{
"c": [13.0, 12.3, 123.23],
"d": [12.0, 233.0, 123.0],
"e": [1.0, 12.0, 1.0],
},
),
],
)
def test_groupby_dropna_multi_index_dataframe_agg(dropna, tuples, outputs):
Contributor:
can you add an example that uses NaT (datetime & timedelta)

Member Author:
added! see test_groupby_dropna (a short NaT sketch follows at the end of this diff)

# GH 3729
df_list = [
["A", "B", 12, 12, 12],
["A", None, 12.3, 233.0, 12],
["B", "A", 123.23, 123, 1],
["A", "B", 1, 1, 1.0],
]
Contributor:
What if we have NaN in two groups? Does this work (e.g., different first level, but NaN in the second)? Also, how is NaN in the first level handled?

Member Author:
added a lot more different scenarios!

df = pd.DataFrame(df_list, columns=["a", "b", "c", "d", "e"])
agg_dict = {"c": sum, "d": max, "e": "min"}
grouped = df.groupby(["a", "b"], dropna=dropna).agg(agg_dict)

mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
expected = pd.DataFrame(outputs, index=mi)

tm.assert_frame_equal(grouped, expected, check_index_type=False)
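
As noted above, a small sketch of the NaT case the reviewer asked about (illustrative only; the tests that actually landed live in pandas/tests/groupby/test_groupby_dropna.py):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "dt": pd.to_datetime(["2020-01-01", pd.NaT, "2020-01-01"]),
        "vals": [1, 2, 3],
    }
)

df.groupby("dt").sum()                # NaT rows dropped by default
df.groupby("dt", dropna=False).sum()  # NaT becomes its own group (vals == 2)
```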