ENH: Add dropna in groupby to allow NaN in keys #30584
Changes from 9 commits
```diff
@@ -596,7 +596,11 @@ def _factorize_array(
 )
 @Appender(_shared_docs["factorize"])
 def factorize(
-    values, sort: bool = False, na_sentinel: int = -1, size_hint: Optional[int] = None
+    values,
+    sort: bool = False,
+    na_sentinel: int = -1,
+    size_hint: Optional[int] = None,
+    dropna: Optional[bool] = None,
 ) -> Tuple[np.ndarray, Union[np.ndarray, ABCIndex]]:
     # Implementation notes: This method is responsible for 3 things
     # 1.) coercing data to array-like (ndarray, Index, extension array)
```

Review comments on the new signature:

Review comment: can you add some independent tests for factorize with dropna
Reply: emm, I think i added two test cases already in …

Review comment: shouldn't this be just bool?
Reply: changed to …

```diff
@@ -630,6 +634,9 @@ def factorize(
         uniques, codes = safe_sort(
             uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
         )
+    if dropna is False and (codes == na_sentinel).any():
+        uniques = np.append(uniques, [np.nan])
+        codes = np.where(codes == na_sentinel, len(uniques) - 1, codes)

     uniques = _reconstruct_data(uniques, dtype, original)
```

Review comments on the new dropna branch:

Review comment: ideally we push this down to cython, but ok here

Review comment: hmm, this should be a dtype appropriate for the dtype
Reply: i think …
Review comment: right so instead of adding a null to the categories like you are doing, you just add the appropriate -1 entries in the codes which automatically handle the nullness
Reply: emm, this has to be a …
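The dropna=False branch above can be exercised in isolation with plain NumPy: every code equal to na_sentinel is remapped to a freshly appended NaN slot in the uniques array. This is a minimal sketch of that remapping logic, not the pandas implementation; the helper name `remap_na_codes` is illustrative.

```python
import numpy as np

def remap_na_codes(codes, uniques, na_sentinel=-1):
    """Sketch of the dropna=False branch in factorize: give NaN its own
    slot in `uniques` and point the sentinel codes at it."""
    if (codes == na_sentinel).any():
        # NaN becomes the last entry of uniques ...
        uniques = np.append(uniques, [np.nan])
        # ... and every sentinel code now refers to that entry.
        codes = np.where(codes == na_sentinel, len(uniques) - 1, codes)
    return codes, uniques

codes = np.array([0, -1, 1, 0])  # -1 marks a missing key
uniques = np.array([10.0, 20.0])
codes, uniques = remap_na_codes(codes, uniques)
# codes is now [0, 2, 1, 0]; uniques ends with NaN
```

As the review thread notes, `np.append` with `np.nan` forces a float-capable uniques array, which is why the dtype concern comes up for non-float data.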
```diff
@@ -7311,6 +7311,7 @@ def groupby(
         group_keys: bool_t = True,
         squeeze: bool_t = False,
         observed: bool_t = False,
+        dropna: Optional[bool_t] = None,
     ):
         """
         Group DataFrame or Series using a mapper or by a Series of columns.
```

```diff
@@ -7355,6 +7356,12 @@ def groupby(
             If False: show all values for categorical groupers.

             .. versionadded:: 0.23.0
+        dropna : bool or None, default None
+            If None or True, and if group keys contain NaN values, NaN values together
+            with row/column will be dropped.
+            If False, NaN values will also be treated as the key in groups
+
+            .. versionadded:: 1.0.0

         Returns
         -------
```

charlesdong1991 marked this conversation as resolved.

```diff
@@ -7433,6 +7440,7 @@ def groupby(
             group_keys=group_keys,
             squeeze=squeeze,
             observed=observed,
+            dropna=dropna,
         )

     def asfreq(
```
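For context, a short usage sketch of the parameter this docstring describes. Note that the feature as merged shipped in pandas 1.1 with `dropna: bool = True` rather than the `Optional[bool_t] = None` shown in this revision, so the example below assumes pandas >= 1.1 rather than the exact code in this diff.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["x", None, "x", "y"], "b": [1, 2, 3, 4]})

# Default behaviour: the row whose group key is NaN is dropped,
# so only the "x" and "y" groups remain.
dropped = df.groupby("a").sum()

# dropna=False keeps NaN as a group key of its own (sorted last).
kept = df.groupby("a", dropna=False).sum()
```

The NaN group collects the rows that the default behaviour silently discards, which is the whole point of the enhancement.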
```diff
@@ -2025,3 +2025,137 @@ def test_groupby_crash_on_nunique(axis):
         expected = expected.T

     tm.assert_frame_equal(result, expected)
+
+
+@pytest.mark.parametrize(
+    "dropna, tuples, outputs",
+    [
+        (
+            None,
+            [["A", "B"], ["B", "A"]],
+            {"c": [13.0, 123.23], "d": [13.0, 123.0], "e": [13.0, 1.0]},
+        ),
+        (
+            True,
+            [["A", "B"], ["B", "A"]],
+            {"c": [13.0, 123.23], "d": [13.0, 123.0], "e": [13.0, 1.0]},
+        ),
+        (
+            False,
+            [["A", "B"], ["A", np.nan], ["B", "A"]],
+            {
+                "c": [13.0, 12.3, 123.23],
+                "d": [13.0, 233.0, 123.0],
+                "e": [13.0, 12.0, 1.0],
+            },
+        ),
+    ],
+)
+def test_groupby_dropna_multi_index_dataframe(dropna, tuples, outputs):
+    # GH 3729
+    df_list = [
+        ["A", "B", 12, 12, 12],
+        ["A", None, 12.3, 233.0, 12],
+        ["B", "A", 123.23, 123, 1],
+        ["A", "B", 1, 1, 1.0],
+    ]
+    df = pd.DataFrame(df_list, columns=["a", "b", "c", "d", "e"])
+    grouped = df.groupby(["a", "b"], dropna=dropna).sum()
+
+    mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
+    expected = pd.DataFrame(outputs, index=mi)
+
+    tm.assert_frame_equal(grouped, expected, check_index_type=False)
+
+
+@pytest.mark.parametrize(
+    "dropna, idx, outputs",
+    [
+        (None, ["A", "B"], {"b": [123.23, 13.0], "c": [123.0, 13.0], "d": [1.0, 13.0]}),
+        (True, ["A", "B"], {"b": [123.23, 13.0], "c": [123.0, 13.0], "d": [1.0, 13.0]}),
+        (
+            False,
+            ["A", "B", np.nan],
+            {
+                "b": [123.23, 13.0, 12.3],
+                "c": [123.0, 13.0, 233.0],
+                "d": [1.0, 13.0, 12.0],
+            },
+        ),
+    ],
+)
+def test_groupby_dropna_normal_index_dataframe(dropna, idx, outputs):
+    # GH 3729
+    df_list = [
+        ["B", 12, 12, 12],
+        [None, 12.3, 233.0, 12],
+        ["A", 123.23, 123, 1],
+        ["B", 1, 1, 1.0],
+    ]
+    df = pd.DataFrame(df_list, columns=["a", "b", "c", "d"])
+    grouped = df.groupby("a", dropna=dropna).sum()
+
+    expected = pd.DataFrame(outputs, index=pd.Index(idx, dtype="object", name="a"))
+
+    tm.assert_frame_equal(grouped, expected, check_index_type=False)
+
+
+@pytest.mark.parametrize(
+    "dropna, idx, expected",
+    [
+        (None, ["a", "a", "b", np.nan], pd.Series([3, 3], index=["a", "b"])),
+        (True, ["a", "a", "b", np.nan], pd.Series([3, 3], index=["a", "b"])),
+        (
+            False,
+            ["a", "a", "b", np.nan],
+            pd.Series([3, 3, 3], index=["a", "b", np.nan]),
+        ),
+    ],
+)
+def test_groupby_dropna_series(dropna, idx, expected):
+    ser = pd.Series([1, 2, 3, 3], index=idx)
+
+    result = ser.groupby(level=0, dropna=dropna).sum()
+    tm.assert_series_equal(result, expected)
+
+
+@pytest.mark.parametrize(
+    "dropna, tuples, outputs",
+    [
+        (
+            None,
+            [["A", "B"], ["B", "A"]],
+            {"c": [13.0, 123.23], "d": [12.0, 123.0], "e": [1.0, 1.0]},
+        ),
+        (
+            True,
+            [["A", "B"], ["B", "A"]],
+            {"c": [13.0, 123.23], "d": [12.0, 123.0], "e": [1.0, 1.0]},
+        ),
+        (
+            False,
+            [["A", "B"], ["A", np.nan], ["B", "A"]],
+            {
+                "c": [13.0, 12.3, 123.23],
+                "d": [12.0, 233.0, 123.0],
+                "e": [1.0, 12.0, 1.0],
+            },
+        ),
+    ],
+)
+def test_groupby_dropna_multi_index_dataframe_agg(dropna, tuples, outputs):
+    # GH 3729
+    df_list = [
+        ["A", "B", 12, 12, 12],
+        ["A", None, 12.3, 233.0, 12],
+        ["B", "A", 123.23, 123, 1],
+        ["A", "B", 1, 1, 1.0],
+    ]
+    df = pd.DataFrame(df_list, columns=["a", "b", "c", "d", "e"])
+    agg_dict = {"c": sum, "d": max, "e": "min"}
+    grouped = df.groupby(["a", "b"], dropna=dropna).agg(agg_dict)
+
+    mi = pd.MultiIndex.from_tuples(tuples, names=list("ab"))
+    expected = pd.DataFrame(outputs, index=mi)
+
+    tm.assert_frame_equal(grouped, expected, check_index_type=False)
```

Review comments on the new tests:

Review comment: can you make a new file, test_groupby_dropna
Reply: yes, moved to …

Review comment (on test_groupby_dropna_multi_index_dataframe_agg): can you add an example that uses NaT (datetime & timedelta)
Reply: added! see in …

Review comment: what if we have NaN in 2 groups? does this work (e.g. different 1st level, but NaN in 2nd)? Also how is NaN in the first level handled?
Reply: added a lot more different scenarios!
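The NaT case the reviewer asks for works the same way as NaN in released pandas: a NaT in a datetime group key is dropped by default and kept as its own group with dropna=False. This sketch is not part of the diff above and assumes pandas >= 1.1, where the merged feature shipped.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2020-01-01", None, "2020-01-01"]),
        "v": [1, 2, 4],
    }
)

# NaT key is dropped by default: only the 2020-01-01 group survives.
assert len(df.groupby("when")["v"].sum()) == 1

# dropna=False keeps NaT as its own group alongside 2020-01-01.
assert len(df.groupby("when", dropna=False)["v"].sum()) == 2
```

The same holds for timedelta keys, since NaT is the shared missing-value marker for both dtypes.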
Review comment (on the whatsnew entry): would need a subsection, this is a major new feature
Reply: added! And i also updated on top in description.