Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: replace dropna=False option with na_sentinel=None in factorize #35852

Merged
Merged
Show file tree
Hide file tree
Changes from 41 commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
7e461a1
remove \n from docstring
charlesdong1991 Dec 3, 2018
1314059
fix conflicts
charlesdong1991 Jan 19, 2019
8bcb313
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Jul 30, 2019
24c3ede
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Jan 14, 2020
dea38f2
fix issue 17038
charlesdong1991 Jan 14, 2020
cd9e7ac
revert change
charlesdong1991 Jan 14, 2020
e5e912b
revert change
charlesdong1991 Jan 14, 2020
045a76f
Merge remote-tracking branch 'upstream/master'
charlesdong1991 Apr 6, 2020
a61367b
Merge remote-tracking branch 'upstream/master' into add_dropna_factorize
charlesdong1991 Aug 22, 2020
2c451a9
add dropna doc for factorize
charlesdong1991 Aug 22, 2020
31ffca7
rephrase the doc
charlesdong1991 Aug 22, 2020
32d029d
flake8
charlesdong1991 Aug 22, 2020
ba93eb6
fixup
charlesdong1991 Aug 22, 2020
2330908
use NaN
charlesdong1991 Aug 22, 2020
b9850a2
add dropna in series.factorize
charlesdong1991 Aug 22, 2020
3a18c65
black
charlesdong1991 Aug 22, 2020
9d7f1e6
add test
charlesdong1991 Aug 22, 2020
97fd2e6
linting
charlesdong1991 Aug 22, 2020
68527ef
linting
charlesdong1991 Aug 22, 2020
817905c
doct
charlesdong1991 Aug 22, 2020
364aeae
fix black
charlesdong1991 Aug 22, 2020
7cd0cce
fixup
charlesdong1991 Aug 22, 2020
2368223
fix doctest
charlesdong1991 Aug 22, 2020
8ca0652
add whatsnew
charlesdong1991 Aug 22, 2020
b452513
linting
charlesdong1991 Aug 22, 2020
344c072
fix test
charlesdong1991 Aug 22, 2020
1a5c358
try one time
charlesdong1991 Aug 22, 2020
81a0a7e
Merge remote-tracking branch 'upstream/master' into add_dropna_factorize
charlesdong1991 Aug 22, 2020
4607953
hide dropna and use na_sentinel=None
charlesdong1991 Aug 27, 2020
a7d4abd
Merge remote-tracking branch 'upstream/master' into add_dropna_factorize
charlesdong1991 Aug 27, 2020
3ef1459
update whatsnew
charlesdong1991 Aug 27, 2020
fca7300
rename test function
charlesdong1991 Aug 27, 2020
f0a6556
remove dropna from factorize
charlesdong1991 Aug 27, 2020
c81e79e
update doc
charlesdong1991 Aug 27, 2020
37ca034
docstring
charlesdong1991 Aug 27, 2020
4f0f226
update doc
charlesdong1991 Aug 27, 2020
5fcabe7
add comment
charlesdong1991 Aug 27, 2020
b7cd915
code change on review
charlesdong1991 Aug 28, 2020
8a2a1f7
update doc
charlesdong1991 Aug 28, 2020
e0c7342
code change on review
charlesdong1991 Aug 28, 2020
0480d9f
minor move in whatsnew
charlesdong1991 Aug 28, 2020
5c87cd1
add default example
charlesdong1991 Sep 1, 2020
b70e595
Merge remote-tracking branch 'upstream/master' into add_dropna_factorize
charlesdong1991 Sep 1, 2020
076fc10
doc
charlesdong1991 Sep 1, 2020
c945457
one more try
charlesdong1991 Sep 1, 2020
e6c7434
explicit doc
charlesdong1991 Sep 1, 2020
7533ab0
Merge remote-tracking branch 'upstream/master' into add_dropna_factorize
charlesdong1991 Sep 2, 2020
bf8641a
add space
charlesdong1991 Sep 2, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions doc/source/whatsnew/v1.1.2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,14 @@ Bug fixes

.. ---------------------------------------------------------------------------

.. _whatsnew_112.other:

Other
~~~~~
- :meth:`factorize` now supports ``na_sentinel=None`` to include NaN in the uniques of the values and remove ``dropna`` keyword which was unintentionally exposed to public facing API in 1.1 version from :meth:`factorize`(:issue:`35667`)

.. ---------------------------------------------------------------------------

.. _whatsnew_112.contributors:

Contributors
Expand Down
27 changes: 23 additions & 4 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -526,9 +526,8 @@ def _factorize_array(
def factorize(
values,
sort: bool = False,
na_sentinel: int = -1,
na_sentinel: Optional[int] = -1,
size_hint: Optional[int] = None,
dropna: bool = True,
) -> Tuple[np.ndarray, Union[np.ndarray, ABCIndex]]:
"""
Encode the object as an enumerated type or categorical variable.
Expand All @@ -541,8 +540,11 @@ def factorize(
Parameters
----------
{values}{sort}
na_sentinel : int, default -1
Value to mark "not found".
na_sentinel : int or None, default -1
Value to mark "not found". If None, will not drop the NaN
from the uniques of the values.
charlesdong1991 marked this conversation as resolved.
Show resolved Hide resolved

.. versionchanged:: 1.1.2
{size_hint}\

Returns
Expand Down Expand Up @@ -620,6 +622,16 @@ def factorize(
array([0, 0, 1]...)
>>> uniques
Index(['a', 'c'], dtype='object')

If NaN is in the values, and we want to include NaN in the uniques of the
values, it can be achieved by setting ``na_sentinel=None``.

>>> values = np.array([1, 2, 1, np.nan])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the same example above this as well but w/o setting na_sentinel

>>> codes, uniques = pd.factorize(values, na_sentinel=None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally put a blank line as othewise this is very hard to read

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

>>> codes
array([0, 1, 0, 2])
>>> uniques
array([ 1., 2., nan])
"""
# Implementation notes: This method is responsible for 3 things
# 1.) coercing data to array-like (ndarray, Index, extension array)
Expand All @@ -633,6 +645,13 @@ def factorize(
values = _ensure_arraylike(values)
original = values

# GH35667, if na_sentinel=None, we will not dropna NaNs from the uniques
# of values, assign na_sentinel=-1 to replace code value for NaN.
dropna = True
if na_sentinel is None:
na_sentinel = -1
dropna = False

if is_extension_array_dtype(values.dtype):
values = extract_array(values)
codes, uniques = values.factorize(na_sentinel=na_sentinel)
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1398,7 +1398,7 @@ def memory_usage(self, deep=False):
"""
),
)
def factorize(self, sort=False, na_sentinel=-1):
def factorize(self, sort: bool = False, na_sentinel: Optional[int] = -1):
return algorithms.factorize(self, sort=sort, na_sentinel=na_sentinel)

_shared_docs[
Expand Down
7 changes: 6 additions & 1 deletion pandas/core/groupby/grouper.py
Original file line number Diff line number Diff line change
Expand Up @@ -585,8 +585,13 @@ def _make_codes(self) -> None:
codes = self.grouper.codes_info
uniques = self.grouper.result_index
else:
# GH35667, replace dropna=False with na_sentinel=None
if not self.dropna:
na_sentinel = None
else:
na_sentinel = -1
codes, uniques = algorithms.factorize(
self.grouper, sort=self.sort, dropna=self.dropna
self.grouper, sort=self.sort, na_sentinel=na_sentinel
jreback marked this conversation as resolved.
Show resolved Hide resolved
)
uniques = Index(uniques, name=self.name)
self._codes = codes
Expand Down
13 changes: 13 additions & 0 deletions pandas/tests/base/test_factorize.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,3 +26,16 @@ def test_factorize(index_or_series_obj, sort):

tm.assert_numpy_array_equal(result_codes, expected_codes)
tm.assert_index_equal(result_uniques, expected_uniques)


def test_series_factorize_na_sentinel_none():
# GH35667
values = np.array([1, 2, 1, np.nan])
ser = pd.Series(values)
codes, uniques = ser.factorize(na_sentinel=None)

expected_codes = np.array([0, 1, 0, 2], dtype="int64")
expected_uniques = pd.Index([1.0, 2.0, np.nan])

tm.assert_numpy_array_equal(codes, expected_codes)
tm.assert_index_equal(uniques, expected_uniques)
44 changes: 9 additions & 35 deletions pandas/tests/test_algos.py
Original file line number Diff line number Diff line change
Expand Up @@ -340,73 +340,47 @@ def test_factorize_na_sentinel(self, sort, na_sentinel, data, uniques):
tm.assert_extension_array_equal(uniques, expected_uniques)

@pytest.mark.parametrize(
"data, dropna, expected_codes, expected_uniques",
"data, expected_codes, expected_uniques",
[
(
["a", None, "b", "a"],
True,
np.array([0, -1, 1, 0], dtype=np.dtype("intp")),
np.array(["a", "b"], dtype=object),
),
(
["a", np.nan, "b", "a"],
True,
np.array([0, -1, 1, 0], dtype=np.dtype("intp")),
np.array(["a", "b"], dtype=object),
),
(
["a", None, "b", "a"],
False,
np.array([0, 2, 1, 0], dtype=np.dtype("intp")),
np.array(["a", "b", np.nan], dtype=object),
),
(
["a", np.nan, "b", "a"],
False,
np.array([0, 2, 1, 0], dtype=np.dtype("intp")),
np.array(["a", "b", np.nan], dtype=object),
),
],
)
def test_object_factorize_dropna(
self, data, dropna, expected_codes, expected_uniques
def test_object_factorize_na_sentinel_none(
self, data, expected_codes, expected_uniques
):
codes, uniques = algos.factorize(data, dropna=dropna)
codes, uniques = algos.factorize(data, na_sentinel=None)

tm.assert_numpy_array_equal(uniques, expected_uniques)
tm.assert_numpy_array_equal(codes, expected_codes)

@pytest.mark.parametrize(
"data, dropna, expected_codes, expected_uniques",
"data, expected_codes, expected_uniques",
[
(
[1, None, 1, 2],
True,
np.array([0, -1, 0, 1], dtype=np.dtype("intp")),
np.array([1, 2], dtype="O"),
),
(
[1, np.nan, 1, 2],
True,
np.array([0, -1, 0, 1], dtype=np.dtype("intp")),
np.array([1, 2], dtype=np.float64),
),
(
[1, None, 1, 2],
False,
np.array([0, 2, 0, 1], dtype=np.dtype("intp")),
np.array([1, 2, np.nan], dtype="O"),
),
(
[1, np.nan, 1, 2],
False,
np.array([0, 2, 0, 1], dtype=np.dtype("intp")),
np.array([1, 2, np.nan], dtype=np.float64),
),
],
)
def test_int_factorize_dropna(self, data, dropna, expected_codes, expected_uniques):
codes, uniques = algos.factorize(data, dropna=dropna)
def test_int_factorize_na_sentinel_none(
self, data, expected_codes, expected_uniques
):
codes, uniques = algos.factorize(data, na_sentinel=None)

tm.assert_numpy_array_equal(uniques, expected_uniques)
tm.assert_numpy_array_equal(codes, expected_codes)
Expand Down