
BUG: algorithms.factorize moves null values when sort=False #46601

Merged Aug 18, 2022 (36 commits; changes shown from 10 commits)
670c2e8  BUG: algos.factorizes moves null values when sort=False (rhshadrach, Apr 1, 2022)
98c6c18  Implementation for pyarrow < 4.0 (rhshadrach, Apr 19, 2022)
007329b  fixup (rhshadrach, Apr 20, 2022)
ffaf20c  fixups (rhshadrach, Apr 20, 2022)
58e5556  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Apr 21, 2022)
c600e9a  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Apr 23, 2022)
f7326bd  test fixup (rhshadrach, Apr 23, 2022)
b0ec48a  type-hints (rhshadrach, Apr 24, 2022)
395c9cf  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Apr 24, 2022)
d0796ed  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Apr 29, 2022)
351eb0d  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Apr 30, 2022)
cadab86  improvements (rhshadrach, Apr 30, 2022)
f44b7f3  remove override of _from_factorized in string_.py (rhshadrach, Apr 30, 2022)
f93f968  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, May 3, 2022)
ef49c74  Rework na_sentinel/dropna/ignore_na (rhshadrach, May 5, 2022)
2a439eb  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, May 5, 2022)
8378ba0  fixups (rhshadrach, May 5, 2022)
f2e24df  fixup for pyarrow < 4.0 (rhshadrach, May 6, 2022)
dc20283  Merge branch 'main' into factorize_na (rhshadrach, May 6, 2022)
b51e88f  whatsnew note (rhshadrach, May 6, 2022)
4a36bf0  doc fixup (rhshadrach, May 7, 2022)
6db0685  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, May 7, 2022)
ca53df0  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, May 11, 2022)
372efe7  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Jun 27, 2022)
cf56135  fixups (rhshadrach, Jun 27, 2022)
0b85a3d  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Jul 2, 2022)
bc3f426  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Jul 6, 2022)
57a05a7  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Jul 9, 2022)
9c35dd0  fixup whatsnew note (rhshadrach, Jul 11, 2022)
a7c3538  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Jul 11, 2022)
b27bda0  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Jul 11, 2022)
b45ace7  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Jul 18, 2022)
c4cfbc6  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Jul 23, 2022)
ecb182c  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (rhshadrach, Aug 17, 2022)
7143a52  whatsnew note; comment on old vs new API (rhshadrach, Aug 17, 2022)
82b61b6  Merge branch 'main' of https://github.com/pandas-dev/pandas into fact… (Aug 17, 2022)
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
@@ -118,6 +118,7 @@ Other enhancements
- Added ``validate`` argument to :meth:`DataFrame.join` (:issue:`46622`)
- A :class:`errors.PerformanceWarning` is now thrown when using ``string[pyarrow]`` dtype with methods that don't dispatch to ``pyarrow.compute`` methods (:issue:`42613`)
- Added ``numeric_only`` argument to :meth:`Resampler.sum`, :meth:`Resampler.prod`, :meth:`Resampler.min`, :meth:`Resampler.max`, :meth:`Resampler.first`, and :meth:`Resampler.last` (:issue:`46442`)
- The method :meth:`.ExtensionArray.factorize` has gained the argument ``dropna`` for determining how null values are to be treated. (:issue:`46601`)

.. ---------------------------------------------------------------------------
.. _whatsnew_150.notable_bug_fixes:
@@ -635,6 +636,7 @@ Groupby/resample/rolling
- Bug in :meth:`Rolling.var` and :meth:`Rolling.std` would give non-zero result with window of same values (:issue:`42064`)
- Bug in :meth:`.Rolling.var` would segfault calculating weighted variance when window size was larger than data size (:issue:`46760`)
- Bug in :meth:`Grouper.__repr__` where ``dropna`` was not included. Now it is (:issue:`46754`)
- Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` with ``dropna=False`` and ``sort=False`` would put any null groups at the end instead of the order in which they are encountered (:issue:`46584`)

Reshaping
^^^^^^^^^
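The two whatsnew entries above describe how nulls flow through ``factorize`` once the ``dropna`` argument exists. A minimal pure-Python sketch (a toy model for illustration only, not the pandas implementation; ``None`` stands in for any null value) shows the intended semantics:

```python
def toy_factorize(values, na_sentinel=-1, dropna=True):
    """Toy model of factorize(); None stands in for any null value."""
    codes, uniques, seen = [], [], {}
    for val in values:
        if val is None and dropna:
            # dropna=True: nulls never enter uniques; they get na_sentinel
            codes.append(na_sentinel)
            continue
        if val not in seen:
            seen[val] = len(uniques)
            uniques.append(val)
        codes.append(seen[val])
    return codes, uniques

# With dropna=False the null keeps its encounter-order position (the bug
# being fixed: previously it was moved to the end when sort=False):
print(toy_factorize([1, None, 2], dropna=False))  # ([0, 1, 2], [1, None, 2])
print(toy_factorize([1, None, 2]))                # ([0, -1, 1], [1, 2])
```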
12 changes: 6 additions & 6 deletions pandas/_libs/hashtable_class_helper.pxi.in
@@ -658,7 +658,7 @@ cdef class {{name}}HashTable(HashTable):
return_inverse=return_inverse)

def factorize(self, const {{dtype}}_t[:] values, Py_ssize_t na_sentinel=-1,
object na_value=None, object mask=None):
object na_value=None, object mask=None, ignore_na=True):
"""
Calculate unique values and labels (no sorting!)

@@ -690,7 +690,7 @@
"""
uniques_vector = {{name}}Vector()
return self._unique(values, uniques_vector, na_sentinel=na_sentinel,
na_value=na_value, ignore_na=True, mask=mask,
na_value=na_value, ignore_na=ignore_na, mask=mask,
return_inverse=True)

def get_labels(self, const {{dtype}}_t[:] values, {{name}}Vector uniques,
@@ -1037,7 +1037,7 @@ cdef class StringHashTable(HashTable):
return_inverse=return_inverse)

def factorize(self, ndarray[object] values, Py_ssize_t na_sentinel=-1,
object na_value=None, object mask=None):
object na_value=None, object mask=None, ignore_na=True):
"""
Calculate unique values and labels (no sorting!)

@@ -1067,7 +1067,7 @@
"""
uniques_vector = ObjectVector()
return self._unique(values, uniques_vector, na_sentinel=na_sentinel,
na_value=na_value, ignore_na=True,
na_value=na_value, ignore_na=ignore_na,
return_inverse=True)

def get_labels(self, ndarray[object] values, ObjectVector uniques,
@@ -1290,7 +1290,7 @@ cdef class PyObjectHashTable(HashTable):
return_inverse=return_inverse)

def factorize(self, ndarray[object] values, Py_ssize_t na_sentinel=-1,
object na_value=None, object mask=None):
object na_value=None, object mask=None, ignore_na=True):
"""
Calculate unique values and labels (no sorting!)

@@ -1320,7 +1320,7 @@
"""
uniques_vector = ObjectVector()
return self._unique(values, uniques_vector, na_sentinel=na_sentinel,
na_value=na_value, ignore_na=True,
na_value=na_value, ignore_na=ignore_na,
return_inverse=True)

def get_labels(self, ndarray[object] values, ObjectVector uniques,
40 changes: 30 additions & 10 deletions pandas/core/algorithms.py
@@ -502,6 +502,7 @@ def factorize_array(
size_hint: int | None = None,
na_value=None,
mask: npt.NDArray[np.bool_] | None = None,
dropna: bool = True,
) -> tuple[npt.NDArray[np.intp], np.ndarray]:
"""
Factorize a numpy array to codes and uniques.
@@ -523,6 +524,11 @@
If not None, the mask is used as indicator for missing values
(True = missing, False = valid) instead of `na_value` or
condition "val != val".
dropna : bool, default True
Whether NA values will appear in uniques. When False, NA values
will receive a nonnegative code instead of na_sentinel.

.. versionadded:: 1.5.0

Returns
-------
@@ -541,7 +547,11 @@

table = hash_klass(size_hint or len(values))
uniques, codes = table.factorize(
values, na_sentinel=na_sentinel, na_value=na_value, mask=mask
values,
na_sentinel=na_sentinel,
na_value=na_value,
mask=mask,
ignore_na=dropna,
)

# re-cast e.g. i8->dt64/td64, uint8->bool
@@ -728,25 +738,35 @@ def factorize(

if not isinstance(values.dtype, np.dtype):
# i.e. ExtensionDtype
codes, uniques = values.factorize(na_sentinel=na_sentinel)
# TODO: pass ignore_na=dropna. When sort is True we ignore_na here and append
# on the end because safe_sort does not handle null values in uniques
codes, uniques = values.factorize(
na_sentinel=na_sentinel, dropna=dropna or sort
)
else:
values = np.asarray(values) # convert DTA/TDA/MultiIndex
# TODO: pass ignore_na=dropna; see above
codes, uniques = factorize_array(
values, na_sentinel=na_sentinel, size_hint=size_hint
values,
na_sentinel=na_sentinel,
size_hint=size_hint,
dropna=dropna or sort,
)

if sort and len(uniques) > 0:
uniques, codes = safe_sort(
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False
)

code_is_na = codes == na_sentinel
if not dropna and code_is_na.any():
# na_value is set based on the dtype of uniques, and compat set to False is
# because we do not want na_value to be 0 for integers
na_value = na_value_for_dtype(uniques.dtype, compat=False)
uniques = np.append(uniques, [na_value])
codes = np.where(code_is_na, len(uniques) - 1, codes)
if not dropna and sort:
# TODO: Can remove if we pass ignore_na=dropna; see above
code_is_na = codes == na_sentinel
if code_is_na.any():
# na_value is set based on the dtype of uniques; compat is set to False
# because we do not want na_value to be 0 for integers
na_value = na_value_for_dtype(uniques.dtype, compat=False)
uniques = np.append(uniques, [na_value])
codes = np.where(code_is_na, len(uniques) - 1, codes)

uniques = _reconstruct_data(uniques, original.dtype, original)

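The new ``if not dropna and sort`` block exists because ``safe_sort`` cannot handle null values in uniques: the values are factorized with nulls ignored, the uniques are sorted, and only then is the null appended with a fresh code. A pure-Python sketch of that post-sort step (hypothetical helper name, not the pandas source):

```python
def append_null_after_sort(codes, uniques, na_sentinel=-1, na_value=None):
    # Any codes still equal to na_sentinel are pointed at a null value
    # appended to the end of the (already sorted) uniques.
    code_is_na = [c == na_sentinel for c in codes]
    if any(code_is_na):
        uniques = uniques + [na_value]
        na_code = len(uniques) - 1
        codes = [na_code if is_na else c
                 for c, is_na in zip(codes, code_is_na)]
    return codes, uniques

# Sorted uniques [1.0, 2.0] with one null still coded as the sentinel:
print(append_null_after_sort([1, -1, 0], [1.0, 2.0]))
# ([1, 2, 0], [1.0, 2.0, None])
```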
24 changes: 22 additions & 2 deletions pandas/core/arrays/arrow/array.py
@@ -15,6 +15,7 @@
from pandas.compat import (
pa_version_under1p01,
pa_version_under2p0,
pa_version_under4p0,
pa_version_under5p0,
pa_version_under6p0,
)
@@ -122,8 +123,14 @@ def dropna(self: ArrowExtensionArrayT) -> ArrowExtensionArrayT:
return type(self)(pc.drop_null(self._data))

@doc(ExtensionArray.factorize)
def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]:
encoded = self._data.dictionary_encode()
def factorize(
self, na_sentinel: int = -1, dropna=True
) -> tuple[np.ndarray, ExtensionArray]:
if pa_version_under4p0:
encoded = self._data.dictionary_encode()
else:
null_encoding = "mask" if dropna else "encode"
encoded = self._data.dictionary_encode(null_encoding=null_encoding)
indices = pa.chunked_array(
[c.indices for c in encoded.chunks], type=encoded.type.index_type
).to_pandas()
@@ -133,6 +140,19 @@ def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]:

if encoded.num_chunks:
uniques = type(self)(encoded.chunk(0).dictionary)
if not dropna and pa_version_under4p0:
# Insert na with the proper code
na_mask = indices.values == na_sentinel
na_index = na_mask.argmax()
if na_mask[na_index]:
uniques = uniques.insert(na_index, self.dtype.na_value)
na_code = 0 if na_index == 0 else indices[:na_index].max() + 1
if na_sentinel < 0:
# codes can never equal na_sentinel and be >= na_code
indices[indices >= na_code] += 1
else:
indices[(indices >= na_code) & (indices != na_sentinel)] += 1
indices[indices == na_sentinel] = na_code
else:
uniques = type(self)(pa.array([], type=encoded.type.value_type))

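On pyarrow >= 4.0 the heavy lifting is done by ``dictionary_encode(null_encoding=...)``. A toy model (illustration only, not pyarrow itself) of the difference between the two modes the diff chooses between:

```python
def toy_dictionary_encode(values, null_encoding="mask"):
    # "mask": nulls stay out of the dictionary and get a null index;
    # "encode": the null receives its own dictionary slot, like any value.
    dictionary, indices, seen = [], [], {}
    for val in values:
        if val is None and null_encoding == "mask":
            indices.append(None)
            continue
        if val not in seen:
            seen[val] = len(dictionary)
            dictionary.append(val)
        indices.append(seen[val])
    return indices, dictionary

print(toy_dictionary_encode(["a", None, "b"], "mask"))    # ([0, None, 1], ['a', 'b'])
print(toy_dictionary_encode(["a", None, "b"], "encode"))  # ([0, 1, 2], ['a', None, 'b'])
```

With ``null_encoding="encode"`` no sentinel fix-up is needed afterwards, which is why the pre-4.0 branch has to splice the null code in by hand.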
11 changes: 9 additions & 2 deletions pandas/core/arrays/base.py
@@ -1002,14 +1002,21 @@ def _values_for_factorize(self) -> tuple[np.ndarray, Any]:
"""
return self.astype(object), np.nan

def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]:
def factorize(
self, na_sentinel: int = -1, dropna: bool = True
) -> tuple[np.ndarray, ExtensionArray]:
"""
Encode the extension array as an enumerated type.

Parameters
----------
na_sentinel : int, default -1
Value to use in the `codes` array to indicate missing values.
dropna : bool, default True
Whether NA values will appear in uniques. When False, NA values
will receive a nonnegative code instead of na_sentinel.

.. versionadded:: 1.5.0

Returns
-------
@@ -1044,7 +1051,7 @@ def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]:
arr, na_value = self._values_for_factorize()

codes, uniques = factorize_array(
arr, na_sentinel=na_sentinel, na_value=na_value
arr, na_sentinel=na_sentinel, na_value=na_value, dropna=dropna
)

uniques_ea = self._from_factorized(uniques, self)
5 changes: 3 additions & 2 deletions pandas/core/arrays/datetimelike.py
@@ -1878,7 +1878,7 @@ def _with_freq(self, freq):

# --------------------------------------------------------------

def factorize(self, na_sentinel=-1, sort: bool = False):
def factorize(self, na_sentinel: int = -1, dropna: bool = True, sort: bool = False):
if self.freq is not None:
# We must be unique, so can short-circuit (and retain freq)
codes = np.arange(len(self), dtype=np.intp)
@@ -1888,7 +1888,8 @@ def factorize(self, na_sentinel=-1, sort: bool = False):
uniques = uniques[::-1]
return codes, uniques
# FIXME: shouldn't get here; we are ignoring sort
return super().factorize(na_sentinel=na_sentinel)
result = super().factorize(na_sentinel=na_sentinel, dropna=dropna)
return result


# -------------------------------------------------------------------
33 changes: 30 additions & 3 deletions pandas/core/arrays/masked.py
@@ -869,16 +869,43 @@ def searchsorted(
return self._data.searchsorted(value, side=side, sorter=sorter)

@doc(ExtensionArray.factorize)
def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, ExtensionArray]:
def factorize(
self, na_sentinel: int = -1, dropna: bool = True
) -> tuple[np.ndarray, ExtensionArray]:
arr = self._data
mask = self._mask

codes, uniques = factorize_array(arr, na_sentinel=na_sentinel, mask=mask)
codes, uniques = factorize_array(
arr, na_sentinel=na_sentinel, mask=mask, dropna=True
)

# check that factorize_array correctly preserves dtype.
assert uniques.dtype == self.dtype.numpy_dtype, (uniques.dtype, self.dtype)

uniques_ea = type(self)(uniques, np.zeros(len(uniques), dtype=bool))
# Make room for a null value if we're not ignoring it and it exists
Review comment (Member): would it make sense to share any of this with the ArrowArray version? not for this PR, but could have a TODO

rhshadrach (Author, Apr 30, 2022): Yes, will add a TODO. Once we drop support for pyarrow < 4.0 we won't need this logic in ArrowArray, but 4.0 is only a year old at this point so that will be a while.

size = len(uniques) if dropna or not mask.any() else len(uniques) + 1
uniques_mask = np.zeros(size, dtype=bool)
if not dropna:
na_index = mask.argmax()
if mask[na_index]:
# Insert na with the proper code
if na_index == 0:
na_code = np.intp(0)
else:
# error: Slice index must be an integer or None
# https://github.com/python/mypy/issues/2410
na_code = codes[:na_index].max() + 1  # type: ignore[misc]
if na_sentinel < 0:
# codes can never equal na_sentinel and be >= na_code
codes[codes >= na_code] += 1
else:
codes[(codes >= na_code) & (codes != na_sentinel)] += 1
codes[codes == na_sentinel] = na_code
# dummy value for uniques; not used since uniques_mask will be True
uniques = np.insert(uniques, na_code, 0)
uniques_mask[na_code] = True
uniques_ea = type(self)(uniques, uniques_mask)

return codes, uniques_ea

@doc(ExtensionArray._values_for_argsort)
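The ``dropna=False`` branch above splices the null into the codes at its first-appearance position after a null-ignoring factorization: the null's code is one past the highest code assigned before the first null, and later codes are shifted up to make room. A pure-Python sketch of that fix-up (hypothetical helper, assumes ``na_sentinel < 0`` as in the simpler branch of the diff):

```python
def splice_null_code(codes, na_sentinel=-1):
    # `codes` comes from a factorization that ignored nulls, so null
    # positions still hold na_sentinel.
    if na_sentinel not in codes:
        return codes, None
    na_index = codes.index(na_sentinel)  # first null position
    # One past the highest code seen before the first null: the null
    # slots in right after the uniques already encountered.
    na_code = 0 if na_index == 0 else max(codes[:na_index]) + 1
    out = []
    for c in codes:
        if c == na_sentinel:
            out.append(na_code)      # the null gets a real code
        elif c >= na_code:
            out.append(c + 1)        # shift later codes to make room
        else:
            out.append(c)
    return out, na_code

# values a, a, b, <NA>, a, c  ->  null-ignoring codes [0, 0, 1, -1, 0, 2]
print(splice_null_code([0, 0, 1, -1, 0, 2]))  # ([0, 0, 1, 2, 0, 3], 2)
```

Note that the maximum code, not its position, determines ``na_code``: for codes ``[0, 0, 1]`` before the first null the correct ``na_code`` is 2, since only two uniques precede it.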
9 changes: 7 additions & 2 deletions pandas/core/arrays/sparse/array.py
@@ -847,13 +847,18 @@ def _values_for_factorize(self):
# Still override this for hash_pandas_object
return np.asarray(self), self.fill_value

def factorize(self, na_sentinel: int = -1) -> tuple[np.ndarray, SparseArray]:
def factorize(
self, na_sentinel: int = -1, dropna: bool = True
) -> tuple[np.ndarray, SparseArray]:
# Currently, ExtensionArray.factorize -> Tuple[ndarray, EA]
# The sparsity on this is backwards from what Sparse would want. Want
# ExtensionArray.factorize -> Tuple[EA, EA]
# Given that we have to return a dense array of codes, why bother
# implementing an efficient factorize?
codes, uniques = algos.factorize(np.asarray(self), na_sentinel=na_sentinel)
na_sentinel_arg = na_sentinel if dropna else None
codes, uniques = algos.factorize(np.asarray(self), na_sentinel=na_sentinel_arg)
if not dropna:
codes[codes == -1] = na_sentinel
uniques_sp = SparseArray(uniques, dtype=self.dtype)
return codes, uniques_sp

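The sparse path delegates to ``algos.factorize`` (with ``na_sentinel=None`` when ``dropna=False``) and only adjusts codes afterwards. A sketch of the remap step in the diff (hypothetical helper, not the pandas source):

```python
def remap_sentinel(codes, na_sentinel=-1):
    # The inner factorize uses -1 internally; map any leftover -1 codes
    # to the caller's na_sentinel.
    return [na_sentinel if c == -1 else c for c in codes]

print(remap_sentinel([0, -1, 1], na_sentinel=-2))  # [0, -2, 1]
```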
7 changes: 7 additions & 0 deletions pandas/core/arrays/string_.py
@@ -382,6 +382,13 @@ def _values_for_factorize(self):
arr[mask] = -1
return arr, -1

@classmethod
def _from_factorized(cls, values, original):
assert values.dtype == original._ndarray.dtype
# When dropna (i.e. ignore_na) is False, can get -1 from nulls
Review comment (Member): is there any way we could avoid this?

rhshadrach (Author): Thanks - yes. Changed _values_for_factorize from -1 to None.

values[values == -1] = None
return original._from_backing_data(values)

def __setitem__(self, key, value):
value = extract_array(value, extract_numpy=True)
if isinstance(value, type(self)):
Expand Down
5 changes: 3 additions & 2 deletions pandas/core/groupby/grouper.py
@@ -657,9 +657,10 @@ def group_index(self) -> Index:

@cache_readonly
def _codes_and_uniques(self) -> tuple[npt.NDArray[np.signedinteger], ArrayLike]:
if self._passed_categorical:
if self._dropna and self._passed_categorical:
# we make a CategoricalIndex out of the cat grouper
# preserving the categories / ordered attributes
# preserving the categories / ordered attributes;
# doesn't (yet) handle dropna=False
Review comment (Member): GH ref for the "yet"?

rhshadrach (Author): Opened #46909, will add a reference in this comment.

cat = self.grouping_vector
categories = cat.categories
