-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: algorithms.factorize moves null values when sort=False #46601
Changes from 28 commits
670c2e8
98c6c18
007329b
ffaf20c
58e5556
c600e9a
f7326bd
b0ec48a
395c9cf
d0796ed
351eb0d
cadab86
f44b7f3
f93f968
ef49c74
2a439eb
8378ba0
f2e24df
dc20283
b51e88f
4a36bf0
6db0685
ca53df0
372efe7
cf56135
0b85a3d
bc3f426
57a05a7
9c35dd0
a7c3538
b27bda0
b45ace7
c4cfbc6
ecb182c
7143a52
82b61b6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -509,7 +509,7 @@ def f(c, v): | |
|
||
def factorize_array( | ||
values: np.ndarray, | ||
na_sentinel: int = -1, | ||
na_sentinel: int | None = -1, | ||
size_hint: int | None = None, | ||
na_value: object = None, | ||
mask: npt.NDArray[np.bool_] | None = None, | ||
|
@@ -540,6 +540,10 @@ def factorize_array( | |
codes : ndarray[np.intp] | ||
uniques : ndarray | ||
""" | ||
ignore_na = na_sentinel is not None | ||
if not ignore_na: | ||
na_sentinel = -1 | ||
|
||
original = values | ||
if values.dtype.kind in ["m", "M"]: | ||
# _get_hashtable_algo will cast dt64/td64 to i8 via _ensure_data, so we | ||
|
@@ -552,7 +556,11 @@ def factorize_array( | |
|
||
table = hash_klass(size_hint or len(values)) | ||
uniques, codes = table.factorize( | ||
values, na_sentinel=na_sentinel, na_value=na_value, mask=mask | ||
values, | ||
na_sentinel=na_sentinel, | ||
na_value=na_value, | ||
rhshadrach marked this conversation as resolved.
Show resolved
Hide resolved
|
||
mask=mask, | ||
ignore_na=ignore_na, | ||
) | ||
|
||
# re-cast e.g. i8->dt64/td64, uint8->bool | ||
|
@@ -737,10 +745,7 @@ def factorize( | |
|
||
# GH35667, if na_sentinel=None, we will not dropna NaNs from the uniques | ||
# of values, assign na_sentinel=-1 to replace code value for NaN. | ||
dropna = True | ||
if na_sentinel is None: | ||
na_sentinel = -1 | ||
dropna = False | ||
dropna = na_sentinel is not None | ||
|
||
if ( | ||
isinstance(values, (ABCDatetimeArray, ABCTimedeltaArray)) | ||
|
@@ -753,38 +758,58 @@ def factorize( | |
|
||
elif not isinstance(values.dtype, np.dtype): | ||
if ( | ||
na_sentinel == -1 | ||
and "use_na_sentinel" in inspect.signature(values.factorize).parameters | ||
): | ||
na_sentinel == -1 or na_sentinel is None | ||
) and "use_na_sentinel" in inspect.signature(values.factorize).parameters: | ||
# Avoid using catch_warnings when possible | ||
# GH#46910 - TimelikeOps has deprecated signature | ||
codes, uniques = values.factorize( # type: ignore[call-arg] | ||
use_na_sentinel=True | ||
use_na_sentinel=na_sentinel is not None | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. im trying to follow this and keep getting lost here. is any of this going to get simpler once deprecations are enforced? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes - but how much depends on what you mean by "this". There is a lot of playing around with arguments (what I think you're highlighting here) that will all go away. But what I regard as the main complication this PR introduces, noted in the TODO immediately below, will not go away just with the deprecation. For that TODO, some changes to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I looked at safe_sort again, and @phofl already solved much of the issue in #47331. I opened a PR into this branch here to see what the execution of the TODO mentioned above would look like: rhshadrach#2 However, changing safe_sort in this way will induce another behavior change:
No other tests besides the one changed failed locally. While I do think this is a bugfix, I'd like to study more what impact it has on concat/reshaping/indexing and I think it may need some discussion. In particular, I don't think it should be done in this PR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel - friendly ping. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Agreed both that the "feature" behavior looks more correct and that it should be done separate from this PR. Taking another look now. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (Coming in fairly cold), I think I'm getting lost too here. I am not fully comprehending why we check There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yea - there are some gymnastics here no doubt. The idea behind this block is to avoid using catch_warnings since it's possible. Old API: na_sentinel is either an integer or None The correspondence is: use_na_sentinel False is equivalent to na_sentinel being None Note there is no option in the new API for na_sentinel being anything other than -1 or None in the old API. So we can use the new argument precisely when (a) the function has said argument and (b) na_sentinel is either -1 or None. In such a case, the correspondence from na_sentinel to use_na_sentinel is given by There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perfect, just the explanation I needed. Thanks! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll add this correspondence as a comment to the top of this function; we can remove it when the deprecation is enforced. |
||
) | ||
else: | ||
na_sentinel_arg = -1 if na_sentinel is None else na_sentinel | ||
with warnings.catch_warnings(): | ||
# We've already warned above | ||
warnings.filterwarnings("ignore", ".*use_na_sentinel.*", FutureWarning) | ||
codes, uniques = values.factorize(na_sentinel=na_sentinel) | ||
codes, uniques = values.factorize(na_sentinel=na_sentinel_arg) | ||
|
||
else: | ||
values = np.asarray(values) # convert DTA/TDA/MultiIndex | ||
# TODO: pass na_sentinel=na_sentinel to factorize_array. When sort is True and | ||
# na_sentinel is None we append NA on the end because safe_sort does not | ||
# handle null values in uniques. | ||
if na_sentinel is None and sort: | ||
na_sentinel_arg = -1 | ||
elif na_sentinel is None: | ||
na_sentinel_arg = None | ||
else: | ||
na_sentinel_arg = na_sentinel | ||
codes, uniques = factorize_array( | ||
values, na_sentinel=na_sentinel, size_hint=size_hint | ||
values, | ||
na_sentinel=na_sentinel_arg, | ||
size_hint=size_hint, | ||
) | ||
|
||
if sort and len(uniques) > 0: | ||
if na_sentinel is None: | ||
# TODO: Can remove when na_sentinel=na_sentinel as in TODO above | ||
na_sentinel = -1 | ||
uniques, codes = safe_sort( | ||
uniques, codes, na_sentinel=na_sentinel, assume_unique=True, verify=False | ||
) | ||
|
||
code_is_na = codes == na_sentinel | ||
if not dropna and code_is_na.any(): | ||
# na_value is set based on the dtype of uniques, and compat set to False is | ||
# because we do not want na_value to be 0 for integers | ||
na_value = na_value_for_dtype(uniques.dtype, compat=False) | ||
uniques = np.append(uniques, [na_value]) | ||
codes = np.where(code_is_na, len(uniques) - 1, codes) | ||
if not dropna and sort: | ||
# TODO: Can remove entire block when na_sentinel=na_sentinel as in TODO above | ||
if na_sentinel is None: | ||
na_sentinel_arg = -1 | ||
else: | ||
na_sentinel_arg = na_sentinel | ||
code_is_na = codes == na_sentinel_arg | ||
if code_is_na.any(): | ||
# na_value is set based on the dtype of uniques, and compat set to False is | ||
# because we do not want na_value to be 0 for integers | ||
na_value = na_value_for_dtype(uniques.dtype, compat=False) | ||
uniques = np.append(uniques, [na_value]) | ||
codes = np.where(code_is_na, len(uniques) - 1, codes) | ||
|
||
uniques = _reconstruct_data(uniques, original.dtype, original) | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"be passed" -> "takes" or "accepts", i.e. avoid passive voice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - fixed.