RFC: add support for nan* reductions #621
Comments
Good question, thanks @betatim. I think we want a general pattern here that can apply to any reduction. That's a bit of an unsolved problem even in NumPy, where we rejected more niche This discussion seems relevant: pytorch/pytorch#38349 (comment). SciPy's
A generic implementation is pretty fiddly:

import numpy as np

def nanreduction(func, x, axis=None):
    if axis is None:
        return func(x[~np.isnan(x)])
    else:
        # normalize axis, then drop the reduced dimension from the shape:
        out_shape = x.shape[:axis] + x.shape[axis + 1:]
        out = np.empty_like(x, shape=out_shape)
        for idx in np.ndindex(*out_shape):
            # call `func` in a loop here. this is pretty annoying to get right
            ...

NumPy's machinery is pure Python so can be looked at for guidance. It does also raise questions like how to deal with all-nan slices (e.g., https://github.com/numpy/numpy/blob/main/numpy/lib/nanfunctions.py#L355-L360); those warnings tend to not be useful.
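To make the "fiddly" part concrete, here is a minimal sketch of how the loop could be completed. The helper name `nanreduction_along_axis` is hypothetical, and the sketch assumes NumPy, a single integer `axis`, and a `func` that returns a scalar of the input's dtype. It is not how NumPy's own nan-functions are implemented.

```python
import numpy as np


def nanreduction_along_axis(func, x, axis=None):
    """Apply the reduction `func` to the non-NaN values of `x` (hypothetical helper)."""
    x = np.asarray(x)
    if axis is None:
        return func(x[~np.isnan(x)])
    # Move the reduced axis to the end so that each output element
    # corresponds to exactly one 1-D slice x_moved[idx].
    x_moved = np.moveaxis(x, axis, -1)
    out = np.empty(x_moved.shape[:-1], dtype=x.dtype)
    for idx in np.ndindex(*out.shape):
        lane = x_moved[idx]
        # `func` only sees the non-NaN entries of this slice; an all-NaN
        # slice reaches `func` as an empty array.
        out[idx] = func(lane[~np.isnan(lane)])
    return out


x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])
print(nanreduction_along_axis(np.max, x, axis=0))  # column-wise maximum ignoring NaNs
```

The Python-level loop is slow for large arrays, and an all-NaN slice simply inherits whatever `func` does with an empty input, which is the open question raised above.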
I think the reduction should allow passing an explicit scalar value (typically 0.) as a kwarg, and raise an error if all-nan slices occur while this kwarg was left to
I'm not sure whether the better default is to return

>>> np.nanmin([np.nan])
RuntimeWarning: All-NaN axis encountered
...
nan
I think I had a better idea:

def nanreduction(func: Callable) -> Callable:
    # Takes any reduction in the standard as input, and returns a `nan<reduction>`
    # function that does the expected thing. This is fully functional and has a
    # small API surface, and as a bonus would be more general than the
    # 14 `nan*` functions that numpy has (numpy has rejected requests for other
    # nan-functions because of the impact on API size).
    ...

# Usage equivalent to `nanmax(x)`:
nanreduction(xp.max)(x)

The implementation would be quite straightforward; it could be a simple dict lookup that maps
Sounds good to me but ideally with an extra parameter to control the scalar value used in case of all nans. The default behavior could be to raise an exception.
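A minimal sketch of that behaviour for a NaN-skipping maximum, assuming NumPy. The function name `nanmax_strict` and the parameter name `fill_value` are hypothetical and not part of any library, and zero-size inputs are not handled.

```python
import numpy as np


def nanmax_strict(x, axis=None, fill_value=None):
    """Hypothetical nanmax: all-NaN slices raise unless `fill_value` is given."""
    x = np.asarray(x, dtype=float)
    mask = np.isnan(x)
    all_nan = mask.all(axis=axis)
    if fill_value is None and np.any(all_nan):
        raise ValueError("all-NaN slice encountered and no fill_value given")
    # -inf can never win a maximum, so it stands in for the skipped NaNs.
    result = np.max(np.where(mask, -np.inf, x), axis=axis)
    if fill_value is None:
        return result
    # Slices without any valid entry report the user-supplied scalar.
    return np.where(all_nan, fill_value, result)


x = np.array([[1.0, np.nan],
              [np.nan, np.nan]])
print(nanmax_strict(x, axis=1, fill_value=0.0))
```

Calling `nanmax_strict(x, axis=1)` without `fill_value` raises, because the second row has no non-NaN entries.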
Ugh, this seems like a pretty obvious bug in NumPy:

>>> np.max([])
ValueError: zero-size array to reduction operation maximum which has no identity
>>> np.max([], initial=3)
3.0
>>> np.nanmax([])
ValueError: zero-size array to reduction operation maximum which has no identity
>>> np.nanmax([], initial=3)
<ipython-input-13-63ff44c70a84>:1: RuntimeWarning: All-NaN axis encountered
  np.nanmax([], initial=3)
nan

Other than that, I think your request is basically to support the
Indeed, I did not know about
@betatim you may want to express your opinion on the proposed API above :)
Sounds reasonable to me. I'm not sure I understand the comment:
The way I understand the proposal is that the Array API provides
@betatim what I meant was that the implementation in numpy could look something like:

_nanreductions_mapping = {
    np.max: np.nanmax,
    np.min: np.nanmin,
    np.mean: np.nanmean,
    # and so on for all existing nan* funcs
}

def nanreduction(func: Callable) -> Callable:
    try:
        return _nanreductions_mapping[func]
    except KeyError:
        # generic implementation, create new function that:
        # 1. creates array with correct output shape
        # 2. per axis, fills output array with func(x[~isnan(x)])
        ...
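For illustration, the lookup-plus-fallback idea could be fleshed out roughly as follows. This is a sketch under assumptions: `nanreduction` is the hypothetical API proposed in this thread, not an existing NumPy function, and the fallback uses `np.apply_along_axis` with a single integer `axis`, which is simple but slow.

```python
from typing import Callable

import numpy as np

# Reductions for which NumPy already ships a dedicated nan* function.
_NAN_REDUCTIONS = {
    np.max: np.nanmax,
    np.min: np.nanmin,
    np.mean: np.nanmean,
    np.sum: np.nansum,
}


def nanreduction(func: Callable) -> Callable:
    """Return a NaN-skipping version of `func` (hypothetical API from this thread)."""
    try:
        return _NAN_REDUCTIONS[func]
    except KeyError:
        pass

    def generic(x, axis=None):
        x = np.asarray(x)
        if axis is None:
            return func(x[~np.isnan(x)])
        # Slow fallback: drop the NaNs from every 1-D slice along `axis`.
        return np.apply_along_axis(lambda lane: func(lane[~np.isnan(lane)]), axis, x)

    return generic


x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])
print(nanreduction(np.max)(x))          # dispatches to np.nanmax
print(nanreduction(np.ptp)(x, axis=0))  # no np.nanptp exists, so the fallback runs
```

The fallback is what would let the same call work for reductions that have no dedicated `nan*` counterpart, including user-defined ones.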
The functional approach feels a little odd to me. Functions that return other functions aren't typically used in Python, except as decorators. Is the idea that this could in principle be applied to a user defined reduction? I'm honestly not seeing why this is better than just adding a keyword to the various reductions. For empty/all nan reductions, I think it should behave the same as the normal reduction. For
Yes, and also to reductions in
The response I expect to that from numpy folks at least (and it'd be my own too) is "why a keyword, we already covered this with separate functions". If it's a more general API instead, the case is that it can apply to all reductions (including those outside of numpy), and there's a reason to introduce it beyond array API standard support.
I would be curious to hear the thoughts from numpy folks on this API. That should be done before anything is added to the standard.
cc: @seberg for visibility
NumPy 2.0 could be a time to move away from separate functions for Given the precedent in dataframe-land (e.g., pandas), supporting a
I am not in favor of making 2.0 an excuse for such a change, but I am happy to entertain it in general and deprecate the The question is really how it would interact with the ufunc machinery and be implemented in it. Maybe it is close to the existing
You can already do that right now?
Not exactly:
the handling around having zero non-nan entries may not be trivial without an identity.
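One reading of this exchange, assuming it refers to the `where=` and `initial=` keywords that NumPy reductions already accept, is sketched below. It shows why the identity matters: `sum` has the identity 0 and works with a mask alone, while `max` has none and needs an explicit `initial`, which then also becomes the result for slices without any valid entry.

```python
import numpy as np

x = np.array([[1.0, np.nan],
              [np.nan, np.nan]])
mask = ~np.isnan(x)

# sum has the identity 0, so masking out the NaNs is enough.
row_sum = np.sum(x, axis=1, where=mask)

# max has no identity: NumPy requires `initial=` whenever `where=` is used,
# and the all-NaN row reports that initial value instead of nan.
row_max = np.max(x, axis=1, where=mask, initial=-np.inf)

print(row_sum)
print(row_max)
```

Whether `-inf`, `nan`, or an error is the right answer for the second row is exactly the policy question discussed in this thread.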
I agree with this.
I guess that is relevant from a pragmatic implementation-in-numpy perspective, but should it matter much for this discussion? There's only a subset of reductions (the ones with a binary ufunc equivalent) that are under the hood implemented by forwarding to
Agreed - time to post to the numpy-discussion list perhaps. Most implementations, whether as a keyword like pandas/Matlab or functional like Julia, only provide boolean options: "skip" or "include". I'll note that Matlab does have multiple flags for treating NA/NaN/NaT differently. SciPy on the other hand provides a string keyword, with skip/include/raise options. Here is the design doc for the implementation: http://scipy.github.io/devdocs/dev/api-dev/nan_policy.html. The "raise" option also comes up in other places, like screening for There's also the need for some functions to specify identity value (e.g., with an Does anyone else see the need to include SciPy's "raise" flavor? I suspect it's better handled with a separate function or keyword, that checks and raises if any
Just to say it would be really nice for xarray if this were solved, because practically all our aggregations try to skip NaNs by default by using
Of the options discussed thus far, my personal preference is following SciPy in having a "nan_policy" kwarg. The "omit" and "propagate" options map cleanly to While decorators as a design pattern are common in Python, we don't have any precedent for the type of decorator/factory function proposed above in prominent array libraries. While that, in and of itself, is not disqualifying, we should consider whether such a proposal would get buy-in, not just in NumPy, but elsewhere in the ecosystem, and whether the path to standardization might be better achieved through more established design patterns already present in array libraries and the current standard. Thus far, when we've wanted to overload behavior, we've either parameterized through kwargs or we've split into separate functions (e.g., With particular regard to NumPy, an advantage of SciPy's NumPy (and others) would, of course, be free to keep the So in short, I'd like to propose that we move to adopt SciPy's precedent of a
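As a rough illustration of the keyword approach, the three SciPy-style policies could map onto existing NumPy functions like this. The wrapper name `nan_aware_max` is hypothetical; only the policy strings "propagate", "omit" and "raise" are taken from SciPy's convention.

```python
import numpy as np


def nan_aware_max(x, axis=None, nan_policy="propagate"):
    """Hypothetical wrapper showing the three SciPy-style NaN policies."""
    x = np.asarray(x)
    if nan_policy == "propagate":
        return np.max(x, axis=axis)      # NaNs flow through to the result
    if nan_policy == "omit":
        return np.nanmax(x, axis=axis)   # NaNs are skipped
    if nan_policy == "raise":
        if np.isnan(x).any():
            raise ValueError("input contains NaN")
        return np.max(x, axis=axis)
    raise ValueError(f"unknown nan_policy: {nan_policy!r}")


x = np.array([1.0, np.nan, 3.0])
print(nan_aware_max(x))                     # propagates the NaN
print(nan_aware_max(x, nan_policy="omit"))  # ignores the NaN
```

A design like this keeps a single function per reduction, at the cost of every reduction having to understand the extra keyword.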
I think a keyword would be a very reasonable outcome. What would perhaps make sense to untangle the chicken-and-egg problem here is:
That offers a way forward, while avoiding adding something in the standard that has a potentially large implementation burden (which would help no one, and make future versions of the standard harder to adopt if the work isn't done).
I've worked on making I'd agree that a standard Either way, I think it's relevant to mention here: in scipy/scipy#20363, I drafted a function that accepts an array-API compatible namespace and returns a (nearly) array-API compatible masked array namespace (composed of calls from the provided namespace). This could be useful either as a default implementation for the
Is there advice around handling NaNs and how to translate NumPy code to using the Array API?
In particular I have code like np.nanmin(X, axis=0) that I would like to rewrite so that it works with NumPy, Torch, etc. arrays. To me the "obvious" translation seems to be xp.min(X[~xp.isnan(X)], axis=0), but this doesn't work if X has a shape like (7500, 4) (here xp is the namespace of the array X). Another option I looked for is a where= argument to min(), but that doesn't exist unfortunately.
Does anyone have advice on this topic, or know if there is work planned on this?
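The boolean-mask version fails because indexing with a boolean array flattens the result to 1-D, so the per-column structure is lost before the reduction runs. One possible workaround, sketched here with a hypothetical helper `nanmin_via_where`, replaces NaNs with +inf instead of removing them. It assumes `xp` is an array-API-compatible namespace (NumPy stands in for it below; PyTorch would typically need a compatibility layer such as array-api-compat) and a floating-point `X`.

```python
import numpy as np


def nanmin_via_where(xp, X, axis=None):
    """NaN-skipping minimum using only isnan, where, min, all and asarray."""
    mask = xp.isnan(X)
    # Replace NaNs by +inf, which can never win a minimum.
    filled = xp.where(mask, xp.asarray(xp.inf, dtype=X.dtype), X)
    result = xp.min(filled, axis=axis)
    # Slices that were entirely NaN would otherwise report +inf; restore nan.
    all_nan = xp.all(mask, axis=axis)
    return xp.where(all_nan, xp.asarray(xp.nan, dtype=X.dtype), result)


X = np.array([[1.0, np.nan],
              [np.nan, np.nan],
              [0.5, np.nan]])
print(nanmin_via_where(np, X, axis=0))  # per-column minimum, nan for the all-NaN column
```

Unlike `np.nanmin`, this sketch emits no warning for all-NaN slices; it simply returns nan there.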