RFC: add support for `nan*` reductions #621

betatim · 2023-04-18T12:56:55Z

Is there advice around handling NaNs and how to translate Numpy code to using the Array API?

In particular I have code like np.nanmin(X, axis=0) that I would like to rewrite so that it works with Numpy, Torch, etc arrays.

To me the "obvious" translation seems to be xp.min(X[~xp.isnan(X)], axis=0) but this doesn't work if X has a shape like (7500, 4) (here xp is the namespace of the array X). Another option I looked for is a where= argument to min(), but that doesn't exist unfortunately.

Does anyone have advice on this topic or knows if there is work planned on this?

The text was updated successfully, but these errors were encountered:

rgommers · 2023-04-18T14:19:24Z

Good question, thanks @betatim. I think we want a general pattern here that can apply to any reduction. That's a bit of an unsolved problem even in NumPy, where we rejected more niche nan* feature requests like nanptp and nancov.

This discussion seems relevant: pytorch/pytorch#38349 (comment). SciPy's nan_policy is now in better shape, but I'm still not sure on what the best approach in general is. Options include:

a skip_nan keyword
a context manager or other generic method to do achieve the same effect
a compat library approach with a fast path for libraries known to offer specific functions like nanmin and a generic fallback implementation that iterates over the specified axis=.

A generic implementation is pretty fiddly

def nanreduction(func, x, axis=None):
    if axis is None:
        return func(x[~np.isnan(x)])
    else:
        # normalize axis, then:
        out_shape = list(x.shape).remove(x.shape[axis])
        out = np.empty_like(x, shape=out_shape)
        for i in range(axis):
            # call `func` in a loop here. this is pretty annoying to get right
            ...

NumPy's machinery is pure Python so can be looked at for guidance. It does also raise questions like how to deal with all-nan slices (e.g., https://github.com/numpy/numpy/blob/main/numpy/lib/nanfunctions.py#L355-L360); those warnings tend to not be useful.

ogrisel · 2023-04-20T12:46:10Z

how to deal with all-nan slices

I think the reduction should allow to pass an explicit scalar value (e.g. typically 0.) as a kwarg and raise an error if all-nan slices occur while this kwarg was left to None.

rgommers · 2023-04-21T14:25:10Z

I think the reduction should allow to pass an explicit scalar value (e.g. typically 0.) as a kwarg and raise an error if all-nan slices occur while this kwarg was left to None.

I'm not sure whether the better default is to return nan or raise an exception, but it should be one or the other I think. These warnings are quite annoying:

>>> np.nanmin([np.nan])
RuntimeWarning: All-NaN axis encountered
...
nan

rgommers · 2023-05-18T18:44:05Z

I think I had a better idea:

def nanreduction(func : Callable) -> Callable:
    # Takes any reduction in the standard as input, and returns a `nan<reduction>`
    # function that does the expected thing. This is fully functional and has a
    # small API surface, and as a bonus would be more general than the 
    # 14 `nan*` functions that numpy has (numpy has rejected requests for other
    # nan-functions because of the impact on API size).

# Usage equivalent to `nanmax(x)`:
nanreduction(xp.max)(x)

The implementation would be quite straightforward, it could be a simple dict lookup that maps max to nanmax, and so on.

@betatim @ogrisel WDYT?

ogrisel · 2023-05-19T06:31:46Z

Sounds good to me but ideally with an extra parameter to control the scalar value used in case of all nans. The default behavior could be to raise an exception.

rgommers · 2023-05-19T16:39:42Z

but ideally with an extra parameter to control the scalar value used in case of all nans

Ugh, this seems like a pretty obvious bug in NumPy:

>>> np.max([])
ValueError: zero-size array to reduction operation maximum which has no identity
>>> np.max([], initial=3)
3.0

>>> np.nanmax([])
ValueError: zero-size array to reduction operation maximum which has no identity
>>> np.nanmax([], initial=3)
<ipython-input-13-63ff44c70a84>:1: RuntimeWarning: All-NaN axis encountered
  np.nanmax([], initial=3)
nan

Other than that, I think your request is basically to support the initial keyword in all reductions, right?

ogrisel · 2023-05-22T11:45:39Z

Other than that, I think your request is basically to support the initial keyword in all reductions, right?

Indeed, I did not know about initial. I think the warning could be omitted.

ogrisel · 2023-05-22T13:47:47Z

@betatim you want want to express your opinion on the proposed API above :)

betatim · 2023-05-22T19:30:14Z

Sounds reasonable to me. I'm not sure I understand the comment:

The implementation would be quite straightforward, it could be a simple dict lookup that maps max to nanmax, and so on.

The way I understand the proposal is that the Array API provides nanreduction and that it is left to the user to use that together with xp.max to make themselves a nanmax.

rgommers · 2023-05-23T06:37:59Z

@betatim what I meant was that the implementation in numpy could look something like:

_nanreductions_mapping = dict(
    np.max: np.nanmax,
    np.min: np.nanmin,
    np.mean: np.nanmean,
     # and so on for all existing nan* funcs
)
def nanreduction(func : Callable) -> Callable:
    try:
        return _nanreductions_mapping[func]
    except KeyError:
        # generic implementation, create new function that:
        # 1. creates array with correct output shape
        # 2. per axis, fills output array with func(x[~isnan(x)])

asmeurer · 2023-05-23T20:15:18Z

The functional approach feels a little odd to me. Functions that return other functions aren't typically used in Python, except as decorators. Is the idea that this could in principle be applied to a user defined reduction?

I'm honestly not seeing why this is better than just adding a keyword to the various reductions.

For empty/all nan reductions, I think it should behave the same as the normal reduction. For min and max, we specify that it's implementation-defined, https://data-apis.org/array-api/latest/API_specification/generated/array_api.max.html, but for mean and std we say it must be nan https://data-apis.org/array-api/draft/API_specification/generated/array_api.mean.html https://data-apis.org/array-api/draft/API_specification/generated/array_api.std.html.

rgommers · 2023-05-24T18:19:00Z

The functional approach feels a little odd to me. Functions that return other functions aren't typically used in Python, except as decorators. Is the idea that this could in principle be applied to a user defined reduction?

Yes, and also to reductions in numpy et al. that do not have a nanxxx variant.

I'm honestly not seeing why this is better than just adding a keyword to the various reductions.

The response I expect to that from numpy folks at least (and it'd be my own too) is "why a keyword, we already covered this with separate functions". If it's a more general API instead, the case is that it can apply to all reductions (including those outside of numpy), and there's a reason to introduce it beyond array API standard support.

asmeurer · 2023-05-24T20:01:57Z

I would be curious to hear the thoughts from numpy folks on this API. That should be done before anything is added to the standard.

leofang · 2023-05-31T02:53:40Z

I would be curious to hear the thoughts from numpy folks on this API. That should be done before anything is added to the standard.

cc: @seberg for vis

kgryte · 2023-05-31T04:40:53Z

The response I expect to that from numpy folks at least (and it'd be my own too) is "why a keyword, we already covered this with separate functions".

NumPy 2.0 could be a time to move away from separate functions for nan* variants and consolidate into a single API (e.g., mean() with a kwarg). This is the approach now followed by MATLAB, the presumed original inspiration for the nan* variants. Julia avoided NaN variants altogether; instead, delegating to function composition (e.g., mean(skipmissing(...))).

Given the precedence in dataframe-land (e.g., pandas), supporting a skipna or similar kwarg may be most consistent with community conventions and "modern" trends.

seberg · 2023-05-31T06:14:32Z

I am not in favor of making 2.0 an excuse for such a change, but I am happy to entertain it in general and deprecate the nan... functions slowly.

The question is really how it would interact with the ufunc machinery and be implemented in it. Maybe it is close to the existing where= or maybe it needs more thoughts and a concrete proposal.

jorisvandenbossche · 2023-05-31T06:21:26Z

Maybe it is close to the existing where=

You can already do that right now?

In [3]: arr = np.array([1.0, 2.0, np.nan, 4.0])

In [4]: np.sum(arr, where=~np.isnan(arr))
Out[4]: 7.0

In [5]: np.mean(arr, where=~np.isnan(arr))
Out[5]: 2.3333333333333335

seberg · 2023-05-31T06:28:20Z

Not exactly:

In [2]: np.minimum.reduce(arr, where=~np.isnan(arr))
ValueError: reduction operation 'minimum' does not have an identity, so to use a where mask one has to specify 'initial'

the handling around having zero non-nan entries may not be trivial without an identity.

rgommers · 2023-06-15T11:38:15Z

I am not in favor of making 2.0 an excuse for such a change, but I am happy to entertain it in general and deprecate the nan... functions slowly.

I agree with this. nan* is heavily used, so they could be kept for a long time (or forever), to not be too disruptive.

The question is really how it would interact with the ufunc machinery and be implemented in it.

I guess that is relevant from a pragmatic implementation-in-numpy perspective, but should it matter much for this discussion? There's only a subset of reductions (the ones with a binary ufunc equivalent) that are under the hood implemented by forwarding to <ufunc>.reduce (e.g. , min -> minimum.reduce, sum -> add.reduce), but for others like std, mean, ptp that doesn't apply. It seems to me like whatever is decided upon for a public API should be implementable either in the numpy ufunc machinery or as a preprocessing step. The latter is what is already done in pure Python in nanfunctions, so that can be reused, no performance will be lost.

I would be curious to hear the thoughts from numpy folks on this API. That should be done before anything is added to the standard.

Agreed - time to post to the numpy-discussion list perhaps.

Most implementations, whether as a keyword like pandas/Matlab or functional like Julia, only provide boolean options: "skip" or "include". I'll note that Matlab does have multiple flags for treating NA/NaN/NaT differently.

SciPy on the other hand provides a string keyword, with skip/include/raise options. Here is the design doc for the implementation: http://scipy.github.io/devdocs/dev/api-dev/nan_policy.html. The "raise" option also comes up in other places, like screening for nans and infs in linalg functions (a la check_finite), and that's bound to come up in the near future I suspect.

There's also the need for some functions to specify identity value (e.g., with an initial keyword) for some reductions, as pointed out in previous comments. That makes it a one-keyword or two-keyword pattern, depending on whether the identity value is natural or must be user-specified.

Does anyone else see the need to include SciPy's "raise" flavor? I suspect it's better handled with a separate function or keyword, that checks and raises if any nan's and/or inf's are present.

TomNicholas · 2023-12-27T18:19:49Z

Just to say it would be really nice for xarray if this were solved, because practically all our aggregations try to skip NaNs by default by using nan*, so a strictly array-API-compliant type fails on all these aggregations (unless the user opts out of skipping NaNs explicitly).

kgryte · 2024-04-04T06:59:54Z

Of the options discussed thus far, my personal preference is following SciPy in having a "nan_policy" kwarg. The "omit" and "propagate" options map cleanly to nan* and non-nan versions and the "raise" option is useful when wanting to assert against NaNs unintentionally creeping in during preceding calculations.

While decorators as a design pattern are common in Python, we don't have any precedent for the type of decorator/factory function proposed above in prominent array libraries. While that, in and of itself, is not disqualifying, we should consider whether such a proposal would get buy-in, not just in NumPy, but elsewhere in the ecosystem, and whether the path to standardization might be better achieved through more established design patterns already present in array libraries and the current standard.

Thus far, when we've wanted to overload behavior, we've either parameterized through kwargs or we've split into separate functions (e.g., unique*). As a straw example, we haven't, e.g., in the statistical reduction APIs, standardized factory functions for retaining dimensions (e.g., def keepdims_reduction(f: Callable) -> Callable) or similar parameterization. And generally, across the ecosystem, we've seen adoption of kwargs to control na (or nan) behavior; e.g., pandas and skipna or SciPy and nan_policy). We've also seen parameterization used in other languages (e.g., MATLAB).

With particular regard to NumPy, an advantage of SciPy's nan_policy is that it would ensure consistency in NaN handling between the two libraries, thus ensuring users have only one design pattern to remember when moving between NumPy (e.g., np.mean and scipy.stats.hmean). If standardized, this would ensure the mental model is (more or less) the same across both array producing and array consuming libraries (e.g., pandas, SciPy).

NumPy (and others) would, of course, be free to keep the nan* variants for backward compatibility and internally dispatch from np.mean(x, nan_policy="omit") to np.nanmean(x), so the implementation lift is not overly burdensome. The new behavior, of course, would be nan_policy="raise", and there is a question as to whether that is supported by default in NumPy in its ufunc machinary; however, the fact that SciPy can support suggests that this is doable, given SciPy's reliance on ufuncs.

So in short, I'd like to propose that we move to adopt SciPy's precedent of a nan_policy kwarg and add to the following reduction APIs:

min
max
mean
prod
std
sum
var
cumulative_sum

Adding nan_policy support across the ecosystem is likely to take time; however, I think we should be able to implement in the array-api-compat layer using a combination of standardized APIs (e.g., xp.where + xp.mean) and, on a library-by-library basis, already available specialized APIs (e.g., np.nanmean).

rgommers · 2024-04-04T20:16:12Z

I think a keyword would be a very reasonable outcome. What would perhaps make sense to untangle the chicken-and-egg problem here is:

Come to a preliminary conclusion here on the desired API
Open a draft PR with the spec addition - to review/approve, but blocked for merging until we have an implementation at least in NumPy (or a PR that's in good shape)
Someone has to propose the addition for numpy and write the implementation
Once we have both the spec and an implementation and everything seems happy, merge the spec for the next version of the standard

That offers a way forward, while avoiding adding something in the standard that has a potentially large implementation burden (which would help no one, and make future versions of the standard harder to adopt if the work isn't done).

mdhaber · 2024-04-24T18:16:35Z

I've worked on making nan_policy much more consistent in scipy.stats over the past few years. It is now accepted by most reduction functions / hypothesis tests, and all functions now follow the same rules for omit.

I'd agree that a standard nan_policy-like keyword argument would be appreciated, especially for the choice between behaviors like propagate, raise, and (sometimes requested) warn. However, I think it's unfortunate that nan is overloaded to mean both "undefined output" and "missing input", since they may need to be treated differently. Ideally, I would suggest (in lieu of a separate sentinel value) the use of a mask for missing data rather than an omit option. But I recognize that the array-API cannot necessarily change the way programmers use nan, so an omit option may be considered essential.

Either way, I think it's relevant to mention here: in scipy/scipy#20363, I drafted a function that accepts an array-API compatible namespace and returns a (nearly) array-API compatible masked array namespace (composed of calls from the provided namespace). This could be useful either as a default implementation for the nan_policy='omit' option in array-api-compat or as a way to avoid the need for all libraries to implement an omit behavior.

betatim mentioned this issue Apr 19, 2023

Make more of the "tools" of scikit-learn Array API compatible scikit-learn/scikit-learn#26024

Open

tomwhite mentioned this issue Apr 20, 2023

array api - Add tests for aggregations pydata/xarray#7424

Draft

4 tasks

rgommers added the API extension Adds new functions or objects to the API. label May 18, 2023

kgryte mentioned this issue Jun 29, 2023

Tracking issue for the 2023 revision of the array API specification #643

Closed

17 tasks

rgommers mentioned this issue Aug 2, 2023

ENH: Overhaul of NumPy main namespace [NEP 52] numpy/numpy#24306

Closed

kgryte mentioned this issue Oct 19, 2023

NamedArray tracking issue pydata/xarray#8238

Open

20 tasks

keewis mentioned this issue Mar 14, 2024

Missing array-api support for some stats functions? pydata/xarray#8834

Open

5 tasks

kgryte changed the title ~~How to translate nanmin and friends to the Array API?~~ RFC: add support for nan* reductions Apr 4, 2024

kgryte added the RFC Request for comments. Feature requests and proposed changes. label Apr 4, 2024

kgryte added the Needs Discussion Needs further discussion. label Apr 4, 2024

kgryte added this to the v2024 milestone Apr 4, 2024

ogrisel mentioned this issue Jun 6, 2024

RFC: add quantile? #795

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: add support for `nan*` reductions #621

RFC: add support for `nan*` reductions #621

betatim commented Apr 18, 2023

rgommers commented Apr 18, 2023

ogrisel commented Apr 20, 2023

rgommers commented Apr 21, 2023

rgommers commented May 18, 2023 •

edited

Loading

ogrisel commented May 19, 2023

rgommers commented May 19, 2023

ogrisel commented May 22, 2023

ogrisel commented May 22, 2023

betatim commented May 22, 2023

rgommers commented May 23, 2023

asmeurer commented May 23, 2023 •

edited

Loading

rgommers commented May 24, 2023

asmeurer commented May 24, 2023

leofang commented May 31, 2023

kgryte commented May 31, 2023 •

edited

Loading

seberg commented May 31, 2023

jorisvandenbossche commented May 31, 2023

seberg commented May 31, 2023

rgommers commented Jun 15, 2023

TomNicholas commented Dec 27, 2023

kgryte commented Apr 4, 2024 •

edited

Loading

rgommers commented Apr 4, 2024

mdhaber commented Apr 24, 2024 •

edited

Loading

RFC: add support for nan* reductions #621

RFC: add support for nan* reductions #621

Comments

betatim commented Apr 18, 2023

rgommers commented Apr 18, 2023

ogrisel commented Apr 20, 2023

rgommers commented Apr 21, 2023

rgommers commented May 18, 2023 • edited Loading

ogrisel commented May 19, 2023

rgommers commented May 19, 2023

ogrisel commented May 22, 2023

ogrisel commented May 22, 2023

betatim commented May 22, 2023

rgommers commented May 23, 2023

asmeurer commented May 23, 2023 • edited Loading

rgommers commented May 24, 2023

asmeurer commented May 24, 2023

leofang commented May 31, 2023

kgryte commented May 31, 2023 • edited Loading

seberg commented May 31, 2023

jorisvandenbossche commented May 31, 2023

seberg commented May 31, 2023

rgommers commented Jun 15, 2023

TomNicholas commented Dec 27, 2023

kgryte commented Apr 4, 2024 • edited Loading

rgommers commented Apr 4, 2024

mdhaber commented Apr 24, 2024 • edited Loading

RFC: add support for `nan*` reductions #621

RFC: add support for `nan*` reductions #621

rgommers commented May 18, 2023 •

edited

Loading

asmeurer commented May 23, 2023 •

edited

Loading

kgryte commented May 31, 2023 •

edited

Loading

kgryte commented Apr 4, 2024 •

edited

Loading

mdhaber commented Apr 24, 2024 •

edited

Loading