Array API standard and Numpy compatibility #400
This issue is created as a continuation of numpy#21135, at the request of mattip.
The idea is to discuss, between the array API standard and NumPy communities, how to write code that is compatible with both the array API standard and NumPy functionality, in order to avoid code duplication and facilitate the move towards the standard.
Comments
Thanks for opening this issue here @vnmabus. I agree with your comments on the NumPy issue that it's more a NumPy topic/problem than an API standard one. However, visibility is good because (a) this is going to be important to resolve since it may present a hurdle to adoption, and (b) other libraries like CuPy and Dask are copying NumPy's approach (not surprising, that's their general API design approach). Let me add a short summary and a few cross-links here. I suggest we continue the discussion around those points here.
I wonder if we could at least extend the Array API standard to allow for arrays with novel dtypes for storage but not computation. For example, you could convert a NumPy object array into an array-API-compatible array for storage. This would be quite helpful for downstream libraries like Xarray that do need basic operations on object arrays, e.g., to handle strings. The actual string-specific computation could be performed on NumPy arrays, of course, but it would be nice to be able to switch the core manipulation routines to use array standard APIs.
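A rough sketch of what that storage-vs-computation split could look like, using only plain NumPy; the integer-codes trick is just one illustrative approach, not anything the standard or Xarray prescribes:

```python
import numpy as np

# Store the strings in a plain NumPy array, but factor them into integer
# "codes" so the generic manipulation (reshaping, indexing, ...) could go
# through any standard-compliant namespace; strings reappear only at the end.
labels = np.asarray(["foo", "bar", "foo", "baz"])
uniques, codes = np.unique(labels, return_inverse=True)

codes = codes.reshape(2, 2)   # portable, dtype-agnostic manipulation
result = uniques[codes]       # back to strings at the NumPy boundary
print(result)                 # [['foo' 'bar'] ['foo' 'baz']]
```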
I think this is always allowed? Every existing library is going to provide a superset of the standard, and we don't require that exceptions are raised for things that are not included in the standard. So string/object dtypes should be perfectly fine. Given that object or string dtypes are mostly numpy-specific, I don't see how it would help to explicitly name them in the standard though.
So this is another issue: it's whether the minimal numpy.array_api implementation should reject dtypes that fall outside the standard.
@shoyer I followed up on this in numpy/numpy#21135 (comment). Thoughts there would be much appreciated.
@asmeurer this issue is basically the existing discussion on the two points. I will summarize my view here:
So, we need a non-minimal namespace that does not need to be fully compatible with the array API. We could make that the main NumPy namespace eventually, but that will take time. So, as an alternative, maybe it should be a separate compatibility namespace. That could even live inside NumPy. The other point is the coercion behavior: there would need to be some best practices (i.e. get consent from the library you wish to coerce), and I would do no actual coercion if that consent is not given.
I guess this was a reference to a conversation at SciPy'22? I feel like I'm missing some context or assumptions made in this reply.
The unpack/repack is annoying and not desired, but I would not a priori say it's not useful to do so (EDIT: as of right now; of course if NumPy had array API support in its main namespace, that'd be much better). If, as an author of an array-consuming library (e.g., scikit-learn), you want to (a) support multiple array libraries and (b) not have duplicate code paths in many places, then you kinda need this namespace and unpack/repack. Good point regarding "normal promotion" being different - that is not portable and hence not supportable in a code path that supports multiple array libraries. So you should not rely on that - use explicit casts instead of cross-kind casting.
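A minimal sketch of this unpack/repack pattern (the helper name is hypothetical; the scikit-learn and SciPy adoption PRs used very similar helpers):

```python
import numpy as np

def get_namespace(x):
    # Standard-compliant arrays advertise their namespace; fall back to
    # NumPy for plain ndarrays.
    if hasattr(x, "__array_namespace__"):
        return x.__array_namespace__()
    return np

def normalize(x):
    xp = get_namespace(x)                 # "unpack": pick the namespace once
    x = xp.asarray(x, dtype=xp.float64)   # explicit cast, no cross-kind promotion
    return x / xp.max(xp.abs(x))          # result stays in the caller's array type
```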
Agreed
This I don't quite get. You are actively trying to change/improve the NumPy promotion rules because they are indeed, well, wonky:) Maybe you're reasoning from a different goal here than writing portable code in a downstream library, but I think casting rules in any namespace you want to use here should be compliant.
This is something I disagree with as a goal. There's a reason that no other array library allows this; it's hard to even be precise about what this means, and it's an endless source of problems and bug reports. What CuPy, PyTorch, JAX et al. all do is better imho. Having to be explicit about conversions and not mixing different array objects directly is healthy, and not a missing feature.
but it doesn't?
Yes, it was a reference to SciPy'22, although I hope it isn't particularly detached from this thread; all of these things have been mentioned before.
Well, you might be able to if it were not a minimal implementation. The main point is that libraries should not have to make (many) backwards-compatibility breaks for existing NumPy code. Even rare/strange BC issues seem bound to hinder adoption by existing libraries quite seriously, and the minimal implementation has a lot of BC limitations compared to vanilla NumPy.
The array-API already leaves many promotions as undefined. So you already get different dtypes/precision results out depending on what you pass in (if you mix dtypes). The current NumPy promotion just adds a few quirks to that. So the user already needs to be a bit careful about promotion if they globally swap out the array-object. I am not convinced that there is a problem with being pragmatic about it. Especially from a library adoption point of view.
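A concrete illustration of that undefined mixed-kind case, with NumPy's current behavior shown; other libraries may legitimately differ:

```python
import numpy as np

# Mixed-kind promotion (int with float here) is left undefined by the
# standard, so the result dtype is library-specific: NumPy's rules give
# float64 below, while e.g. PyTorch returns float32 for the same mix.
x = np.asarray([1], dtype=np.int64)
y = np.asarray([1.0], dtype=np.float32)
print((x + y).dtype)   # float64 under NumPy's promotion rules
```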
I am reasoning purely from the goal of getting portable code into existing downstream libraries. The point is that the dtype promotion rules don't matter much in practice. For downstream/sklearn it doesn't matter if they get a "wrong" namespace as long as they get unchanged behavior. That is what both the SciPy and the sklearn PRs that tried to adopt the array API ended up doing, so it seems like a pragmatic approach. They just did it independently and "incrementally" (e.g. only adding the functions they actually needed). In other words: libraries need a namespace that gives easy NumPy compatibility right now (with as few backward-compatibility concerns as possible). This would give the library the ability to have a single code path that supports NumPy (unmodified) and any API-compatible object. Yes, that might be explicitly not "compliant" for NumPy arrays.
I don't care too much about this, but …
For the most part, I have come around to @rgommers's perspective of preferring explicit promotion. This is the way JAX handles things, for example, and it really is a joy to be able to reliably compose different array types. (I similarly prefer explicit broadcasting.) In JAX, it is easy to avoid implicit array promotion because different array types are created via transformations (at a single function call) rather than by creating array objects separately. This works great for "computation-oriented" code like most deep learning programs, and means you always have a well-defined "order" in which array wrapping should happen. On the other hand, libraries like Xarray that are focused on "data-oriented" use cases don't have the luxury of being able to use transformations for handling different array types. Implicit casting definitely seems more appealing here (so users can write more generic code), but it still has some serious scaling issues if more than a few array types are involved. For example, users want to be able to compose stacks of wrapped arrays like cupy/dask/pint inside Xarray objects (pydata/xarray#5648). I'm still not entirely sure what the right long-term fix looks like, though leaving it up to each array library separately with ad-hoc protocols definitely has its problems.
Completely agreed, the fewer changes they have to make the better.
Okay, we're on the same page there. And with my SciPy maintainer hat on, I'm very much interested in the details of how that looks too.
This is probably not true beyond NumPy usage (pass in mixed-type PyTorch tensors, for example, and it won't work), and is where the standard's casting rules come in.
Thanks @seberg, yes this makes sense to me and is quite important to have. Something to also consider here, given the "right now", is that it's probably pointless for this namespace to live in NumPy itself - it could not be used unconditionally for several years, given the need to support older NumPy versions. Having that "compat namespace" in scikit-learn, SciPy etc. avoids this. So probably what is needed to achieve this is a separate package that can simply be vendored into each library that needs it.
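For the record, such a separate vendorable package is essentially what later materialized as array-api-compat. A usage sketch (the `center` function is just an example):

```python
from array_api_compat import array_namespace

def center(x):
    # For np.ndarray this returns a NumPy-flavoured compatibility wrapper;
    # for CuPy, PyTorch, etc. it returns their (wrapped) native namespace.
    xp = array_namespace(x)
    return x - xp.mean(x)
```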
Sorry for resurrecting this for a specific question, but what would some sort of "string support" look like for array-API-compliant code? For example, as things stand, strings are just outright not supported:

```python
from numpy import array_api as np

np.asarray(['foo', 'bar'])
```

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 1
----> 1 np.asarray(['foo', 'bar'])

File ~/Projects/Theis/array-api-tests/venv/lib/python3.10/site-packages/numpy/array_api/_creation_functions.py:72, in asarray(obj, dtype, device, copy)
     70         raise OverflowError("Integer out of bounds for array dtypes")
     71     res = np.asarray(obj, dtype=dtype)
---> 72     return Array._new(res)

File ~/Projects/Theis/array-api-tests/venv/lib/python3.10/site-packages/numpy/array_api/_array_object.py:81, in Array._new(cls, x)
     79     x = np.asarray(x)
     80     if x.dtype not in _all_dtypes:
---> 81         raise TypeError(
     82             f"The array_api namespace does not support the dtype '{x.dtype}'"
     83         )
     84     obj._array = x
     85     return obj

TypeError: The array_api namespace does not support the dtype '<U3'
```

To me this makes sense given the API specification and its intention. My reaction was similar to @shoyer's: that the spec would need to be extended to support this somehow. However, @rgommers seems to suggest otherwise.
I could see allowing the few operations that make sense for strings from the array API, like addition = string concatenation (although a lot of this currently lives in separate NumPy routines rather than in operators).
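For illustration, here is where such operations live in NumPy today: outside any standard namespace, e.g. in the np.char routines.

```python
import numpy as np

# Concatenation exists for string arrays, but only via NumPy-specific
# routines, not via the array API standard:
a = np.asarray(["foo", "bar"])
b = np.asarray(["_1", "_2"])
print(np.char.add(a, b))   # ['foo_1' 'bar_2']
```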
The answer is different depending on what you mean by "array-api-compliant". If you mean 'how could an array-provider library support both the array API and string dtypes', then that is already possible. See numpy/numpy#25542 (and NEP 55, I suppose!). If you mean 'how could an array-consumer library use string dtypes with arrays, regardless of which array type it gets (as long as it conforms to the standard)', then yes, the standard would have to be extended. This would likely have to be an optional extension, though (note that the only extensions so far are just extra namespaces of functions; this would be a bit more involved given the extra dtypes), as Ralf said above.
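On the provider side, the NEP 55 work referenced above shipped in NumPy 2.0 as a variable-width string dtype; a small sketch (requires NumPy >= 2.0):

```python
import numpy as np
from numpy.dtypes import StringDType

# NumPy >= 2.0: a variable-width UTF-8 string dtype that coexists with the
# array-API-compliant main namespace, even though the standard itself
# defines no string dtypes.
arr = np.array(["foo", "a much longer string"], dtype=StringDType())
arr[0] = "can grow beyond the original width"   # unlike fixed '<U' dtypes
print(arr.dtype)   # StringDType()
```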
The first question is whether libraries other than NumPy would implement it. The main purpose of the standard is interoperability between libraries.
So what it sounds like is that while string arrays can work with libraries that are array-api compliant, the actual array container for a string array will not be array-api compliant. Is this accurate? Thanks for the replies!
Since string dtypes are not in the standard, it is slightly confusing to ask whether a certain container would be compliant or not. The tl;dr is: a library's array object may support string dtypes as an extra on top of the standard, but because the standard does not define them, that support is not portable across libraries.
Right, this was my impression. So it sounds like the answer is basically "no" at the moment to whether one can create an interoperable container, but "yes" to whether a library (which has an array-API-compliant Array object) can handle them.
Yes, and said array-API-compliant Array object is allowed to have additional attributes and methods to support features like this (it just isn't required or guaranteed to).
Thanks! Really appreciate the fast replies (making my life infinitely easier).
I think this issue can be closed now that NEP 56 was accepted!
Good point @lucascolley, closing. Thanks everyone!