Add Python API, semantics and implementation details for DLPack #106
Conversation
This duplicates the change in the complex64/128 PR.
- **stream**: _Optional\[int\]_
  - If given, the CUDA or ROCm stream number the consumer will use. Default is `None`, which means the legacy default stream.
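For illustration, a consumer might request stream ordering like this (a sketch assuming a CuPy-like producer that implements this protocol; `s.ptr` is CuPy's raw stream handle):

```python
import cupy as cp

# Producer array and a consumer-side stream of interest.
x = cp.arange(10)
s = cp.cuda.Stream(non_blocking=True)

# The consumer hands its stream number to the producer, which is then
# responsible for making the data safe to use on that stream.
capsule = x.__dlpack__(stream=s.ptr)
```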
Note that this is different from how `stream` is specified in https://numba.readthedocs.io/en/latest/cuda/cuda_array_interface.html#python-interface-specification. I actually don't understand that spec: it says that for `None` no synchronization is needed, it uses 1/2 for the legacy/per-thread default streams, and other integers for non-default streams. Which seems odd: what if the stream number of a non-default stream in use is 2, for example?

Using:

- `None`: legacy default stream
- `0`: per-thread default stream
- `1, 2, ...`: non-default stream numbers

seems to make more sense. @leofang am I missing something there?
@rgommers `None` is actually confusing across libraries. For example, in Numba `None` means there is no Numba default stream (to be distinguished from "CUDA's (whichever) default stream"), whereas in CuPy `None` simply refers to CUDA's default stream, which in turn is `1` (the legacy default stream), though we're in the process of adopting `2` (the per-thread default stream). `0` is not acceptable either, for the same reason: its semantics are unclear, depending on how the libraries containing CUDA code are compiled and on the runtime behavior defined in the Python hooks.

Note that in CUDA you don't get to choose the stream numbers: CUDA macro-defines `1` for `cudaStreamLegacy` and `2` for `cudaStreamPerThread`, which CAI v3 followed. Any user/non-default stream created via `cudaStreamCreate()` is guaranteed to start at `3` or later. (In fact the CUDA driver reserves a stream pool internally, so the actual start number is well past `3`.) I hope this makes CAI v3 clearer to you.

I'll try to catch up on the rest of the discussions here, as well as in the DLPack repo, after Monday (tomorrow)... 😅
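For reference, the numbering convention described above could be summarized as follows (an illustrative helper only, following CAI v3):

```python
# Reserved CUDA stream handles (macro-defined by CUDA; not user-choosable):
CUDA_STREAM_LEGACY = 1      # cudaStreamLegacy
CUDA_STREAM_PER_THREAD = 2  # cudaStreamPerThread

def describe_stream(stream):
    """Interpret a stream value under the CAI v3 convention."""
    if stream is None:
        return "no synchronization needed"
    if stream == CUDA_STREAM_LEGACY:
        return "legacy default stream"
    if stream == CUDA_STREAM_PER_THREAD:
        return "per-thread default stream"
    return "user-created stream"  # cudaStreamCreate() handles start at 3 or later
```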
btw, fun fact: in HIP, syncing over stream `1` or `2` leads to a segfault, as HIP does not support them: cupy/cupy#4458 (comment).
Thanks for the explanation and links @leofang. I updated it to match `__cuda_array_interface__`.
Thanks, @rgommers. Apologies, I realized my first sentence wasn't complete; it should have been:

> `None` is actually confusing across libraries, so we decided to use it to mark the situation "there is no need to do any kind of synchronization for whatever reason, just take the device pointer and do your work".

This matches `__cuda_array_interface__`.
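Under that convention, a producer's `__dlpack__` could look roughly like this (a sketch; both helpers are hypothetical):

```python
def __dlpack__(self, stream=None):
    # Hypothetical producer implementation of the convention above.
    if stream is not None:
        # Make this array's pending work visible to the consumer's stream.
        self._order_after_producer_stream(stream)  # hypothetical helper
    # stream=None: no synchronization of any kind; just export the pointer.
    return self._make_dlpack_capsule()             # hypothetical helper
```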
I think this is complete enough for now. dmlc/dlpack#57 is converging. I'd like to merge this PR if everyone is happy with that, so it shows up in the rendered html docs. And then we can focus on summarizing dmlc/dlpack#57 in that repo; I think that should cover the topic.
I would like to render the document and read through it; please allow me a couple of hours.
I'll try to revisit this tonight.
I still think it might be useful to map `None` to the default stream. For example, when the producer's data comes from a non-default stream, the producer needs to sync the data to the default stream in order for it to be consumed. No syncing in such a case would lead to undefined behavior. The default behavior of always syncing to the default stream would be a good one.
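For illustration, this producer-side synchronization could be implemented with a stream-ordered event rather than a blocking sync (a sketch using CuPy's event API; nothing here is prescribed by the spec):

```python
import cupy as cp

def order_consumer_after_producer(producer_stream, consumer_stream_ptr):
    # Record an event at the point where the producer's data is ready...
    event = cp.cuda.Event()
    event.record(producer_stream)
    # ...and make the consumer's stream wait on it. This is asynchronous:
    # no host-side blocking is involved.
    consumer_stream = cp.cuda.ExternalStream(consumer_stream_ptr)
    consumer_stream.wait_event(event)
```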
Thanks @tqchen, makes sense. If that'd really be confusing with Numba as @leofang says, the alternative may be to spell out in the spec exactly what `None` maps to, rather than leaving it implicit. That may be a bit more explicit.
Thanks @rgommers. I believe the main purpose of a default behavior is to serve as a recommended starting point for developers. Option 1 means we do not have a recommendation for developers in the CUDA/ROCm case. On the other hand, there is a common starting point for developing GPU kernels: I learned GPU programming by launching kernels without the stream argument, in which case the behavior relies on the legacy default stream. Although stream 0 has some ambiguity in CUDA, as @leofang mentioned, the default behavior of 0 still falls back to the legacy default stream. Additionally, it seems that ROCm also currently only supports the legacy default stream behavior.

The above reasons still favor option 3: from the experience of CUDA/ROCm development, there is a default starting point (using the legacy stream), and having the default match that starting point is good. It also makes the default API work for the CPU, CUDA and ROCm cases.
Sorry, @tqchen, I don't get it. I thought that, having gone through the long discussion in dmlc/dlpack#57, we were finally close to an agreement that we sync on the Consumer's stream, but what you're saying above runs contrary to everything we've discussed. IIUC you're saying both the Producer and the Consumer need to sync once (the former with the default stream, and the latter with its own stream), or did I misread something?
I don't think so, nothing really changes. The only difference is whether we let people who do use the legacy default stream say so explicitly.
Right, I think we have agreed on the API convention. The main discussion point is how we specify the recommended default behavior.
Nit: -> "DLPack describes the memory layout of strided, n-dimensional arrays."
So CuPy's `__dlpack__` would raise an exception when the consumer cannot consume GPU data? The only way out then is to rely on CuPy's own conversion utilities?
That's a good question @oleksandr-pavlyk. When the producer and consumer don't overlap in device support at all, then indeed there's little that can be done other than raise an error. Most libraries have a library-specific way to do such a transfer explicitly.

There is one case where things could work in principle, but don't: when the producer supports (e.g.) CPU and GPU, and the consumer only CPU. Then for GPU arrays it doesn't work today, but it could be made to work (with a device transfer) if the consumer could signal its device support to the producer. The stance this protocol takes is that the exchange must be zero-copy, or else only the consumer library can decide to copy (though that is recommended against). We could make the discussion of this in the Semantics section a little more extensive.
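A sketch of the consumer-side check this implies (device-type values follow DLPack's `DLDeviceType` enum of the time; `SUPPORTED_DEVICES`, `_wrap_capsule`, and the exception type are made-up illustrations):

```python
# DLPack device type codes (subset; values per dlpack.h at the time):
kDLCPU, kDLGPU, kDLROCM = 1, 2, 10

SUPPORTED_DEVICES = {kDLCPU}  # say, a CPU-only consumer library

def from_dlpack(x):
    device_type, device_id = x.__dlpack_device__()
    if device_type not in SUPPORTED_DEVICES:
        # No silent device transfer: the user should convert explicitly.
        raise RuntimeError("unsupported device; transfer the data explicitly")
    return _wrap_capsule(x.__dlpack__())  # hypothetical zero-copy wrapper
```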
It may be good to ask libraries to implement an explicit device-transfer method for this case, then.
I am not sure I am very clear on the warning about copying; I thought a device transfer would be the natural fallback.
See data-apis/consortium-feedback#1 (comment) and comments below it. |
In other places I think we have favoured raising exceptions as well, rather than doing device transfers. Once you get an error as a user, doing the manual transfer is probably fine with a library-specific method.
Ah OK @rgommers @tqchen. Sorry for being nerve-wracked; it's just that the quoted statement does not look right to me, and so any conclusion for the default drawn from it is not sound. I disagree on two counts here.
A different question from the above discussion: are we focusing only on CUDA/ROCm in this PR and leaving other architectures like OpenCL etc. for the future, so that we can talk about streams?
Thanks @leofang. The current API design allows the producer/consumer to pass in streams and implement the optimization mentioned in this point. So it won't block any of the features; the producer and consumer are free to implement your proposal by passing in an explicit stream of interest and checking the stream.

The main topic of interest, however, is to also recommend a default for developers who want simpler implementations (e.g. a program that only works on the default stream, since that is a common way to get started). For those developers, having `None` default to the legacy stream makes sense.

Additionally, I believe we are saying "`None` defaults to the (legacy) default stream". In CUDA that means 1 (or 0 when you use the default flag in nvcc), and in ROCm it means 0. In short, it is the stream used when you launch a kernel without a stream argument. Having such a choice won't block any of the more advanced use-cases being mentioned, but it would help beginners adopt things without worrying about the choice of the stream argument.
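A sketch of that recommended default on the producer side (the helper name is made up):

```python
CUDA_LEGACY_DEFAULT = 1  # cudaStreamLegacy; ROCm's default stream is 0

def _resolve_stream(stream):
    # Recommended default: treat None as the legacy default stream,
    # matching what an un-streamed kernel launch uses.
    if stream is None:
        return CUDA_LEGACY_DEFAULT
    return stream
```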
Thanks @rgommers, the proposal looks good to me.
One last minor thing on establishing the stream ordering: can we provide an env var for the Producer to disable any synchronization? The most notable example is mpi4py, in which we have no control over any CUDA functionality (it's handled internally by either UCX or the underlying MPI library), so as a Consumer we cannot provide any stream to establish the stream order. Any synchronization (in both the MPI and CUDA senses), if needed, must be done explicitly by the Users.
Your rationale for a backdoor makes sense, but can we just use a sentinel value of the `stream` argument for that, rather than an env var?
@rgommers You meant the Consumer should set it to -1 and the Producer should explicitly check whether -1 is given, right? Fine with me. An env var gives Users a handle to control the behavior, but if a User forgets to set it, they're doomed, so it's a tradeoff about which I have no strong opinion.
Yes indeed.
I'm actually not quite sure I understand this. Can you think of cases where an array library will be used both in settings where it does and doesn't know whether the user must be involved in stream handling? In that case the library must expose some API to the user to control its stream handling, which could be an env var or a regular API (e.g. a dedicated function).
As discussed in yesterday's call, I was mainly thinking that an env var could give superusers a knob to tweak the behavior, but it's probably better to just restrict this to the needs of library implementors, for whom passing `stream=-1` is enough.
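So a consumer in the mpi4py situation could do something like this (a sketch, using `-1` as suggested above; `wrap` is a made-up consumer-side helper):

```python
def consume_without_streams(x):
    # We cannot offer a stream (CUDA work happens out of our control,
    # e.g. inside an MPI library), so ask the producer to skip any
    # synchronization; ordering becomes the user's responsibility.
    capsule = x.__dlpack__(stream=-1)
    return wrap(capsule)  # hypothetical consumer-side wrapper
```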
I haven't done a full review pass yet, will do so asap, but in case I forget: I think we'd better set a minimal requirement for `DLPACK_VERSION`, so that everyone imports a recent enough header.

cc: @tqchen
Given that …
The one other comment on the new …
Updated the PR for all remaining comments. I'd like to merge this by the end of the weekend, and if there are more comments after that, do a follow-up PR. Reason: this should be visible in the html docs, and we're about to send the adoption proposal for NumPy out for review, where I'm sure there'll be some discussion around DLPack.
LGTM except for two minor points.
Also fix a couple of small textual things.
Okay, in it goes. Thanks @tqchen, @leofang, @kkraus14 and @oleksandr-pavlyk!
Related to the discussion at data-apis/consortium-feedback#1; this addresses all open items in that discussion. It's kind of tricky to figure out what to add here though, and what to refer to DLPack itself for.
We should upstream some of this documentation to DLPack as well, so this becomes a summary. Right now DLPack doesn't have any documentation on Python-level API, and some of the other content here isn't very clearly documented yet either (it mostly came from explanations of @tqchen in the issue linked above).
Given that reviewers may not want to build the Sphinx docs, here's a screenshot of the last half (which has visuals).