
Add Python API, semantics and implementation details for DLPack #106

Merged 12 commits into data-apis:main on Feb 21, 2021

Conversation

@rgommers (Member, Author) commented Dec 30, 2020

Related to the discussion at data-apis/consortium-feedback#1; this addresses all open items in that discussion. It's somewhat tricky to figure out what to add here, though, and what to refer to DLPack itself for.

We should upstream some of this documentation to DLPack as well, so this becomes a summary. Right now DLPack doesn't have any documentation of its Python-level API, and some of the other content here isn't very clearly documented yet either (it mostly came from explanations by @tqchen in the issue linked above).

Given that reviewers may not want to build the Sphinx docs, here's a screenshot of the last half (which has visuals):

[screenshot of the rendered docs]


- **stream**: _Optional[int]_

  - If given, the CUDA or ROCm stream number the consumer will use. Default is `None`, which means the legacy default stream.
@rgommers (Member, Author) commented:
Note that this is different from how `stream` is specified in https://numba.readthedocs.io/en/latest/cuda/cuda_array_interface.html#python-interface-specification. I actually don't understand that spec: it says that `None` means no synchronization is needed, uses 1/2 for the legacy/per-thread default streams, and uses other integers for non-default streams. That seems odd: what if a non-default stream in use happens to have the number 2, for example?

Using:

  • None: legacy default stream
  • 0: per-thread default stream
  • 1, 2, ...: non-default stream numbers

seems to make more sense. @leofang, am I missing something there?

@leofang (Contributor) commented:
@rgommers None is actually confusing across libraries. For example, in Numba, None refers to Numba's default stream (to be distinguished from "CUDA's (whichever) default stream"), whereas in CuPy, None simply refers to CUDA's default stream, which in turn is 1 (the legacy default stream), though we're in the process of adopting 2 (the per-thread default stream). 0 is not acceptable either, for the same reason: its meaning depends on how the libraries containing CUDA code are compiled and on the runtime behavior defined in the Python hooks.

Note that in CUDA you don't get to choose the stream numbers: CUDA macro-defines 1 for cudaStreamLegacy and 2 for cudaStreamPerThread, which CAI v3 followed. Any user/non-default stream created via cudaStreamCreate() is guaranteed to start at 3 or later. (In fact, the CUDA driver reserves a stream pool internally, so the actual start number is well past 3.) I hope this makes CAI v3 clearer.

I'll try to catch up on the rest of the discussion here, as well as in the DLPack repo, after Monday (tomorrow)... 😅
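A minimal sketch of the CAI v3 stream encoding described above, to make the numbering concrete; the constant and function names here are illustrative only, not part of any spec:

```python
# Illustrative sketch of the CUDA Array Interface v3 stream encoding;
# names are made up for clarity.
LEGACY_DEFAULT_STREAM = 1      # cudaStreamLegacy
PER_THREAD_DEFAULT_STREAM = 2  # cudaStreamPerThread

def describe_cai_stream(stream):
    """Interpret the `stream` value a producer receives under CAI v3."""
    if stream is None:
        return "no synchronization is needed"
    if stream == 0:
        raise ValueError("0 is disallowed: ambiguous between the legacy "
                         "and per-thread default streams")
    if stream == LEGACY_DEFAULT_STREAM:
        return "synchronize on the legacy default stream"
    if stream == PER_THREAD_DEFAULT_STREAM:
        return "synchronize on the per-thread default stream"
    return f"synchronize on user stream {stream} (user streams start at 3+)"
```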

@leofang (Contributor) commented:

By the way, fun fact: in HIP, syncing over stream 1 or 2 leads to a segfault, as HIP does not support them: cupy/cupy#4458 (comment).

@rgommers (Member, Author) commented:

Thanks for the explanation and links @leofang. I updated it to match `__cuda_array_interface__`.

@leofang (Contributor) commented:

Thanks, @rgommers. Apologies, I realized my first sentence wasn't complete; it should have been:

None is actually confusing across libraries, so we decided to use it to mark the situation "there is no need to do any kind of synchronization for whatever reason, just take the device pointer and do your work".

This matches `__cuda_array_interface__`.
@rgommers added the "RFC: Request for comments. Feature requests and proposed changes." label on Jan 12, 2021
@rgommers (Member, Author) commented:

I think this is complete enough for now, and dmlc/dlpack#57 is converging. I'd like to merge this PR if everyone is happy with that, so it shows up in the rendered HTML docs. Then we can focus on summarizing dmlc/dlpack#57 in that repo; I think that should cover the topic.

@oleksandr-pavlyk (Contributor) commented:

I would like to render the document and read through it. Please allow me a couple of hours.

@leofang (Contributor) commented Jan 12, 2021

I'll try to revisit this tonight.

@tqchen commented Jan 13, 2021

I still think it might be useful to map None to the (legacy) default stream (the default configuration of CUDA/ROCm's stream 0), since "no synchronization" can be ambiguous in the GPU case.

For example, when the producer's data comes from a non-default stream, the producer needs to sync the data to the default stream in order for it to be consumed. Not syncing in that case would lead to undefined behavior. Always syncing to the default stream would be a good default behavior.

@rgommers (Member, Author) commented Jan 13, 2021

Thanks @tqchen, makes sense. If that'd really be confusing with Numba as @leofang says, the alternative may be to specify that stream must not be None if the consumer supports CUDA/ROCm.

1. None: CUDA/ROCm streams are not supported,

instead of

2. None: no synchronization (default),

or

3. None: legacy default stream (default).

That may be a bit more explicit.

@tqchen commented Jan 13, 2021

Thanks @rgommers.

I believe the main purpose of a default behavior is to serve as a recommended starting point for developers. Option 1 means we have no recommendation for developers in the CUDA/ROCm case.

On the other hand, there is a common starting point for developers writing GPU kernels: I learned GPU programming by launching kernels without a stream argument, in which case the behavior relies on the legacy default stream. Although stream 0 is ambiguous in CUDA, as @leofang mentioned, the default behavior of 0 still falls back to the legacy default stream. Additionally, it seems that ROCm also currently only supports the legacy default stream behavior.

These reasons still favor option 3: from the experience of CUDA/ROCm development, there is a default starting point (using the legacy stream), and having the default match that starting point is good. It also makes the default API work for the CPU, CUDA, and ROCm cases alike.

@leofang (Contributor) commented Jan 13, 2021

For example, when the producer's data comes from a non-default stream, the producer needs to sync the data to the default stream in order for it to be consumed. Not syncing in that case would lead to undefined behavior. Always syncing to the default stream would be a good default behavior.

Sorry, @tqchen, I don't get it. I thought that, having gone through the long discussion in dmlc/dlpack#57, we were finally close to an agreement that we sync over the Consumer's stream, but what you're saying above is contrary to everything we've discussed. IIUC you're saying both the Producer and the Consumer need to sync once (the former with the default stream, the latter with its own stream), or did I misread something?

@rgommers (Member, Author) commented:

what you're saying above is contrary to everything we've discussed.

I don't think so; nothing really changes. The only difference is whether we let people who do use the legacy default stream say stream=None as an alias for stream=1, or force them to say stream=1.

@tqchen commented Jan 13, 2021

Right, I think we have agreed on the API convention. The main discussion point is how we specify the recommended default behavior.

@oleksandr-pavlyk (Contributor) commented:

Nit:

DLPack describe the memory layout of strided, n-dimensional arrays.

-> "DLPack describes the memory layout of strided, n-dimensional arrays."

@oleksandr-pavlyk (Contributor) commented Jan 13, 2021

So `from_dlpack` is implemented by the "consumer", and `__dlpack__` is implemented by the "producer". Say the producer is CuPy, and the consumer is NumPy.

CuPy's `__dlpack__` will rightfully return a capsule with the DLDeviceType set to a CUDA GPU code. A CPU-only consumer, i.e. NumPy, will not know what to do with this data and must raise an exception.

The only way out is to rely on CuPy's to_numpy function. The standard should specify how to deal with this scenario. Have I missed it?

@rgommers (Member, Author) commented:

That's a good question @oleksandr-pavlyk. When the producer and consumer don't overlap in device support at all, then indeed there's little that can be done other than raise an error. Most libraries have a to_numpy-like method, but that's clearly not appropriate for a standard (it's specific to one library, so why add to_numpy and not to_jax?).

There is one case where things could work in principle but currently don't: when the producer supports (e.g.) both CPU and GPU, and the consumer only CPU. For GPU arrays this fails today, but it could be made to work (with a device transfer) if the consumer could signal its device support to the producer.

The stance this protocol takes is that the exchange must be zero-copy, or else only the consumer library may decide to copy (though that is recommended against).

We could make the discussion of that in the Semantics section a little more extensive.
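A minimal consumer-side sketch of that "raise rather than transfer" stance; the names below are placeholders for illustration, not part of the standard:

```python
# Hypothetical CPU-only consumer: raise instead of transferring when the
# producer's device is unsupported. kDLCPU == 1 in DLPack's DLDeviceType.
SUPPORTED_DEVICE_TYPES = {1}

def from_dlpack(x):
    device_type, _device_id = x.__dlpack_device__()
    if device_type not in SUPPORTED_DEVICE_TYPES:
        raise RuntimeError(
            f"from_dlpack: device type {device_type} is not supported; "
            "transfer the data first with a library-specific method"
        )
    capsule = x.__dlpack__()
    # A real consumer would now wrap the DLTensor inside the capsule in
    # its own array type; that part is library-internal and omitted here.
    return capsule
```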

@oleksandr-pavlyk (Contributor) commented:

It may be good to ask libraries to implement to_bytearray, or some such, which can be used to deal with exchange between libraries with no overlapping device support. It would be like a copy-via-host mechanism.

@oleksandr-pavlyk (Contributor) commented:

I am not sure I am very clear on the warning about device_id in DLPack.

I thought CUDA_VISIBLE_DEVICES and the like can influence which device the runtime associates with device_id=0 for a given process. Once the process has been launched, the mapping is fixed and cannot change, but I am open to being corrected.

@rgommers (Member, Author) commented:

I thought CUDA_VISIBLE_DEVICES and the like can influence which device the runtime associates with device_id=0 for a given process. Once the process has been launched, the mapping is fixed and cannot change, but I am open to being corrected.

See data-apis/consortium-feedback#1 (comment) and comments below it.

@rgommers (Member, Author) commented:

It may be good to ask libraries to implement to_bytearray, or some such, which can be used to deal with exchange between libraries with no overlapping device support. It would be like a copy-via-host mechanism.

In other places I think we have favoured raising exceptions as well, rather than doing device transfers. Once you get an error as a user, doing the manual transfer with a library-specific method like to_numpy is probably fine. to_bytearray is something that won't be used much, even if all libraries were willing to implement it.

@leofang (Contributor) commented Jan 13, 2021

Ah, OK @rgommers @tqchen. Sorry for being nerve-wracked; it's just that this statement

For example, when the producer's data comes from a non-default stream, the producer needs to sync the data to the default stream in order for it to be consumed. Not syncing in that case would lead to undefined behavior.

does not look right to me, and so any conclusion for the default, such as

Always syncing to the default stream would be a good default behavior.

is not sound.

I disagree on two things here:

  1. "The default should be always syncing (on whatever stream)": I think once the stream argument is accepted by all parties, it should be fairly straightforward to avoid syncing whenever possible. In particular, there are use cases in which two libraries working on the same stream, which is known to the user. In such cases syncing is just unnecessary and making syncing as default would make it harder to opt out. Though I feel less strongly on this if it's deemed critical to unblock things.
  2. "If we pick a default stream to sync over, it better be stream 0":
    • On CUDA: It would be very difficult, if not impossible, if Library A is built via setting nvcc --default-stream or -DCUDA_API_PER_THREAD_DEFAULT_STREAM such that stream 0 = 1 (legacy) but Library B has stream 0 = 2 (per-thread), and a user tries to debug a synchronization bug. It's subtle and likely an implementation detail that users usually don't need to worry about. I believe at least RAPIDS is on the process to making stream 2 their default (@kkraus14 probably can comment on this), so such a scenario could happen in the near future.
    • On ROCm: Not choosing a default also avoids the problem that ROCm does not have stream 1 or 2 working.

A different question from the above discussion: are we focusing only on CUDA/ROCm in this PR, leaving other architectures like OpenCL etc. for the future, so that we can talk about streams?

@tqchen commented Jan 13, 2021

Thanks @leofang

The current API design allows the producer/consumer to pass in streams and implement the optimization mentioned in this point. So it won't block any of the features; the producer and consumer are free to implement your proposal, by passing in the explicit stream of interest and checking the stream.

The main topic of interest, however, is to also recommend a default for the case when developers want simpler implementations (e.g. the program only works on the default stream, since that is a common way to get started). For those developers, having None default to the legacy stream makes sense.

Additionally, I believe we are saying "None defaults to the (legacy) default stream". In CUDA that means 1 (or 0 when you use the default flag in nvcc), and in ROCm it means 0. In short, it is the stream choice you get when you compile a kernel launch that does not specify a stream. Having such a choice won't block any of the more advanced use cases being mentioned, but it would help beginners adopt things without worrying about the choice of the stream argument.

@tqchen commented Feb 17, 2021

Thanks @rgommers, the proposal looks good to me.

@leofang (Contributor) commented Feb 17, 2021

One last minor thing on establishing the stream ordering: Can we provide an env var, say, PYTHON_ARRAY_API_IGNORE_STREAM, for suppressing any potential synchronization done by the Producer? This is for Consumer libraries that cannot provide any meaningful construct to the Producer. In revising the CUDA Array Interface we left such a "backdoor" to ignore streams:

Use of this exception should be avoided where possible, as it is provided for libraries that cannot implement the synchronization semantics without the involvement of the User - for example, those interfacing with third-party libraries oblivious to the CUDA Array Interface.

The most notable example is mpi4py, in which we have no control over any CUDA functionality (it's handled internally by either UCX or the underlying MPI library), so as a Consumer we cannot provide any stream to establish the stream order. Any synchronization (in both MPI and CUDA senses), if needed, must be explicitly done by the Users.

@rgommers (Member, Author) commented:

Can we provide an env var, say, PYTHON_ARRAY_API_IGNORE_STREAM, for suppressing any potential synchronization done by the Producer?

Your rationale for a backdoor makes sense, but can we just use stream=-1? An env var seems pretty ugly API-wise.

@leofang (Contributor) commented Feb 17, 2021

Can we provide an env var, say, PYTHON_ARRAY_API_IGNORE_STREAM, for suppressing any potential synchronization done by the Producer?

Your rationale for a backdoor makes sense, but can we just use stream=-1? An env var seems pretty ugly API-wise.

@rgommers You meant the Consumer should set it to -1 and the Producer should explicitly check if -1 is given, right? Fine with me. An env var gives Users something of a handle to control the behavior, but if a User forgets to set it, it's doomed, so this is a tradeoff on which I have no strong opinion.

@rgommers (Member, Author) commented:

You meant the Consumer should set it to -1 and the Producer should explicitly check if -1 is given, right?

Yes indeed.
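Putting together the convention this thread converges on, here is a hedged producer-side sketch (None/1 = legacy default stream, 2 = per-thread default stream, -1 = skip synchronization); the `_sync_with_stream` and `_to_dlpack_capsule` helpers are hypothetical stand-ins for library-internal logic:

```python
class Array:
    """Toy CUDA producer illustrating the stream codes discussed above."""

    def __dlpack__(self, *, stream=None):
        if stream == -1:
            pass  # consumer opted out: the user must ensure stream ordering
        elif stream is None or stream == 1:
            self._sync_with_stream(1)       # legacy default stream
        elif stream == 2:
            self._sync_with_stream(2)       # per-thread default stream
        elif isinstance(stream, int) and stream > 2:
            self._sync_with_stream(stream)  # a specific consumer stream
        else:
            raise ValueError("stream=0 is disallowed (ambiguous meaning)")
        return self._to_dlpack_capsule()

    def _sync_with_stream(self, stream):
        ...  # e.g. record an event on the producer stream and wait on it

    def _to_dlpack_capsule(self):
        ...  # construct and return the DLPack PyCapsule
```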

@rgommers (Member, Author) commented:

An env var gives Users something of a handle to control the behavior, but if a User forgets to set it, it's doomed, so this is a tradeoff on which I have no strong opinion.

I'm actually not quite sure I understand this. Can you think of cases where an array library will be used both in settings where it does and doesn't know if the user must be involved in stream handling?

In that case the library must expose some API to the user to control its stream handling, which could be an env var or a regular API (e.g. adding disable_stream_handling= to from_dlpack). Either way, I think that's up to the library to figure out; I believe it shouldn't be in the public API of the array API standard.

@leofang (Contributor) commented Feb 19, 2021

An env var gives Users something of a handle to control the behavior, but if a User forgets to set it, it's doomed, so this is a tradeoff on which I have no strong opinion.

I'm actually not quite sure I understand this. Can you think of cases where an array library will be used both in settings where it does and doesn't know if the user must be involved in stream handling?

As discussed in yesterday's call, I was mainly thinking that an env var could give superusers a knob to tweak the behavior, but it's probably better to restrict this to the needs of library implementors, for whom setting stream to -1 or some sentinel object is enough.

@leofang (Contributor) left a review comment:

I haven't done a full review pass yet (will do so ASAP), but in case I forget: I think we had better set a minimum requirement on DLPACK_VERSION, so that everyone is importing the latest header.

cc: @tqchen

@rgommers (Member, Author) commented:

I haven't done a full review pass yet (will do so ASAP), but in case I forget: I think we had better set a minimum requirement on DLPACK_VERSION, so that everyone is importing the latest header.

Given that v0.4 is ABI-compatible with v0.2 and v0.3, and that (I assume) an ABI change will come with a major version increase, I guess the required range is 0.2 <= DLPACK_VERSION < 1.0. The latest would be nice, but it shouldn't matter; once we require complex number support, though, it needs to be >= 0.4.
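A tiny sketch of that constraint, assuming the header version is surfaced to Python as a (major, minor) tuple (how a library actually exposes it is an assumption of this example, not something the standard specifies):

```python
# Sketch of the proposed DLPack header version constraint.
MIN_VERSION = (0, 2)  # earliest ABI-compatible release
MAX_VERSION = (1, 0)  # exclusive: assume a major bump may break the ABI

def check_dlpack_version(version):
    if not (MIN_VERSION <= version < MAX_VERSION):
        raise RuntimeError(
            f"unsupported dlpack.h version {version}; need a version in "
            f"[{MIN_VERSION}, {MAX_VERSION})"
        )

check_dlpack_version((0, 4))  # e.g. v0.4, needed once complex support lands
```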

@rgommers (Member, Author) commented:

The one other comment on the new `__dlpack_device__` was to change the integer for the device to a Python enum; I'm looking at that now.
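For reference, a sketch of what such an enum could look like, with values taken from DLPack's DLDeviceType; the exact member list the standard adopts may differ:

```python
import enum

# Sketch of a device-type enum mirroring DLPack's DLDeviceType codes.
class DLDeviceType(enum.IntEnum):
    CPU = 1          # kDLCPU
    CUDA = 2         # kDLGPU
    CPU_PINNED = 3   # kDLCPUPinned
    OPENCL = 4       # kDLOpenCL
    VULKAN = 7       # kDLVulkan
    METAL = 8        # kDLMetal
    VPI = 9          # kDLVPI
    ROCM = 10        # kDLROCM

# __dlpack_device__ would then return e.g. (DLDeviceType.CUDA, 0)
# instead of a bare integer pair.
```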

@rgommers (Member, Author) commented:

Updated the PR for all remaining comments. I'd like to merge this by the end of the weekend and, if there are more comments after that, do a follow-up PR. Reason: this should be visible in the HTML docs, and we're about to send the adoption proposal for NumPy out for review, where I'm sure there'll be some discussion around DLPack.

@leofang (Contributor) left a review comment:

LGTM except for two minor points.

Two review comments on spec/design_topics/data_interchange.md (both resolved).
Also fix a couple of small textual things.
@rgommers merged commit e2474ce into data-apis:main on Feb 21, 2021
@rgommers (Member, Author) commented:

Okay, in it goes. Thanks @tqchen, @leofang, @kkraus14, and @oleksandr-pavlyk!

@leofang (Contributor) commented Feb 21, 2021

Thanks @rgommers @tqchen for carrying out most of the hard work!
