
Add Python API, semantics and implementation details for DLPack #106

Merged 12 commits into data-apis:main on Feb 21, 2021

Conversation

@rgommers (Member, Author) commented Dec 30, 2020

Related to the discussion at data-apis/consortium-feedback#1; this addresses all open items in that discussion. It's somewhat tricky to figure out what to add here, though, and what to refer to DLPack itself for.

We should upstream some of this documentation to DLPack as well, so this becomes a summary. Right now DLPack doesn't have any documentation of its Python-level API, and some of the other content here isn't very clearly documented yet either (it mostly came from explanations by @tqchen in the issue linked above).

Given that reviewers may not want to build the Sphinx docs, here's a screenshot of the last half (which has visuals):

[screenshot of the rendered docs]


- **stream**: _Optional[int]_

  - If given, the CUDA or ROCm stream number the consumer will use. Default is `None`, which means the legacy default stream.
@rgommers (Member, Author) commented:
Note that this is different from how `stream` is specified in https://numba.readthedocs.io/en/latest/cuda/cuda_array_interface.html#python-interface-specification. I actually don't understand that spec: it says that `None` means no synchronization is needed, uses 1/2 for the legacy/per-thread default streams, and uses other integers for non-default streams. That seems odd: what if a non-default stream in use happens to have the number 2, for example?

Using:

  • None: legacy default stream
  • 0: per-thread default stream
  • 1, 2, ...: non-default stream numbers

seems to make more sense. @leofang, am I missing something there?

@leofang (Contributor) commented:
@rgommers None is actually confusing across libraries. For example, in Numba, None refers to Numba's default stream (to be distinguished from "CUDA's (whichever) default stream"), whereas in CuPy, None simply refers to CUDA's default stream, which in turn is 1 (the legacy default stream), though we're in the process of adopting 2 (the per-thread default stream). 0 is not acceptable either, for the same reason: its meaning depends on how the libraries containing CUDA code are compiled and on the runtime behavior defined in the Python hooks.

Note that in CUDA you don't get to choose the stream numbers: CUDA macro-defines 1 for cudaStreamLegacy and 2 for cudaStreamPerThread, which CAI v3 followed. Any user/non-default stream created via cudaStreamCreate() is guaranteed to start at 3 or later. (In fact, the CUDA driver reserves a stream pool internally, so the actual start number is well past 3.) I hope this makes CAI v3 clearer.

I'll try to catch up on the rest of the discussion here, as well as in the DLPack repo, after Monday (tomorrow)... 😅
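A minimal sketch of the CAI v3 stream encoding described above, to make the numbering concrete; the constant and function names here are illustrative only, not part of any spec:

```python
# Illustrative sketch of the CUDA Array Interface v3 stream encoding;
# names are made up for clarity.
LEGACY_DEFAULT_STREAM = 1      # cudaStreamLegacy
PER_THREAD_DEFAULT_STREAM = 2  # cudaStreamPerThread

def describe_cai_stream(stream):
    """Interpret the `stream` value a producer receives under CAI v3."""
    if stream is None:
        return "no synchronization is needed"
    if stream == 0:
        raise ValueError("0 is disallowed: ambiguous between the legacy "
                         "and per-thread default streams")
    if stream == LEGACY_DEFAULT_STREAM:
        return "synchronize on the legacy default stream"
    if stream == PER_THREAD_DEFAULT_STREAM:
        return "synchronize on the per-thread default stream"
    return f"synchronize on user stream {stream} (user streams start at 3+)"
```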

@leofang (Contributor) commented:

By the way, fun fact: in HIP, syncing over stream 1 or 2 leads to a segfault, as HIP does not support them: cupy/cupy#4458 (comment).

@rgommers (Member, Author) commented:

Thanks for the explanation and links @leofang. I updated it to match `__cuda_array_interface__`.

@leofang (Contributor) commented:

Thanks, @rgommers. Apologies, I realized my first sentence wasn't complete; it should have been:

None is actually confusing across libraries, so we decided to use it to mark the situation "there is no need to do any kind of synchronization for whatever reason, just take the device pointer and do your work".

This matches `__cuda_array_interface__`.
@rgommers added the "RFC: Request for comments. Feature requests and proposed changes." label on Jan 12, 2021
@rgommers (Member, Author) commented:

I think this is complete enough for now, and dmlc/dlpack#57 is converging. I'd like to merge this PR if everyone is happy with that, so it shows up in the rendered HTML docs. Then we can focus on summarizing dmlc/dlpack#57 in that repo; I think that should cover the topic.

@oleksandr-pavlyk (Contributor) commented:

I would like to render the document and read through it. Please allow me a couple of hours.

@leofang (Contributor) commented Jan 12, 2021

I'll try to revisit this tonight.

@tqchen commented Jan 13, 2021

I still think it might be useful to map None to the (legacy) default stream (the default configuration of CUDA/ROCm's stream 0), since "no synchronization" can be ambiguous in the GPU case.

For example, when the producer's data comes from a non-default stream, the producer needs to sync the data to the default stream in order for it to be consumed. Not syncing in that case would lead to undefined behavior. Always syncing to the default stream would be a good default behavior.

@rgommers (Member, Author) commented Jan 13, 2021

Thanks @tqchen, makes sense. If that'd really be confusing with Numba as @leofang says, the alternative may be to specify that stream must not be None if the consumer supports CUDA/ROCm.

1. None: CUDA/ROCm streams are not supported,

instead of

2. None: no synchronization (default),

or

3. None: legacy default stream (default).

That may be a bit more explicit.

@tqchen commented Jan 13, 2021

Thanks @rgommers.

I believe the main purpose of a default behavior is to serve as a recommended starting point for developers. Option 1 means we have no recommendation for developers in the CUDA/ROCm case.

On the other hand, there is a common starting point for developers writing GPU kernels: I learned GPU programming by launching kernels without a stream argument, in which case the behavior relies on the legacy default stream. Although stream 0 is ambiguous in CUDA, as @leofang mentioned, the default behavior of 0 still falls back to the legacy default stream. Additionally, it seems that ROCm also currently only supports the legacy default stream behavior.

These reasons still favor option 3: from the experience of CUDA/ROCm development, there is a default starting point (using the legacy stream), and having the default match that starting point is good. It also makes the default API work for the CPU, CUDA, and ROCm cases alike.

@leofang (Contributor) commented Jan 13, 2021

For example, when the producer's data comes from a non-default stream, the producer needs to sync the data to the default stream in order for it to be consumed. Not syncing in that case would lead to undefined behavior. Always syncing to the default stream would be a good default behavior.

Sorry, @tqchen, I don't get it. I thought that, having gone through the long discussion in dmlc/dlpack#57, we were finally close to an agreement that we sync over the Consumer's stream, but what you're saying above is contrary to everything we've discussed. IIUC you're saying both the Producer and the Consumer need to sync once (the former with the default stream, the latter with its own stream), or did I misread something?

@rgommers (Member, Author) commented:

what you're saying above is contrary to everything we've discussed.

I don't think so; nothing really changes. The only difference is whether we let people who do use the legacy default stream say stream=None as an alias for stream=1, or force them to say stream=1.

@tqchen commented Jan 13, 2021

Right, I think we have agreed on the API convention. The main discussion point is how we specify the recommended default behavior.

@oleksandr-pavlyk (Contributor) commented:

Nit:

DLPack describe the memory layout of strided, n-dimensional arrays.

-> "DLPack describes the memory layout of strided, n-dimensional arrays."

@oleksandr-pavlyk (Contributor) commented Jan 13, 2021

So `from_dlpack` is implemented by the "consumer", and `__dlpack__` is implemented by the "producer". Say the producer is CuPy, and the consumer is NumPy.

CuPy's `__dlpack__` will rightfully return a capsule with the DLDeviceType set to a CUDA GPU code. A CPU-only consumer, i.e. NumPy, will not know what to do with this data and must raise an exception.

The only way out is to rely on CuPy's to_numpy function. The standard should specify how to deal with this scenario. Have I missed it?

@rgommers (Member, Author) commented:

That's a good question @oleksandr-pavlyk. When the producer and consumer don't overlap in device support at all, then indeed there's little that can be done other than raise an error. Most libraries have a to_numpy-like method, but that's clearly not appropriate for a standard (it's specific to one library, so why add to_numpy and not to_jax?).

There is one case where things could work in principle but currently don't: when the producer supports (e.g.) both CPU and GPU, and the consumer only CPU. For GPU arrays this fails today, but it could be made to work (with a device transfer) if the consumer could signal its device support to the producer.

The stance this protocol takes is that the exchange must be zero-copy, or else only the consumer library may decide to copy (though that is recommended against).

We could make the discussion of that in the Semantics section a little more extensive.
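A minimal consumer-side sketch of that "raise rather than transfer" stance; the names below are placeholders for illustration, not part of the standard:

```python
# Hypothetical CPU-only consumer: raise instead of transferring when the
# producer's device is unsupported. kDLCPU == 1 in DLPack's DLDeviceType.
SUPPORTED_DEVICE_TYPES = {1}

def from_dlpack(x):
    device_type, _device_id = x.__dlpack_device__()
    if device_type not in SUPPORTED_DEVICE_TYPES:
        raise RuntimeError(
            f"from_dlpack: device type {device_type} is not supported; "
            "transfer the data first with a library-specific method"
        )
    capsule = x.__dlpack__()
    # A real consumer would now wrap the DLTensor inside the capsule in
    # its own array type; that part is library-internal and omitted here.
    return capsule
```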

@oleksandr-pavlyk (Contributor) commented:

It may be good to ask libraries to implement to_bytearray, or some such, which can be used to deal with exchange between libraries with no overlapping device support. It would be like a copy-via-host mechanism.

@oleksandr-pavlyk (Contributor) commented:

I am not sure I am very clear on the warning about device_id in DLPack.

I thought CUDA_VISIBLE_DEVICES and the like can influence which device the runtime associates with device_id=0 for a given process. Once the process has been launched, the mapping is fixed and cannot change, but I am open to being corrected.

@rgommers (Member, Author) commented:

I thought CUDA_VISIBLE_DEVICES and the like can influence which device the runtime associates with device_id=0 for a given process. Once the process has been launched, the mapping is fixed and cannot change, but I am open to being corrected.

See data-apis/consortium-feedback#1 (comment) and comments below it.

@rgommers (Member, Author) commented:

It may be good to ask libraries to implement to_bytearray, or some such, which can be used to deal with exchange between libraries with no overlapping device support. It would be like a copy-via-host mechanism.

In other places I think we have favoured raising exceptions as well, rather than doing device transfers. Once you get an error as a user, doing the manual transfer with a library-specific method like to_numpy is probably fine. to_bytearray is something that won't be used much, even if all libraries were willing to implement it.

@leofang (Contributor) commented Jan 13, 2021

Ah, OK @rgommers @tqchen. Sorry for being nerve-wracked; it's just that this statement

For example, when the producer's data comes from a non-default stream, the producer needs to sync the data to the default stream in order for it to be consumed. Not syncing in that case would lead to undefined behavior.

does not look right to me, and so any conclusion for the default, such as

Always syncing to the default stream would be a good default behavior.

is not sound.

I disagree on two things here:

  1. "The default should be always syncing (on whatever stream)": I think once the stream argument is accepted by all parties, it should be fairly straightforward to avoid syncing whenever possible. In particular, there are use cases in which two libraries working on the same stream, which is known to the user. In such cases syncing is just unnecessary and making syncing as default would make it harder to opt out. Though I feel less strongly on this if it's deemed critical to unblock things.
  2. "If we pick a default stream to sync over, it better be stream 0":
    • On CUDA: It would be very difficult, if not impossible, if Library A is built via setting nvcc --default-stream or -DCUDA_API_PER_THREAD_DEFAULT_STREAM such that stream 0 = 1 (legacy) but Library B has stream 0 = 2 (per-thread), and a user tries to debug a synchronization bug. It's subtle and likely an implementation detail that users usually don't need to worry about. I believe at least RAPIDS is on the process to making stream 2 their default (@kkraus14 probably can comment on this), so such a scenario could happen in the near future.
    • On ROCm: Not choosing a default also avoids the problem that ROCm does not have stream 1 or 2 working.

A different question from the above discussion: are we focusing only on CUDA/ROCm in this PR, leaving other architectures like OpenCL etc. for the future, so that we can talk about streams?

@tqchen commented Jan 13, 2021

Thanks @leofang

The current API design allows the producer/consumer to pass in streams and implement the optimization mentioned in this point. So it won't block any of the features; the producer and consumer are free to implement your proposal, by passing in the explicit stream of interest and checking the stream.

The main topic of interest, however, is to also recommend a default for the case when developers want simpler implementations (e.g. the program only works on the default stream, since that is a common way to get started). For those developers, having None default to the legacy stream makes sense.

Additionally, I believe we are saying "None defaults to the (legacy) default stream". In CUDA that means 1 (or 0 when you use the default flag in nvcc), and in ROCm it means 0. In short, it is the stream choice you get when you compile a kernel launch that does not specify a stream. Having such a choice won't block any of the more advanced use cases being mentioned, but it would help beginners adopt things without worrying about the choice of the stream argument.

@tqchen commented Feb 17, 2021

Thanks @rgommers, the proposal looks good to me.

@leofang (Contributor) commented Feb 17, 2021

One last minor thing on establishing the stream ordering: Can we provide an env var, say, PYTHON_ARRAY_API_IGNORE_STREAM, for suppressing any potential synchronization done by the Producer? This is for Consumer libraries that cannot provide any meaningful construct to the Producer. In revising the CUDA Array Interface we left such a "backdoor" to ignore streams:

Use of this exception should be avoided where possible, as it is provided for libraries that cannot implement the synchronization semantics without the involvement of the User - for example, those interfacing with third-party libraries oblivious to the CUDA Array Interface.

The most notable example is mpi4py, in which we have no control over any CUDA functionality (it's handled internally by either UCX or the underlying MPI library), so as a Consumer we cannot provide any stream to establish the stream order. Any synchronization (in both MPI and CUDA senses), if needed, must be explicitly done by the Users.

@rgommers (Member, Author) commented:

Can we provide an env var, say, PYTHON_ARRAY_API_IGNORE_STREAM, for suppressing any potential synchronization done by the Producer?

Your rationale for a backdoor makes sense, but can we just use stream=-1? An env var seems pretty ugly API-wise.

@leofang (Contributor) commented Feb 17, 2021

Can we provide an env var, say, PYTHON_ARRAY_API_IGNORE_STREAM, for suppressing any potential synchronization done by the Producer?

Your rationale for a backdoor makes sense, but can we just use stream=-1? An env var seems pretty ugly API-wise.

@rgommers You meant the Consumer should set it to -1 and the Producer should explicitly check if -1 is given, right? Fine with me. An env var gives Users something of a handle to control the behavior, but if a User forgets to set it, it's doomed, so this is a tradeoff on which I have no strong opinion.

@rgommers (Member, Author) commented:

You meant the Consumer should set it to -1 and the Producer should explicitly check if -1 is given, right?

Yes indeed.
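Putting together the convention this thread converges on, here is a hedged producer-side sketch (None/1 = legacy default stream, 2 = per-thread default stream, -1 = skip synchronization); the `_sync_with_stream` and `_to_dlpack_capsule` helpers are hypothetical stand-ins for library-internal logic:

```python
class Array:
    """Toy CUDA producer illustrating the stream codes discussed above."""

    def __dlpack__(self, *, stream=None):
        if stream == -1:
            pass  # consumer opted out: the user must ensure stream ordering
        elif stream is None or stream == 1:
            self._sync_with_stream(1)       # legacy default stream
        elif stream == 2:
            self._sync_with_stream(2)       # per-thread default stream
        elif isinstance(stream, int) and stream > 2:
            self._sync_with_stream(stream)  # a specific consumer stream
        else:
            raise ValueError("stream=0 is disallowed (ambiguous meaning)")
        return self._to_dlpack_capsule()

    def _sync_with_stream(self, stream):
        ...  # e.g. record an event on the producer stream and wait on it

    def _to_dlpack_capsule(self):
        ...  # construct and return the DLPack PyCapsule
```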

@rgommers (Member, Author) commented:

An env var gives Users something of a handle to control the behavior, but if a User forgets to set it, it's doomed, so this is a tradeoff on which I have no strong opinion.

I'm actually not quite sure I understand this. Can you think of cases where an array library will be used both in settings where it does and doesn't know if the user must be involved in stream handling?

In that case the library must expose some API to the user to control its stream handling, which could be an env var or a regular API (e.g. adding disable_stream_handling= to from_dlpack). Either way, I think that's up to the library to figure out; I believe it shouldn't be in the public API of the array API standard.

@leofang (Contributor) commented Feb 19, 2021

An env var gives Users something of a handle to control the behavior, but if a User forgets to set it, it's doomed, so this is a tradeoff on which I have no strong opinion.

I'm actually not quite sure I understand this. Can you think of cases where an array library will be used both in settings where it does and doesn't know if the user must be involved in stream handling?

As discussed in yesterday's call, I was mainly thinking that an env var could give superusers a knob to tweak the behavior, but it's probably better to restrict this to the needs of library implementors, for whom setting stream to -1 or some sentinel object is enough.

@leofang (Contributor) left a review comment:

I haven't done a full review pass yet (will do so ASAP), but in case I forget: I think we had better set a minimum requirement on DLPACK_VERSION, so that everyone is importing the latest header.

cc: @tqchen

@rgommers (Member, Author) commented:

I haven't done a full review pass yet (will do so ASAP), but in case I forget: I think we had better set a minimum requirement on DLPACK_VERSION, so that everyone is importing the latest header.

Given that v0.4 is ABI-compatible with v0.2 and v0.3, and that (I assume) an ABI change will come with a major version increase, I guess the required range is 0.2 <= DLPACK_VERSION < 1.0. The latest would be nice, but it shouldn't matter; once we require complex number support, though, it needs to be >= 0.4.
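A tiny sketch of that constraint, assuming the header version is surfaced to Python as a (major, minor) tuple (how a library actually exposes it is an assumption of this example, not something the standard specifies):

```python
# Sketch of the proposed DLPack header version constraint.
MIN_VERSION = (0, 2)  # earliest ABI-compatible release
MAX_VERSION = (1, 0)  # exclusive: assume a major bump may break the ABI

def check_dlpack_version(version):
    if not (MIN_VERSION <= version < MAX_VERSION):
        raise RuntimeError(
            f"unsupported dlpack.h version {version}; need a version in "
            f"[{MIN_VERSION}, {MAX_VERSION})"
        )

check_dlpack_version((0, 4))  # e.g. v0.4, needed once complex support lands
```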

@rgommers (Member, Author) commented:

The one other comment on the new `__dlpack_device__` was to change the integer for the device to a Python enum; I'm looking at that now.
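For reference, a sketch of what such an enum could look like, with values taken from DLPack's DLDeviceType; the exact member list the standard adopts may differ:

```python
import enum

# Sketch of a device-type enum mirroring DLPack's DLDeviceType codes.
class DLDeviceType(enum.IntEnum):
    CPU = 1          # kDLCPU
    CUDA = 2         # kDLGPU
    CPU_PINNED = 3   # kDLCPUPinned
    OPENCL = 4       # kDLOpenCL
    VULKAN = 7       # kDLVulkan
    METAL = 8        # kDLMetal
    VPI = 9          # kDLVPI
    ROCM = 10        # kDLROCM

# __dlpack_device__ would then return e.g. (DLDeviceType.CUDA, 0)
# instead of a bare integer pair.
```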

@rgommers (Member, Author) commented:

Updated the PR for all remaining comments. I'd like to merge this by the end of the weekend and, if there are more comments after that, do a follow-up PR. Reason: this should be visible in the HTML docs, and we're about to send the adoption proposal for NumPy out for review, where I'm sure there'll be some discussion around DLPack.

@leofang (Contributor) left a review comment:

LGTM except for two minor points.

Two review comments on spec/design_topics/data_interchange.md (both resolved).
Also fix a couple of small textual things.
@rgommers merged commit e2474ce into data-apis:main on Feb 21, 2021
@rgommers (Member, Author) commented:

Okay, in it goes. Thanks @tqchen, @leofang, @kkraus14, and @oleksandr-pavlyk!

@leofang (Contributor) commented Feb 21, 2021

Thanks @rgommers @tqchen for carrying out most of the hard work!
