SYCL_USM_ARRAY_INTERFACE protocol for DPPY #1
Comments
@diptorupd I am not sure I understand the ambiguity argument. I understand that we want to support consumers who do not (want to) use dpctl, and hence types defined in dpctl should be avoided. Adding a dpctl.queue and dpctl.context would actually introduce the possibility of ambiguity, while the suggested protocol is not ambiguous. Can you explain where the ambiguity is? With respect to data-readiness, we should consider explicit dependence semantics. It seems natural to use SYCL's USM events, and it could also help provide a cleaner/more explicit async feature in numba.
Relevant to this issue is DLPack's discussion of synchronization semantics.
I would rephrase "Provide information to a consumer about how to copy the data." as "Provide a consumer with information necessary to copy the data." It is at the consumer's discretion whether to copy via p2p or via host. SUAI is not ambiguous about item 2, but it does introduce a dependency on dpctl. The DLPack team's request was to support named capsules.
Memory data location is known only for a concrete device, not for a context. The context does not contain the whole information about the data: it does not know about the device where the data were (potentially) allocated and therefore where they can be accessed most efficiently. So it is strange to have the context as one of the options. Why can't we replace the context with the queue?
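To illustrate the point, a minimal sketch using dpctl names (any SYCL binding would look similar): a queue carries both pieces of information, while a context alone does not identify the device.

```python
import dpctl

# A queue knows both the context a USM pointer is bound to and the
# device where access is most efficient; a bare context knows only
# the former.
q = dpctl.SyclQueue("gpu")
ctx = q.sycl_context  # what the USM pointer is bound to
dev = q.sycl_device   # where the data was (potentially) allocated
```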
Will it solve the problems if we extend the protocol?
@oleksandr-pavlyk, why shall capsules be single-use only? How then will sharing the same resources between dppy packages be organized?
I suggest rephrasing that sentence as "Enable a producer to inform a consumer how to access data, i.e. shared USM pointer and SYCL context the pointer is bound to". The internal workings of USM shared memory are not pertinent to the ability to share data.
Per SYCL, all you need is a USM pointer and a `sycl::context`. The SUAI does not insist on having a queue, to enable passing just the minimal information required. I do not anticipate this usage to be common. I also think the most commonly used value for `syclobj` will be the default context of a root device.
We could, but the feedback I received from the community (data-apis/consortium-feedback#1 (comment), data-apis/consortium-feedback#1 (comment), dmlc/dlpack#57 (comment)) is that imposing on consumers of SYCL USM arrays to work with SYCL native objects like events is discouraged. Instead, it was proposed that synchronization be performed by the consumer handing the producer an in-order queue; the producer submits into it a trivial kernel which depends on its internal events. The consumer then submits a trivial kernel into the in-order queue and uses the returned event in its own, possibly out-of-order, queues, thus achieving the needed synchronization.
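A sketch of that handshake, assuming dpctl's `SyclQueue(property="in_order")` and `SyclQueue.submit_barrier()`; the function names are illustrative, not part of SUAI:

```python
import dpctl

# Consumer creates an in-order queue and hands it to the producer.
sync_q = dpctl.SyclQueue(property="in_order")

def producer_publish(sync_q, producer_events):
    # Producer: a trivial submission into the in-order queue that
    # depends on the producer's internal events.
    sync_q.submit_barrier(producer_events)

def consumer_acquire(sync_q):
    # Consumer: a trivial submission whose event can be used as a
    # dependency in the consumer's own (possibly out-of-order) queues.
    return sync_q.submit_barrier()
```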
The producer creates a copy, packages its reference and class destructor in a capsule, and shares it with the consumer. Rinse and repeat for different consumer packages. Regarding why "single use": because subsequent operations on the queue modify the internal structure of the queue. The queue may well be thread-safe, and that is OK. The practice of renaming a capsule after use is from cuDF, see data-apis/consortium-feedback#1 (comment). The link to the source code in that comment is stale; please use https://github.com/rapidsai/cudf/blob/branch-0.19/python/cudf/cudf/_lib/interop.pyx#L34
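For what it's worth, the rename-after-use practice can be sketched from pure Python via `ctypes`; the capsule name `"SyclQueueRef"` is an assumption for illustration, and C extensions would call the same `PyCapsule_*` APIs directly:

```python
import ctypes

capi = ctypes.pythonapi
capi.PyCapsule_GetPointer.restype = ctypes.c_void_p
capi.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]
capi.PyCapsule_SetName.restype = ctypes.c_int
capi.PyCapsule_SetName.argtypes = [ctypes.py_object, ctypes.c_char_p]

def consume_queue_capsule(capsule):
    # Extract the queue pointer, then rename the capsule so no other
    # consumer mistakes it for a fresh, unconsumed reference.
    ptr = capi.PyCapsule_GetPointer(capsule, b"SyclQueueRef")
    capi.PyCapsule_SetName(capsule, b"used_SyclQueueRef")
    return ptr
```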
I'm sorry, but I don't see real problems with using wrappers over native SYCL objects like events internally. We already have wrappers over such objects; a consumer could simply do `dpctl.wait(obj.__sycl_usm_array_iface__['eventlist'])`.
The feedback I received from DLPack maintainers encouraged us to work out an interoperability protocol that avoids imposing SYCL native objects on consumers. Essentially, the packages exchange tokens (representable as native Python objects) that can be traded with the offload runtime for the actual run-time objects. This does not mean we cannot exchange these objects internally, but there should be a common path where this is not required, like working with root devices and their default contexts.
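A minimal sketch of the token idea: the producer publishes a plain string (here a SYCL filter selector, an assumed choice of token), and the consumer trades it with the runtime for an actual queue on demand.

```python
import dpctl

token = "level_zero:gpu:0"      # a native Python object, no dpctl types needed
queue = dpctl.SyclQueue(token)  # consumer redeems the token for a real queue
```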
We need a protocol similar to the `__cuda_array_interface__` (CAI) protocol for DPPY. @oleksandr-pavlyk has already drafted an initial version of a CAI-like protocol called `__sycl_usm_array_interface__` (SUAI) in [1]. This ticket opens the discussion on finalizing SUAI before we publish it publicly.

There are three main goals for the CAI (and, by corollary, the SUAI) protocol:

1. Enable a producer to inform a consumer how the data is stored in USM shared memory.
2. Provide information to a consumer about how to copy the data. For CUDA pointers `cuPointerGetAttribute` is typically used; for SYCL, `get_pointer_type` is a similar function for USM pointers.
3. Specify synchronization semantics, i.e., when it is safe for a consumer to use the data.

The third point has stalled CAI's adoption in TensorFlow [2] and led to a year-long debate [6] before the v3 revision of CAI [4] was finalized in Numba 0.53.
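To make goal 2 concrete, a hedged sketch of the SYCL-side query, assuming dpctl's memory API (`as_usm_memory` and `get_usm_type` as the Python counterparts of `sycl::get_pointer_type`; exact names may vary by dpctl version):

```python
import dpctl.memory as dpmem

def usm_kind(obj):
    # View over any object exposing the SUAI dictionary.
    mem = dpmem.as_usm_memory(obj)
    return mem.get_usm_type()  # "device", "shared", or "host"
```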
Limitations of the current SUAI:
A) SUAI is ambiguous about point 2).
The `syclobj` attribute of SUAI can be either a `dpctl.SyclQueue`, or a `dpctl.SyclContext`, or a 3-tuple identifying a specific device.

Although our proposal allows for multiple ways of specifying the `sycl::queue` or `sycl::context`, this is not foolproof. The decision as to how the SUAI is parsed at the consumer end cannot be controlled by the producer. A producer may return a `dpctl.SyclContext` inside SUAI, but a consumer that does not use `dpctl` will be unable to consume the pointer.

This is also pertinent for cases where the consumer is completely oblivious of the SYCL API and relies on some lower-level implementation to deal with a CUDA device/SYCL USM pointer. In [4], a user (leofang) gives mpi4py as a specific example: mpi4py is completely CUDA-unaware and relies on the underlying MPI library to deal with the CUDA pointer in whichever way the MPI library sees fit.
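For concreteness, here is what an exported SUAI dictionary can look like for a dpctl USM-shared allocation (illustrative output; the exact keys and values follow the draft in [1] and may differ by version):

```python
import dpctl
import dpctl.memory as dpmem

q = dpctl.SyclQueue()
mem = dpmem.MemoryUSMShared(4096, queue=q)  # 4 KB of USM-shared memory
print(mem.__sycl_usm_array_interface__)
# e.g. {'data': (140234..., False), 'shape': (4096,), 'strides': None,
#       'typestr': '|u1', 'version': 1, 'syclobj': <dpctl.SyclQueue ...>}
```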
I feel we should either separate out the various options for `syclobj` and add them as separate attributes to the dictionary, at the cost of making the protocol verbose, or explore an alternative similar to CAI v3 that stores a capsule containing a SYCL queue in SUAI.

B) We have not addressed the synchronization requirements that CAI v3 addresses.
How do we guarantee that the consumer can safely use the data pointer?
CAI v3 addresses the issue by adding an optional `stream` entry to the interface dictionary: the producer exports the stream on which the data is valid, and the consumer is responsible for synchronizing with that stream before accessing the data.
The synchronization question needs to be addressed, either by explicitly stating that SUAI is a blocking protocol or by designing something similar to CAI.
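If the first option were chosen and SUAI declared a blocking protocol, a producer would simply quiesce its queue before exporting. A sketch (`export_blocking` is a hypothetical helper, not part of any spec):

```python
import dpctl

def export_blocking(usm_array, queue: dpctl.SyclQueue):
    # Block until all kernels touching the data have completed,
    # so the consumer may use the pointer immediately.
    queue.wait()
    return usm_array.__sycl_usm_array_interface__
```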
[1] https://github.com/IntelPython/dpctl/wiki/Zero-copy-data-exchange-using-SYCL-USM
[2] tensorflow/tensorflow#29039
[3] https://numba.pydata.org/numba-doc/dev/cuda/cuda_array_interface.html
[4] https://gmarkall.github.io/numba-rtd-theme/cuda/cuda_array_interface.html
[5] numba/numba#4933
[6] numba/numba#5162