Update the documentation for building Pybind11 SYCL Backend with CUDA #1843

sreerajkksd · 2024-09-17T09:51:06Z

Hi, I'm trying to build the pybind11 extension from the DPCTL onemkl_gemv example with CUDA:
https://github.com/IntelPython/dpctl/tree/master/examples/pybind11/onemkl_gemv

The example as written does not pass all of its test cases. The build works with the following changes, but some tests still fail:

	--- a/examples/pybind11/onemkl_gemv/CMakeLists.txt
	+++ b/examples/pybind11/onemkl_gemv/CMakeLists.txt
	@@ -41,6 +41,9 @@ pybind11_add_module(${py_module_name}
	     ${_sources}
	 )
	 add_sycl_to_target(TARGET ${py_module_name} SOURCES ${_sources})
	+target_compile_options(${py_module_name} PRIVATE -fsycl-targets=nvptx64-nvidia-cuda)
	+target_link_options(${py_module_name} PRIVATE -fsycl-targets=nvptx64-nvidia-cuda)
	+
	 target_compile_definitions(${py_module_name} PRIVATE -DMKL_ILP64)
	 target_include_directories(${py_module_name}
	     PUBLIC ${MKL_INCLUDE_DIR} sycl_gemm

I also had to add an extra flag while building sycl_gemv:

-DDpctl_DIR=<DPCTL_DIR>/cmake

Sample reproducer:

SYCL_PI_TRACE=1 python3 -c 'import dpctl; import dpctl.tensor as dpt; import numpy as np; from sycl_gemm import gemv; q = dpctl.SyclQueue(); Mnp, vnp = np.random.randn(5, 3), np.random.randn(3); M = dpt.asarray(Mnp, sycl_queue=q); v = dpt.asarray(vnp, sycl_queue=q); r = dpt.empty((5,), dtype=v.dtype, sycl_queue=q); hev, ev = gemv(q, M, v, r, []); hev.wait(); rnp = dpt.asnumpy(r);'

Running this fails with:

SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_opencl.so [ PluginVersion: 15.47.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_level_zero.so [ PluginVersion: 15.47.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_cuda.so [ PluginVersion: 15.49.1 ]
SYCL_PI_TRACE[basic]: Plugin found and successfully loaded: libpi_unified_runtime.so [ PluginVersion: 15.47.1 ]
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]:   device: NVIDIA A100 80GB PCIe
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]:   device: NVIDIA A100 80GB PCIe
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Requested device_type: info::device_type::automatic
SYCL_PI_TRACE[all]: Selected device: -> final score = 1500
SYCL_PI_TRACE[all]:   platform: NVIDIA CUDA BACKEND
SYCL_PI_TRACE[all]:   device: NVIDIA A100 80GB PCIe
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Native API failed. Native API returns: -42 (PI_ERROR_INVALID_BINARY) -42 (PI_ERROR_INVALID_BINARY)

Looking at the source that is invoked, the failure happens when executing the following code (from GitHub):

    if (v_typenum == api.UAR_DOUBLE_) {
        using T = double;
        sycl::event gemv_ev = oneapi::mkl::blas::row_major::gemv(
            q, oneapi::mkl::transpose::nontrans, n, m, T(1),
            reinterpret_cast<T *>(mat_typeless_ptr), m,
            reinterpret_cast<T *>(v_typeless_ptr), 1, T(0),
            reinterpret_cast<T *>(r_typeless_ptr), 1, depends);
        res_ev = gemv_ev;
    }
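
For reference, the call above asks for r = alpha * M @ v + beta * r with alpha = 1 and beta = 0, where M is stored row-major as n rows of m elements (leading dimension m). Below is a minimal pure-Python sketch of that computation, which can be used to check the result returned by the extension once it runs. The helper name `gemv_row_major_ref` is hypothetical and is not part of dpctl or oneMKL.

```python
def gemv_row_major_ref(n, m, alpha, a, lda, x, incx, beta, y, incy):
    """Reference for non-transposed, row-major gemv: y = alpha*A@x + beta*y.

    a is a flat sequence holding an n-by-m matrix with row stride lda;
    x has m logical elements with stride incx, y has n with stride incy.
    Only positive strides are handled in this sketch.
    """
    out = list(y)
    for i in range(n):
        acc = 0.0
        for j in range(m):
            acc += a[i * lda + j] * x[j * incx]
        out[i * incy] = alpha * acc + beta * y[i * incy]
    return out


# 2x3 matrix [[1, 2, 3], [4, 5, 6]] times the vector [1, 0, -1]
flat = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
result = gemv_row_major_ref(
    2, 3, 1.0, flat, 3, [1.0, 0.0, -1.0], 1, 0.0, [0.0, 0.0], 1
)
```

In the reproducer above, n = 5 and m = 3, so the flattened `Mnp` would be passed with `lda = 3`.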

... and SYCL_PI_TRACE=-1 reported:

    ---> piextDeviceSelectBinary(
            <unknown> : 0x67c2de0
            <unknown> : 0x68d3780
            <unknown> : 1
            <unknown> : 0x7ffcb6131ebc
    ) --->  pi_result : -42
            [out]<unknown> ** : 0x68d3780[ 0x7f37efe416b0 ... ]

python -m dpctl --full-list reports the following:

> python -m dpctl --full-list
Platform  0 ::
    Name        Intel(R) OpenCL
    Version     OpenCL 3.0 LINUX
    Vendor      Intel(R) Corporation
    Backend     opencl
    Num Devices 1
      # 0
        Name                Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
        Version             2024.18.7.0.11_160000
        Filter string       opencl:cpu:0
Platform  1 ::
    Name        NVIDIA CUDA BACKEND
    Version     CUDA 12.5
    Vendor      NVIDIA Corporation
    Backend     ext_oneapi_cuda
    Num Devices 1
      # 0
        Name                NVIDIA A100 80GB PCIe
        Version             CUDA 12.5
        Filter string       cuda:gpu:0


oleksandr-pavlyk · 2024-09-17T23:00:14Z

@sreerajkksd Thank you for your interest. I'll try to answer briefly and refer you to our poster at SciPy 2024, https://intelpython.github.io/portable-data-parallel-extensions-scipy-2024/

The poster companion material, https://github.com/IntelPython/example-portable-data-parallel-extensions/tree/main, contains examples of building Python extensions using DPC++ and targeting NVIDIA GPUs, including one that uses oneMKL.

This example in DPCTL is written to be built with the oneAPI MKL library (https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html). The BLAS portion of this library provides implementations for x86-64 CPUs and for SPIR-capable devices. In particular, the library does not contain offload sections for NVIDIA GPUs or AMD GPUs.

The oneMKL interface library, https://github.com/oneapi-src/oneMKL, is a C++ library that uses the oneAPI MKL library for CPU and SPIR devices, cuBLAS/cuSOLVER for NVIDIA GPUs, and rocBLAS/rocSOLVER for AMD GPUs. It needs to be built, and I'd refer you to the poster material and documentation for more details.

It is a good idea to provide references to that material in the README of this dpctl example, though! Thanks for the suggestion.
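
Given that, a script using an extension linked against the oneAPI MKL library could check the target device before calling into BLAS. Here is a rough sketch that works on dpctl filter strings such as those printed by `python -m dpctl --full-list`. The helper names and the set of backends treated as SPIR-capable (`opencl`, `level_zero`) are assumptions made for illustration and are not part of dpctl.

```python
# Assumption: backends whose devices consume SPIR-V device code.
SPIR_BACKENDS = {"opencl", "level_zero"}


def backend_of(filter_string):
    """Return the backend part of a 'backend:device_type:index' string."""
    return filter_string.split(":", 1)[0].strip().lower()


def mkl_blas_can_offload(filter_string):
    """True if the device's backend is in the assumed SPIR-capable set."""
    return backend_of(filter_string) in SPIR_BACKENDS
```

With the two devices listed in the report above, `mkl_blas_can_offload("opencl:cpu:0")` is true and `mkl_blas_can_offload("cuda:gpu:0")` is false, which matches the failure seen on the default (CUDA) queue.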

ndgrigorian · 2024-09-18T02:24:49Z

> I also had to add an additional flag as well while building sycl_gemv:
> -DDpctl_DIR=<DPCTL_DIR>/cmake

Yes, I also added the following option when building:
-DDpctl_ROOT=$(python -m dpctl --cmakedir)

The example should be updated to account for this as well.
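
As a possible starting point for such an update, here is a small sketch of assembling the configure command from Python. It assumes `cmake` is on the PATH and that dpctl is installed; the function names are hypothetical, and any other options the example needs (for instance the MKL location) are omitted.

```python
import subprocess
import sys


def dpctl_cmake_dir():
    """Ask the installed dpctl for its CMake directory (requires dpctl)."""
    out = subprocess.run(
        [sys.executable, "-m", "dpctl", "--cmakedir"],
        check=True,
        capture_output=True,
        text=True,
    )
    return out.stdout.strip()


def configure_command(source_dir, build_dir, cmake_dir):
    """Build the CMake configure command line as an argument list."""
    return [
        "cmake",
        "-S", source_dir,
        "-B", build_dir,
        f"-DDpctl_ROOT={cmake_dir}",
    ]
```

A caller would then run something like `subprocess.run(configure_command(".", "build", dpctl_cmake_dir()), check=True)` from the example directory.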
