Add example calling cuRAND #40

Merged (6 commits into numba:master, Jun 27, 2022)

Conversation

gmarkall (Member) commented Dec 6, 2021

This is an example based on the cuRAND device API example from the cuRAND documentation.

This requires numba/numba#7621

seibert (Contributor) commented Dec 6, 2021

This is great! I hope we can eventually bring this shim into the CUDA target itself and offer a well-supported path for using cuRAND.

Instead of putting logic in the shim file and calling it from relatively
bare kernels in Python, the logic is now moved into Python. This required
adding a Numba extension to support the curandState data type natively
in Numba.
gmarkall (Member, Author) commented Dec 8, 2021

@seibert The shim functions in the previous commit had some of the example's logic in them - I've updated the shim so that it's just a thin layer wrapping each cuRAND function (well, the two that are supported for now), and moved all the logic into the Numba kernels.

It would be good to move this into Numba eventually (ASAP). One thing that makes me feel a little uncertain at the moment is the way we handle arrays of cuRAND states. Because the C/C++ functions generally take a pointer to a given state, usually computed as &states[i] in C/C++, I made getitem and setitem on a CurandStates object behave a little differently from normal: getitem always returns a new CurandStates object pointing at the indicated element, which can then be passed to a cuRAND function, while setitem copies the state from the RHS of the assignment into the indicated element. So there's a little bit of asymmetry here. This might not be too problematic because the states are mainly treated as opaque objects by the user, but it makes me a bit uneasy that the design may not be clean / consistent.
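To make the asymmetry concrete, a rough sketch of the semantics described above (illustrative pseudocode only - CurandStates and curand_init refer to this PR's current design, and this snippet is not runnable on its own):

# Inside a kernel, with states a CurandStates object:
s = states[i]                 # getitem: a new CurandStates aliasing element i,
curand_init(seed, seq, 0, s)  # usable where C/C++ would pass &states[i]
states[i] = s                 # setitem: copies the state value into element i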

Do you have any thoughts on the API, whether it seems suitable, or if there's a cleaner way to design / implement the API?

seibert (Contributor) commented Dec 8, 2021

The xoroshiro RNG defines a custom dtype for the state of a given generator, which allows the parallel initialization function on the host to return a GPU array of RNG states of this dtype. Since you can't getitem a reference to an individual NumPy array element, all of the API functions take the full RNG state array and an integer index to a particular state as arguments. Then the device function implementation can modify the state in place:

# From numba.cuda.random: int64/uint32/uint64 are Numba type-casting
# functions and rotl is a bit-rotation device function defined in the
# same module.
def xoroshiro128p_next(states, index):
    '''Return the next random uint64 and advance the RNG in states[index].

    :type states: 1D array, dtype=xoroshiro128p_dtype
    :param states: array of RNG states
    :type index: int64
    :param index: offset in states to update
    :rtype: uint64
    '''
    index = int64(index)
    s0 = states[index]['s0']
    s1 = states[index]['s1']
    result = s0 + s1

    s1 ^= s0
    states[index]['s0'] = uint64(rotl(s0, uint32(55))) ^ s1 ^ (s1 << uint32(14))
    states[index]['s1'] = uint64(rotl(s1, uint32(36)))

    return result

I think this is probably the best option without confusing the user with non-standard Python semantics.
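For reference, the public API built on this pattern in numba.cuda.random can be used like the following (a minimal sketch, separate from this PR):

from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)

@cuda.jit
def fill_uniform(states, out):
    i = cuda.grid(1)
    if i < out.size:
        # The whole state array and this thread's index are passed,
        # rather than a reference to a single state
        out[i] = xoroshiro128p_uniform_float32(states, i)

n = 256
states = create_xoroshiro128p_states(n, seed=1)
out = cuda.device_array(n, dtype='float32')
fill_uniform[4, 64](states, out)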

gmarkall (Member, Author) commented Dec 8, 2021

Ah, that's a great point - so from your suggestion, I understand that a call which would be written in C as:

curand_init(seed, seq, 0, &states[i]);

can instead be written in Python as:

curand_init(seed, seq, 0, states, i)

This also reduces the amount of work in implementing the extension, which I like 😄. Will update this PR accordingly.
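A sketch of what a setup kernel might look like under this convention (curand_init here stands for the shim binding being developed in this PR, so its exact name and signature are assumptions):

from numba import cuda

@cuda.jit
def setup(states, seed):
    i = cuda.grid(1)
    # The state array and this thread's index replace the C pointer
    # &states[i]; curand_init is the (hypothetical) shim device binding
    curand_init(seed, i, 0, states, i)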

seibert (Contributor) commented Dec 8, 2021

Exactly. Numba already uses the same pattern in the API for CUDA atomic functions, again because there isn't a natural way to pass an address in Python. Instead, the user passes the device array and an offset into the array.
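For example, Numba's atomic API takes an array and an index rather than an address (a minimal sketch):

from numba import cuda

@cuda.jit
def histogram(bins, data):
    i = cuda.grid(1)
    if i < data.size:
        # Array-plus-index in place of a C address: atomically
        # increments bins[data[i]]
        cuda.atomic.add(bins, data[i], 1)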

This avoids having weird deviations from Python semantics for cuRAND
states.
gmarkall (Member, Author) commented Dec 8, 2021

Code is now updated - the one remaining thing I will look at is how to give users a way to copy a given state into local memory, which is an optimization used in the examples. But now that we're not deviating from Python / C/C++ API semantics, this should be straightforward to implement.

gmarkall (Member, Author)

Now that numba/numba#7621 is merged, this should be ready to review (I can't set the label myself as I don't have permissions).

sklam (Member) left a review comment:

  1. Add instructions about the cuda-python binding, pointing to NUMBA_CUDA_USE_NVIDIA_BINDING.
  2. I got the following error when running it:
/usr/local/cuda/include/curand_mrg32k3a.h(2138): error: A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode

/usr/local/cuda/include/curand_mrg32k3a.h(2721): error: A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode

/usr/local/cuda/include/curand_mrg32k3a.h(3367): error: A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode

/usr/local/cuda/include/cuda.h(53): catastrophic error: cannot open source file "stdlib.h"

 and 1 catastrophic error detected in the compilation of "shim.cu".
Compilation terminated.

Maybe I am running on hardware and a toolkit that are too old.

gmarkall (Member, Author)

@sklam What hardware / toolkit are you running with?

If CUDA Python is not found, then a message directing the user to the
installation instructions is presented.
gmarkall (Member, Author)

Add instructions about the cuda-python binding, pointing to NUMBA_CUDA_USE_NVIDIA_BINDING.

To address this, I've added a check for CUDA Python, and configured Numba to use the binding if it is present. If it's not present, the user is directed to install it.
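A minimal sketch of such a check (it uses the NUMBA_CUDA_USE_NVIDIA_BINDING environment variable mentioned in the review, but is not necessarily the exact code added to the example):

import os

try:
    from cuda import cuda  # noqa: F401 - the NVIDIA cuda-python binding
    # Must be set before numba.cuda is first used
    os.environ.setdefault('NUMBA_CUDA_USE_NVIDIA_BINDING', '1')
except ImportError:
    raise RuntimeError('This example requires the NVIDIA cuda-python '
                       'bindings; please see the cuda-python installation '
                       'instructions')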

sklam (Member) commented Jun 16, 2022

@sklam What hardware / toolkit are you running with?


__OS Information__
Platform Name                                 : Linux-4.4.0-87-generic-x86_64-with-glibc2.23
Platform Release                              : 4.4.0-87-generic
OS Name                                       : Linux
OS Version                                    : #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017
OS Specific Version                           : ?
Libc Version                                  : glibc 2.23
...
__CUDA Information__
CUDA Device Initialized                       : True
CUDA Driver Version                           : 11.2
CUDA Runtime Version                          : 11.0
CUDA NVIDIA Bindings Available                : True
CUDA NVIDIA Bindings In Use                   : False
CUDA Detect Output:
Found 2 CUDA devices
id 0     b'GeForce GTX 1080'                              [SUPPORTED]
                      Compute Capability: 6.1
                           PCI Device ID: 0
                              PCI Bus ID: 101
                                    UUID: GPU-15943929-e7ca-0156-3746-96a4146da70d
                                Watchdog: Disabled
             FP32/FP64 Performance Ratio: 32
id 1           b'Tesla K40c'                 [SUPPORTED (DEPRECATED)]
                      Compute Capability: 3.5
                           PCI Device ID: 0
                              PCI Bus ID: 23
                                    UUID: GPU-a916b1e7-41a8-310e-7426-a2a3e7ffb2ee
                                Watchdog: Disabled
             FP32/FP64 Performance Ratio: 3
Summary:
	2/2 devices are supported

CUDA Libraries Test Output:
Finding nvvm from Conda environment
	named  libnvvm.so.3.3.0
	trying to open library...	ok
Finding cudart from Conda environment
	named  libcudart.so.11.0.221
	trying to open library...	ok
Finding cudadevrt from Conda environment
	named  libcudadevrt.a
Finding libdevice from Conda environment
	trying to open library...	ok

gmarkall (Member, Author)

@sklam You have the CUDA 11.0 toolkit from the conda environment - is the version of the toolkit at /usr/local/cuda also 11.0, or is it something older?

gmarkall (Member, Author)

Also, does the example work if you apply the following change to your Numba installation?

diff --git a/numba/cuda/cudadrv/driver.py b/numba/cuda/cudadrv/driver.py
index 3bd84978f..cc91124fb 100644
--- a/numba/cuda/cudadrv/driver.py
+++ b/numba/cuda/cudadrv/driver.py
@@ -2767,7 +2767,8 @@ class NvrtcProgram:
         #   being optimized away.
         arch = f'--gpu-architecture=compute_{major}{minor}'.encode()
         include = f'-I{config.CUDA_INCLUDE_PATH}'.encode()
-        opts = [arch, include, b'-rdc', b'true']
+        opts = [arch, include, b'-rdc', b'true', b'-default-device',
+                b'-std=c++11']
 
         # Compile the program
         err, = nvrtc.nvrtcCompileProgram(self._program, len(opts), opts)

sklam (Member) commented Jun 22, 2022

RE: #40 (comment)

It's CUDA Version 10.0.130.

RE: #40 (comment)

With the patch, it gives:

  File "/path/to/numba/numba/cuda/cudadrv/driver.py", line 2789, in __init__
    raise NvrtcError(msg)
numba.cuda.cudadrv.driver.NvrtcError: NVRTC Compilation failure whilst compiling shim.cu:

/usr/local/cuda/include/cuda.h(53): catastrophic error: cannot open source file "stdlib.h"

1 catastrophic error detected in the compilation of "shim.cu".
Compilation terminated.

gmarkall (Member, Author)

@sklam Thanks for the info - since NVRTC uses headers from the CUDA installation, I think this is attempting to use older headers with a newer toolkit (I'd imagine this probably won't work at all with CUDA < 11). Should a note saying that at least CUDA 11.0 is required be added to the comments?

sklam (Member) commented Jun 24, 2022

@gmarkall, yes, let's leave a note in the comments about it, just in case people are confused by the error.

gmarkall (Member, Author)

@sklam Thanks, I've now added a relevant comment.

sklam (Member) left a review comment:

I don't have a machine for testing this right now; all my machines have an old CTK. But I trust @gmarkall to have fully tested this. The example will be great to have for the Numba 0.56 release.

sklam merged commit c423f54 into numba:master on Jun 27, 2022