Add example calling cuRAND #40
Conversation
This is great! I hope we can eventually bring this shim into the CUDA target itself and offer a well-supported path for using cuRAND.
Instead of putting logic in the shim file and calling it from relatively bare kernels in Python, the logic is now moved into Python. This required adding a Numba extension to support the curandState data type natively in Numba.
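For readers unfamiliar with Numba's extension API, registering an opaque device-side type looks roughly like the sketch below. All names here are illustrative assumptions, not the actual code from this PR:

```python
# Illustrative sketch only: registers a hypothetical CurandState type with
# Numba so that values of that type can flow through compiled kernels.
from numba import types
from numba.extending import models, register_model

class CurandState(types.Type):
    """Numba type representing an opaque curandState value."""
    def __init__(self):
        super().__init__(name='CurandState')

curand_state = CurandState()

# An opaque data model tells Numba to treat the state as an untyped blob
# whose layout only the cuRAND device library needs to understand.
@register_model(CurandState)
class CurandStateModel(models.OpaqueModel):
    pass
```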
@seibert the shim functions in the previous commit had some of the logic of the example in them - I've updated the shim so that it's just a shim layer that wraps each cuRAND function (well, the two that are supported for now), and moved all the logic into the Numba kernels. It would be good to move this into Numba eventually, ideally ASAP. One thing that makes me feel a little uncertain at the moment is the way we handle arrays of cuRAND states, because the C/C++ functions generally take a pointer to a given state, which is usually computed like `&states[i]`. Do you have any thoughts on the API, whether it seems suitable, or if there's a cleaner way to design / implement the API?
The xoroshiro RNG defines a custom dtype for the state of a given generator, which allows the parallel initialization function on the host to return a GPU array of RNG states of this dtype. Since you can't take a pointer to an individual state in Python, each function takes the whole array of states plus an index:

```python
def xoroshiro128p_next(states, index):
    '''Return the next random uint64 and advance the RNG in states[index].

    :type states: 1D array, dtype=xoroshiro128p_dtype
    :param states: array of RNG states
    :type index: int64
    :param index: offset in states to update
    :rtype: uint64
    '''
    index = int64(index)
    s0 = states[index]['s0']
    s1 = states[index]['s1']
    result = s0 + s1

    s1 ^= s0
    states[index]['s0'] = uint64(rotl(s0, uint32(55))) ^ s1 ^ (s1 << uint32(14))
    states[index]['s1'] = uint64(rotl(s1, uint32(36)))

    return result
```

I think this is probably the best option without confusing the user with non-standard Python semantics.
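As a concrete illustration of that array-plus-index convention (my example, not code from this PR), here is how the existing numba.cuda.random module is used in practice:

```python
import numpy as np
from numba import cuda
from numba.cuda.random import (create_xoroshiro128p_states,
                               xoroshiro128p_uniform_float32)

@cuda.jit
def fill_uniform(states, out):
    i = cuda.grid(1)
    if i < out.size:
        # Pass the state array and this thread's index, not a pointer.
        out[i] = xoroshiro128p_uniform_float32(states, i)

n = 256
states = create_xoroshiro128p_states(n, seed=1)  # GPU array of RNG states
out = cuda.device_array(n, dtype=np.float32)
fill_uniform[1, n](states, out)
print(out.copy_to_host()[:4])
```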
Ah, that's a great point - so I think I understand from your suggestion that I can just make a call that would have been `curand_init(seed, seq, 0, &states[i]);` in C as `curand_init(seed, seq, 0, states, i)` in Python. This also reduces the amount of work in implementing the extension, which I like 😄. Will update this PR accordingly.
Exactly. Numba already uses the same pattern in the API for CUDA atomic functions, again since there isn't a natural way to pass an address. Instead the user passes the device array and an offset into the array. This avoids having weird deviations from Python semantics for cuRAND states.
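For example (my illustration, not code from this PR), Numba's atomics take the array and an index where C would take a pointer:

```python
import numpy as np
from numba import cuda

@cuda.jit
def count_nonzero(flags, counter):
    i = cuda.grid(1)
    if i < flags.size and flags[i] != 0:
        # C would take &counter[0]; Numba takes the array and an index.
        cuda.atomic.add(counter, 0, 1)

flags = cuda.to_device(np.array([1, 0, 1, 1], dtype=np.int32))
counter = cuda.to_device(np.zeros(1, dtype=np.int32))
count_nonzero[1, 32](flags, counter)
print(counter.copy_to_host()[0])  # 3
```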
Code now updated - the one remaining thing I will have a look at is how to give users a way to copy a given state into local memory, which is an optimization used in the examples. But now that we're not deviating from Python / C/C++ API semantics, this should be straightforward to implement.
Now that numba/numba#7621 is merged, this should be ready to review (I can't set the label myself as I don't have permissions).
- Add instructions about the cuda-python binding, pointing to NUMBA_CUDA_USE_NVIDIA_BINDING.
- I got the following errors when running it:

```
/usr/local/cuda/include/curand_mrg32k3a.h(2138): error: A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode
/usr/local/cuda/include/curand_mrg32k3a.h(2721): error: A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode
/usr/local/cuda/include/curand_mrg32k3a.h(3367): error: A namespace scope variable without memory space annotations (__device__/__constant__/__shared__/__managed__) is considered a host variable, and host variables are not allowed in JIT mode. Consider using -default-device flag to process unannotated namespace scope variables as __device__ variables in JIT mode
/usr/local/cuda/include/cuda.h(53): catastrophic error: cannot open source file "stdlib.h"

1 catastrophic error detected in the compilation of "shim.cu".
Compilation terminated.
```
Maybe I am running on hardware and a toolkit that are too old.
@sklam What hardware / toolkit are you running with?
If CUDA Python is not found, then a message directing the user to the installation instructions is presented.
To address this, I've added a check for CUDA Python, and configuration of Numba to use the binding if it is present. If it's not present, the user is directed to install it.
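A check along these lines might look like the following sketch (my illustration of the described behaviour, not the actual code added to the example):

```python
# Illustrative sketch: detect the NVIDIA CUDA Python bindings and configure
# Numba to use them, directing the user to install them otherwise.
try:
    import cuda  # the cuda-python package
except ImportError:
    raise ImportError(
        "This example requires the NVIDIA CUDA Python bindings. See "
        "https://nvidia.github.io/cuda-python/ for installation instructions."
    )

from numba import config

# Equivalent to setting NUMBA_CUDA_USE_NVIDIA_BINDING=1 in the environment;
# this must happen before Numba's CUDA subsystem is initialized.
config.CUDA_USE_NVIDIA_BINDING = 1
```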
@sklam You have the CUDA 11.0 toolkit from the conda environment - is the version of the toolkit at /usr/local/cuda also 11.0, or is it something older?
Also, does the example work if you apply the following change to your Numba installation?

```diff
diff --git a/numba/cuda/cudadrv/driver.py b/numba/cuda/cudadrv/driver.py
index 3bd84978f..cc91124fb 100644
--- a/numba/cuda/cudadrv/driver.py
+++ b/numba/cuda/cudadrv/driver.py
@@ -2767,7 +2767,8 @@ class NvrtcProgram:
         # being optimized away.
         arch = f'--gpu-architecture=compute_{major}{minor}'.encode()
         include = f'-I{config.CUDA_INCLUDE_PATH}'.encode()
-        opts = [arch, include, b'-rdc', b'true']
+        opts = [arch, include, b'-rdc', b'true', b'-default-device',
+                b'-std=c++11']
         # Compile the program
         err, = nvrtc.nvrtcCompileProgram(self._program, len(opts), opts)
```
@sklam Thanks for the info - since nvRTC uses headers from the CUDA installation, I think this is attempting to use older headers with a newer toolkit (I'd imagine this probably won't work at all with CUDA < 11). Should a note saying that at least CUDA 11.0 is required be added to the comments?
@gmarkall, yes, let's leave a note in the comment about it, just in case people are confused by the error.
@sklam Thanks, I've now added a relevant comment.
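For readers hitting this error, a runtime guard along these lines could make the requirement explicit (a sketch assuming a runtime version check suffices; not the comment actually added to the example):

```python
from numba import cuda

# Illustrative guard: NVRTC compiles the shim against the local CUDA headers,
# so require a CUDA 11.0+ toolkit before running the example.
version = cuda.runtime.get_version()  # e.g. (11, 2)
if version < (11, 0):
    raise RuntimeError(
        f"This example requires CUDA 11.0 or later; found {version[0]}.{version[1]}"
    )
```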
I don't have a machine for testing this right now; all my machines have an old CTK. But I trust @gmarkall to have fully tested this. The example will be great to have for the Numba 0.56 release.
This is an example based on the cuRAND device API example from the cuRAND documentation.
This requires numba/numba#7621