From 62935d2499cf26255362b744f9a6517ce95b548b Mon Sep 17 00:00:00 2001
From: Christina Koutsou <74819775+kchristin22@users.noreply.github.com>
Date: Wed, 20 Nov 2024 23:45:45 +0200
Subject: [PATCH] Add doc page for usage of Clad with CUDA (#1144)

* Add usage of Clad with CUDA doc page

* Fix formatting of CUDA doc page

* Fix title underline in CUDA doc

* Remove Implementation details section for now
---
 docs/userDocs/source/index.rst                |  1 +
 .../source/user/UsingCladOnCUDACode.rst       | 64 +++++++++++++++++++
 2 files changed, 65 insertions(+)
 create mode 100644 docs/userDocs/source/user/UsingCladOnCUDACode.rst

diff --git a/docs/userDocs/source/index.rst b/docs/userDocs/source/index.rst
index 69da583e3..f5a7fb4a9 100644
--- a/docs/userDocs/source/index.rst
+++ b/docs/userDocs/source/index.rst
@@ -91,6 +91,7 @@ The User Guide
    user/UsingEnzymeWithinClad
    user/UsingVectorMode.rst
    user/UsingImmediateMode
+   user/UsingCladOnCUDACode
    user/FAQ
    user/DevelopersDocumentation
    user/IntroductionToClangForCladContributors
diff --git a/docs/userDocs/source/user/UsingCladOnCUDACode.rst b/docs/userDocs/source/user/UsingCladOnCUDACode.rst
new file mode 100644
index 000000000..bb079266a
--- /dev/null
+++ b/docs/userDocs/source/user/UsingCladOnCUDACode.rst
@@ -0,0 +1,64 @@
Automatically compute reverse-mode derivatives of CUDA functions with Clad
******************************************************************************

Clad offers the ability to differentiate CUDA functions in reverse mode.
Specifically, Clad can differentiate functions that are marked with either the
`__device__` or the `__global__` specifier.

Since kernels are void functions, the output parameter must be included in the
parameter list passed to `clad::gradient`. To execute the derived kernel, the
user has to call the `execute_kernel` method of the `CladFunction` object and
provide the grid configuration. If not explicitly specified, the size of the
shared memory to dynamically allocate for the kernel and the stream to use for
its execution default to `0` and `nullptr` respectively. Note that these two
arguments must be passed to the `execute_kernel` call either both or not at
all; Clad does not handle cases where only one of the two is provided, even if
it appears in the correct position.

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    // Given some kernel, e.g. __global__ void kernel(double* in, double* out):
    auto kernel_grad = clad::gradient(kernel, "in, out"); // compute the derivative of out w.r.t. in
    // Option 1: specify the shared memory size and the stream explicitly
    kernel_grad.execute_kernel(gridDim, blockDim, sharedMem, stream, in, out, in_grad, out_grad);
    // Option 2: fall back to the defaults of 0 and nullptr
    kernel_grad.execute_kernel(gridDim, blockDim, in, out, in_grad, out_grad);


CUDA features supported by Clad
================================================

Clad supports the following CUDA features:

* The commonly used CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, `gridDim` and `warpSize`
* The CUDA host functions `cudaMalloc`, `cudaMemcpy` and `cudaFree`

Both kinds of features appear in the end-to-end sketch in the Demos section below.

To use CUDA math functions, the user must define the equivalent pullback
function in Clad's CUDA custom derivatives:

.. code-block:: cpp

    // In `clad/include/clad/Differentiator/BuiltinDerivativesCUDA.cuh`

    namespace clad {

    namespace custom_derivatives {

    __device__ inline void __fdividef_pullback(float a, float b, float d_y,
                                               float* d_a, float* d_b) {
      *d_a += (1.F / b) * d_y;      // d(a/b)/da = 1/b
      *d_b += (-a / (b * b)) * d_y; // d(a/b)/db = -a/b^2
    }

    } // namespace custom_derivatives

    } // namespace clad
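With such a pullback in place, a kernel that calls `__fdividef` can be
differentiated like any other. Below is a minimal sketch; the kernel
`ratio_kernel` and its body are hypothetical names chosen for illustration,
not part of Clad's API:

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    // Hypothetical kernel that calls the CUDA math function __fdividef.
    __global__ void ratio_kernel(float* in, float* out) {
      unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
      out[idx] = __fdividef(in[idx], 4.0F);
    }

    // The generated pullback kernel calls
    // clad::custom_derivatives::__fdividef_pullback to propagate the
    // adjoints of out back to in.
    auto ratio_grad = clad::gradient(ratio_kernel, "in, out");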
CUDA features not yet supported by Clad
================================================

The following CUDA features are not yet supported:

* The use of shared memory within the original function
* Synchronization primitives like `__syncthreads()` and `cudaDeviceSynchronize()`
* Other CUDA host functions apart from those listed in the previous section


Demos
================================================

For examples of using Clad with CUDA, see the `clad/demos/CUDA` folder.
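As a taste of what those demos cover, here is a minimal end-to-end sketch that
differentiates a kernel and manages device memory with the supported host
functions. All names (`square_kernel`, `N`, the launch configuration) are
illustrative assumptions, not a prescribed Clad interface:

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    // Hypothetical kernel: squares each element of in into out.
    __global__ void square_kernel(double* in, double* out) {
      unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
      out[idx] = in[idx] * in[idx];
    }

    int main() {
      constexpr unsigned N = 256;
      double host_in[N], host_in_grad[N], host_out_grad[N];
      for (unsigned i = 0; i < N; ++i) {
        host_in[i] = 1.0;       // input values
        host_in_grad[i] = 0.0;  // adjoints of in, accumulated by the pullback
        host_out_grad[i] = 1.0; // seed the adjoints of out with 1
      }

      // Allocate and initialize device buffers with the supported host functions.
      double *in, *out, *in_grad, *out_grad;
      cudaMalloc(&in, N * sizeof(double));
      cudaMalloc(&out, N * sizeof(double));
      cudaMalloc(&in_grad, N * sizeof(double));
      cudaMalloc(&out_grad, N * sizeof(double));
      cudaMemcpy(in, host_in, N * sizeof(double), cudaMemcpyHostToDevice);
      cudaMemcpy(in_grad, host_in_grad, N * sizeof(double), cudaMemcpyHostToDevice);
      cudaMemcpy(out_grad, host_out_grad, N * sizeof(double), cudaMemcpyHostToDevice);

      // Differentiate the kernel and run the derived kernel with the default
      // shared memory size and stream (4 blocks of 64 threads each).
      auto grad = clad::gradient(square_kernel, "in, out");
      grad.execute_kernel(dim3(N / 64), dim3(64), in, out, in_grad, out_grad);

      // Copy the gradient back: d(out[i])/d(in[i]) = 2 * in[i].
      cudaMemcpy(host_in_grad, in_grad, N * sizeof(double), cudaMemcpyDeviceToHost);

      cudaFree(in);
      cudaFree(out);
      cudaFree(in_grad);
      cudaFree(out_grad);
    }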