Add doc page for usage of Clad with CUDA (vgvassilev#1144)
* Add usage of Clad with CUDA doc page

* Fix formatting of CUDA doc page

* Fix title underlining in CUDA doc

* Remove Implementation details section for now
kchristin22 authored Nov 20, 2024
1 parent 8d916fe commit 62935d2
Showing 2 changed files with 65 additions and 0 deletions.
1 change: 1 addition & 0 deletions docs/userDocs/source/index.rst
@@ -91,6 +91,7 @@ The User Guide
user/UsingEnzymeWithinClad
user/UsingVectorMode.rst
user/UsingImmediateMode
user/UsingCladOnCUDACode
user/FAQ
user/DevelopersDocumentation
user/IntroductionToClangForCladContributors
64 changes: 64 additions & 0 deletions docs/userDocs/source/user/UsingCladOnCUDACode.rst
@@ -0,0 +1,64 @@
Automatically compute reverse-mode derivatives of CUDA functions with Clad
******************************************************************************

Clad offers the ability to differentiate CUDA functions in reverse mode. Specifically,
Clad can differentiate functions that are marked with either the `__device__` or the `__global__` specifier.

Since kernels are void functions, the output parameter must be included in the parameter list passed to `clad::gradient`.
To execute the kernel, the user calls the `execute_kernel` method of the `CladFunction` object and provides the grid configuration.
If they are not explicitly specified, the size of the dynamically allocated shared memory and the stream to execute the kernel on default to `0` and `nullptr` respectively.
Note that these two arguments must be provided either together or not at all; Clad does not handle `execute_kernel` calls where only one of them is supplied, even if the order is correct.

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    auto kernel_grad = clad::gradient(kernel, "in, out"); // compute the derivative of out w.r.t. in
    // Option 1: provide the shared memory size and the stream explicitly.
    kernel_grad.execute_kernel(gridDim, blockDim, sharedMem, stream, in, out, in_grad, out_grad);
    // Option 2: omit both; they default to 0 and nullptr.
    kernel_grad.execute_kernel(gridDim, blockDim, in, out, in_grad, out_grad);
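
The snippet above assumes a kernel named `kernel`. The sketch below fills in a hypothetical kernel definition and shows both call forms in context; the kernel body, the grid configuration and the variable names are illustrative assumptions, not part of Clad's API.

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    // Hypothetical kernel: each thread doubles one element of `in` into `out`.
    __global__ void scale(double *in, double *out) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      out[i] = 2.0 * in[i];
    }

    void launch_gradient(double *in, double *out, double *in_grad,
                         double *out_grad, cudaStream_t stream) {
      auto scale_grad = clad::gradient(scale, "in, out");
      size_t shared_mem = 0; // no dynamic shared memory is needed here

      // Option 1: pass the shared memory size and the stream explicitly.
      scale_grad.execute_kernel(dim3(1), dim3(256), shared_mem, stream,
                                in, out, in_grad, out_grad);
      // Option 2: omit both; they default to 0 and nullptr.
      scale_grad.execute_kernel(dim3(1), dim3(256), in, out, in_grad, out_grad);
    }
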
CUDA features supported by Clad
================================================

Clad supports the following CUDA features:

* The commonly used CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, `gridDim` and `warpSize`
* The CUDA host functions `cudaMalloc`, `cudaMemcpy` and `cudaFree` (see the sketch after this list)
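
These host functions cover the typical setup around a differentiated kernel. The sketch below is a hypothetical end-to-end workflow, reusing the `scale` kernel from the earlier sketch; the single-element buffers and the seeding of the output adjoint with `1.0` are assumptions made for illustration.

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    // Assumes the hypothetical `scale` kernel defined in the earlier sketch.
    int main() {
      double *in, *out, *in_grad, *out_grad;
      cudaMalloc(&in, sizeof(double));
      cudaMalloc(&out, sizeof(double));
      cudaMalloc(&in_grad, sizeof(double));
      cudaMalloc(&out_grad, sizeof(double));

      double x = 3.0, seed = 1.0, zero = 0.0;
      cudaMemcpy(in, &x, sizeof(double), cudaMemcpyHostToDevice);
      cudaMemcpy(in_grad, &zero, sizeof(double), cudaMemcpyHostToDevice);
      // Seed the adjoint of the output so in_grad accumulates d(out)/d(in).
      cudaMemcpy(out_grad, &seed, sizeof(double), cudaMemcpyHostToDevice);

      auto scale_grad = clad::gradient(scale, "in, out");
      scale_grad.execute_kernel(dim3(1), dim3(1), in, out, in_grad, out_grad);

      double din = 0.0;
      // cudaMemcpy on the default stream synchronizes with the kernel launch.
      cudaMemcpy(&din, in_grad, sizeof(double), cudaMemcpyDeviceToHost);
      // For this kernel, din should now hold 2.0.

      cudaFree(in);
      cudaFree(out);
      cudaFree(in_grad);
      cudaFree(out_grad);
      return 0;
    }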

To use CUDA math functions, the user must define the equivalent pullback function in Clad's CUDA custom derivatives:

.. code-block:: cpp

    // In `clad/include/clad/Differentiator/BuiltinDerivativesCUDA.cuh`
    namespace clad {
    namespace custom_derivatives {
    __device__ inline void __fdividef_pullback(float a, float b, float d_y,
                                               float* d_a, float* d_b) {
      *d_a += (1.F / b) * d_y;
      *d_b += (-a / (b * b)) * d_y;
    }
    } // namespace custom_derivatives
    } // namespace clad
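
Once such a pullback is defined, a kernel that calls the corresponding math function can be differentiated like any other. A hypothetical example:

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    // Hypothetical kernel that calls the CUDA math function __fdividef.
    __global__ void ratio(float *in, float *out) {
      out[0] = __fdividef(in[0], 4.0f);
    }

    void build_gradient() {
      // The pullback above lets Clad differentiate the __fdividef call.
      auto ratio_grad = clad::gradient(ratio, "in, out");
      // Launch as before:
      // ratio_grad.execute_kernel(gridDim, blockDim, in, out, in_grad, out_grad);
    }
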
CUDA features not yet supported by Clad
================================================

The following CUDA features are not yet supported:

* The use of shared memory within the original function
* Synchronization primitives like `__syncthreads()` and `cudaDeviceSynchronize()`
* Other CUDA host functions apart from those listed in the previous section


Demos
================================================

For examples of using Clad with CUDA, see the `clad/demos/CUDA` folder.
