From 62935d2499cf26255362b744f9a6517ce95b548b Mon Sep 17 00:00:00 2001
From: Christina Koutsou <74819775+kchristin22@users.noreply.github.com>
Date: Wed, 20 Nov 2024 23:45:45 +0200
Subject: [PATCH] Add doc page for usage of Clad with CUDA (#1144)

* Add usage of Clad with CUDA doc page

* Fix formatting of CUDA doc page

* Fix title underline in CUDA doc

* Remove Implementation details section for now
---
 docs/userDocs/source/index.rst                |  1 +
 .../source/user/UsingCladOnCUDACode.rst       | 64 +++++++++++++++++++
 2 files changed, 65 insertions(+)
 create mode 100644 docs/userDocs/source/user/UsingCladOnCUDACode.rst

diff --git a/docs/userDocs/source/index.rst b/docs/userDocs/source/index.rst
index 69da583e3..f5a7fb4a9 100644
--- a/docs/userDocs/source/index.rst
+++ b/docs/userDocs/source/index.rst
@@ -91,6 +91,7 @@ The User Guide
    user/UsingEnzymeWithinClad
    user/UsingVectorMode.rst
    user/UsingImmediateMode
+   user/UsingCladOnCUDACode
    user/FAQ
    user/DevelopersDocumentation
    user/IntroductionToClangForCladContributors
diff --git a/docs/userDocs/source/user/UsingCladOnCUDACode.rst b/docs/userDocs/source/user/UsingCladOnCUDACode.rst
new file mode 100644
index 000000000..bb079266a
--- /dev/null
+++ b/docs/userDocs/source/user/UsingCladOnCUDACode.rst
@@ -0,0 +1,64 @@
Automatically compute reverse-mode derivatives of CUDA functions with Clad
******************************************************************************

Clad offers the ability to differentiate CUDA functions in reverse mode.
Specifically, Clad can differentiate functions that are marked with either the
`__device__` or the `__global__` specifier.

Since kernels are void functions, the output parameter must be included in the
parameter list passed to `clad::gradient`. To execute the derived kernel, the
user has to call the `execute_kernel` method of the `CladFunction` object and
provide the grid configuration. If not explicitly specified, the size of the
shared memory to dynamically allocate for the kernel and the stream to use for
its execution default to `0` and `nullptr` respectively. Note that these two
arguments must be passed to the `execute_kernel` call either both or not at
all; Clad does not handle cases where only one of the two is provided, even if
it appears in the correct position.

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    // Given some kernel, e.g. __global__ void kernel(double* in, double* out):
    auto kernel_grad = clad::gradient(kernel, "in, out"); // compute the derivative of out w.r.t. in
    // Option 1: specify the shared memory size and the stream explicitly
    kernel_grad.execute_kernel(gridDim, blockDim, sharedMem, stream, in, out, in_grad, out_grad);
    // Option 2: fall back to the defaults of 0 and nullptr
    kernel_grad.execute_kernel(gridDim, blockDim, in, out, in_grad, out_grad);


CUDA features supported by Clad
================================================

Clad supports the following CUDA features:

* The commonly used CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, `gridDim` and `warpSize`
* The CUDA host functions `cudaMalloc`, `cudaMemcpy` and `cudaFree`

Both kinds of features appear in the end-to-end sketch in the Demos section below.

To use CUDA math functions, the user must define the equivalent pullback
function in Clad's CUDA custom derivatives:

.. code-block:: cpp

    // In `clad/include/clad/Differentiator/BuiltinDerivativesCUDA.cuh`

    namespace clad {

    namespace custom_derivatives {

    __device__ inline void __fdividef_pullback(float a, float b, float d_y,
                                               float* d_a, float* d_b) {
      *d_a += (1.F / b) * d_y;      // d(a/b)/da = 1/b
      *d_b += (-a / (b * b)) * d_y; // d(a/b)/db = -a/b^2
    }

    } // namespace custom_derivatives

    } // namespace clad
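With such a pullback in place, a kernel that calls `__fdividef` can be
differentiated like any other. Below is a minimal sketch; the kernel
`ratio_kernel` and its body are hypothetical names chosen for illustration,
not part of Clad's API:

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    // Hypothetical kernel that calls the CUDA math function __fdividef.
    __global__ void ratio_kernel(float* in, float* out) {
      unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
      out[idx] = __fdividef(in[idx], 4.0F);
    }

    // The generated pullback kernel calls
    // clad::custom_derivatives::__fdividef_pullback to propagate the
    // adjoints of out back to in.
    auto ratio_grad = clad::gradient(ratio_kernel, "in, out");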
CUDA features not yet supported by Clad
================================================

The following CUDA features are not yet supported:

* The use of shared memory within the original function
* Synchronization primitives like `__syncthreads()` and `cudaDeviceSynchronize()`
* Other CUDA host functions apart from those listed in the previous section


Demos
================================================

For examples of using Clad with CUDA, see the `clad/demos/CUDA` folder.
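As a taste of what those demos cover, here is a minimal end-to-end sketch that
differentiates a kernel and manages device memory with the supported host
functions. All names (`square_kernel`, `N`, the launch configuration) are
illustrative assumptions, not a prescribed Clad interface:

.. code-block:: cpp

    #include "clad/Differentiator/Differentiator.h"

    // Hypothetical kernel: squares each element of in into out.
    __global__ void square_kernel(double* in, double* out) {
      unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;
      out[idx] = in[idx] * in[idx];
    }

    int main() {
      constexpr unsigned N = 256;
      double host_in[N], host_in_grad[N], host_out_grad[N];
      for (unsigned i = 0; i < N; ++i) {
        host_in[i] = 1.0;       // input values
        host_in_grad[i] = 0.0;  // adjoints of in, accumulated by the pullback
        host_out_grad[i] = 1.0; // seed the adjoints of out with 1
      }

      // Allocate and initialize device buffers with the supported host functions.
      double *in, *out, *in_grad, *out_grad;
      cudaMalloc(&in, N * sizeof(double));
      cudaMalloc(&out, N * sizeof(double));
      cudaMalloc(&in_grad, N * sizeof(double));
      cudaMalloc(&out_grad, N * sizeof(double));
      cudaMemcpy(in, host_in, N * sizeof(double), cudaMemcpyHostToDevice);
      cudaMemcpy(in_grad, host_in_grad, N * sizeof(double), cudaMemcpyHostToDevice);
      cudaMemcpy(out_grad, host_out_grad, N * sizeof(double), cudaMemcpyHostToDevice);

      // Differentiate the kernel and run the derived kernel with the default
      // shared memory size and stream (4 blocks of 64 threads each).
      auto grad = clad::gradient(square_kernel, "in, out");
      grad.execute_kernel(dim3(N / 64), dim3(64), in, out, in_grad, out_grad);

      // Copy the gradient back: d(out[i])/d(in[i]) = 2 * in[i].
      cudaMemcpy(host_in_grad, in_grad, N * sizeof(double), cudaMemcpyDeviceToHost);

      cudaFree(in);
      cudaFree(out);
      cudaFree(in_grad);
      cudaFree(out_grad);
    }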