diff --git a/docs/userDocs/source/index.rst b/docs/userDocs/source/index.rst
index 69da583e3..f5a7fb4a9 100644
--- a/docs/userDocs/source/index.rst
+++ b/docs/userDocs/source/index.rst
@@ -91,6 +91,7 @@ The User Guide
    user/UsingEnzymeWithinClad
    user/UsingVectorMode.rst
    user/UsingImmediateMode
+   user/UsingCladOnCUDACode
    user/FAQ
    user/DevelopersDocumentation
    user/IntroductionToClangForCladContributors
diff --git a/docs/userDocs/source/user/UsingCladOnCUDACode.rst b/docs/userDocs/source/user/UsingCladOnCUDACode.rst
new file mode 100644
index 000000000..49372991d
--- /dev/null
+++ b/docs/userDocs/source/user/UsingCladOnCUDACode.rst
@@ -0,0 +1,67 @@
+Automatically compute reverse-mode derivatives of CUDA functions with Clad
+**************************************************************************
+
+Clad offers the ability to differentiate CUDA functions in reverse mode.
+Specifically, Clad can differentiate functions that are marked with either the
+`__device__` or the `__global__` specifier.
+
+Since kernels are `void` functions, the output parameter must be included in
+the parameter list passed to `clad::gradient`. To execute the kernel, the user
+has to call the `execute_kernel` method of the `CladFunction` object and
+provide the grid configuration. The size of the shared memory to dynamically
+allocate for the kernel and the stream to use for its execution default to `0`
+and `nullptr` respectively, if not explicitly specified. Note that these two
+arguments must be provided either together or not at all; Clad does not handle
+calls where only one of them is supplied, even if it appears in the correct
+position.
+
+.. code-block:: cpp
+
+   #include "clad/Differentiator/Differentiator.h"
+
+   auto kernel_grad = clad::gradient(kernel, "in, out"); // compute the derivative of out w.r.t. in
+   // Option 1: specify the shared memory size and the stream explicitly
+   kernel_grad.execute_kernel(gridDim, blockDim, sharedMem, stream, in, out, in_grad, out_grad);
+   // Option 2: use the default values of 0 and nullptr
+   kernel_grad.execute_kernel(gridDim, blockDim, in, out, in_grad, out_grad);
+
+
+CUDA features supported by Clad
+================================================
+
+Clad supports the following CUDA features:
+
+* The commonly used CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, `gridDim` and `warpSize`
+* The CUDA host functions `cudaMalloc`, `cudaMemcpy` and `cudaFree`
+
+To use CUDA math functions, the user must define the equivalent pullback
+function in Clad's CUDA custom derivatives:
+
+.. code-block:: cpp
+
+   // In `clad/include/clad/Differentiator/BuiltinDerivativesCUDA.cuh`
+
+   namespace clad {
+
+   namespace custom_derivatives {
+
+   __device__ inline void __fdividef_pullback(float a, float b, float d_y,
+                                              float* d_a, float* d_b) {
+     *d_a += (1.F / b) * d_y;
+     *d_b += (-a / (b * b)) * d_y;
+   }
+
+   } // namespace custom_derivatives
+   } // namespace clad
+
+
+CUDA features not yet supported by Clad
+================================================
+
+The following CUDA features are not yet supported:
+
+* The use of shared memory within the original function
+* Synchronization primitives like `__syncthreads()` and `cudaDeviceSynchronize()`
+* CUDA host functions other than those listed in the previous section
+
+
+Implementation details
+================================================
+
+For a brief overview of the implementation, please refer to the team's related
+`blog post `_.
+
+
+Demos
+================================================
+
+For examples of using Clad with CUDA, see the `clad/demos/CUDA` folder.
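+
+The condensed sketch below is not one of the demos; it only puts the calls
+from the sections above together into a minimal end-to-end program. The
+kernel, the buffer sizes and the seeding of `out_grad` with ones are
+illustrative assumptions chosen for brevity:
+
+.. code-block:: cpp
+
+   #include "clad/Differentiator/Differentiator.h"
+
+   // Toy kernel: out[i] = in[i] * in[i]. Illustrative, not part of the demos.
+   __global__ void square_kernel(double *in, double *out) {
+     int i = blockIdx.x * blockDim.x + threadIdx.x;
+     out[i] = in[i] * in[i];
+   }
+
+   int main() {
+     constexpr int N = 8;
+     double host_in[N], host_in_grad[N], ones[N];
+     for (int i = 0; i < N; ++i) {
+       host_in[i] = i;
+       ones[i] = 1.0; // one seed per output element
+     }
+
+     double *in, *out, *in_grad, *out_grad;
+     cudaMalloc(&in, N * sizeof(double));
+     cudaMalloc(&out, N * sizeof(double));
+     cudaMalloc(&in_grad, N * sizeof(double));
+     cudaMalloc(&out_grad, N * sizeof(double));
+     cudaMemcpy(in, host_in, N * sizeof(double), cudaMemcpyHostToDevice);
+     cudaMemset(out, 0, N * sizeof(double));
+     cudaMemset(in_grad, 0, N * sizeof(double));
+     cudaMemcpy(out_grad, ones, N * sizeof(double), cudaMemcpyHostToDevice);
+
+     // Differentiate out w.r.t. in and launch the derived kernel with the
+     // default shared memory size and stream (Option 2 above).
+     auto kernel_grad = clad::gradient(square_kernel, "in, out");
+     kernel_grad.execute_kernel(dim3(1), dim3(N), in, out, in_grad, out_grad);
+
+     cudaMemcpy(host_in_grad, in_grad, N * sizeof(double),
+                cudaMemcpyDeviceToHost);
+     // host_in_grad[i] now holds d(out[i])/d(in[i]) = 2 * in[i].
+
+     cudaFree(in);
+     cudaFree(out);
+     cudaFree(in_grad);
+     cudaFree(out_grad);
+     return 0;
+   }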
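+
+Note that, as the `__fdividef_pullback` definition above suggests, the
+generated pullbacks accumulate into the derivative buffers with `+=`; the
+adjoint buffer of the input (`in_grad`) should therefore be zero-initialized
+before the call, while the adjoint of the output (`out_grad`) typically acts
+as the seed of the differentiation.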