
Commit

Add changes for 6e27d0f
actions-user committed Apr 25, 2024
1 parent d1f76eb commit 4afc975
Showing 11 changed files with 57 additions and 51 deletions.
10 changes: 5 additions & 5 deletions _sources/examples/basic.rst.txt
Original file line number Diff line number Diff line change
@@ -30,8 +30,8 @@ Code Explanation
:lineno-start: 8

First, we need to define a ``KernelBuilder`` instance.
A ``KernelBuilder`` is essentially a `blueprint` that describes the information required to compile the CUDA kernel.
The constructor takes the name of the kernel function and the `.cu` file where the code is located.
A ``KernelBuilder`` is essentially a ``blueprint`` that describes the information required to compile the CUDA kernel.
The constructor takes the name of the kernel function and the ``.cu`` file where the code is located.
Optionally, we can also provide the kernel source as the third parameter.


@@ -40,15 +40,15 @@ Optionally, we can also provide the kernel source as the third parameter.
:lineno-start: 11

CUDA kernels often have tunable parameters that can impact their performance, such as block size, thread granularity, register usage, and the use of shared memory.
Here, we define two tunable parameters: the number of threads per blocks and the number of elements processed per thread.
Here, we define two tunable parameters: the number of threads per block and the number of elements processed per thread.



.. literalinclude:: basic.cpp
:lines: 15-16
:lineno-start: 15

The values returned by ``tune`` are placeholder objecs.
The values returned by ``tune`` are placeholder objects.
These objects can be combined using C++ operators to create new expression objects.
Note that ``elements_per_block`` does not actually contain a specific value;
instead, it is an abstract expression that, upon kernel instantiation, is evaluated as the product of ``threads_per_block`` and ``elements_per_thread``.
@@ -64,7 +64,7 @@ The following properties are supported:

* ``problem_size``: This is an N-dimensional vector that represents the size of the problem. In this case, it is one-dimensional and ``kl::arg0`` means that the size is specified as the first kernel argument (`argument 0`).
* ``block_size``: A triplet ``(x, y, z)`` representing the block dimensions.
* ``grid_divsor``: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.
* ``grid_divisor``: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.
* ``template_args``: This property specifies template arguments, which can be type names and integral values.
* ``define``: Define preprocessor constants.
* ``shared_memory``: Specify the amount of shared memory required, in bytes.
8 changes: 4 additions & 4 deletions _sources/examples/pragma.rst.txt
@@ -2,7 +2,7 @@ Pragma Kernels
===========================

In the previous examples, we demonstrated how a tunable kernel can be specified by defining a ``KernelBuilder`` instance in the host-side code.
While this API offers flexiblity, it can be cumbersome and requires keeping the kernel code in CUDA in sync with the host-side code in C++.
While this API offers flexibility, it can be cumbersome and requires keeping the kernel code in CUDA in sync with the host-side code in C++.

Kernel Launcher also provides a way to define kernel specifications directly in the CUDA code by using pragma directives to annotate the kernel code.
Although this method is less flexible than the ``KernelBuilder`` API, it is much more convenient and suitable for most CUDA kernels.
@@ -30,7 +30,7 @@ The kernel contains the following ``pragma`` directives:
:lineno-start: 1

The tune directives specify the tunable parameters: ``threads_per_block`` and ``items_per_thread``.
Since ``items_per_thread`` is also the name of the template parameter, so it is passed to the kernel as a compile-time constant via this parameter.
Since ``items_per_thread`` is also the name of the template parameter, it is passed to the kernel as a compile-time constant via this parameter.
The value of ``threads_per_block`` is not passed to the kernel but is used by subsequent pragmas.

.. literalinclude:: vector_add_annotated.cu
@@ -44,7 +44,7 @@ In this case, the constant ``items_per_block`` is defined as the product of ``th
:lines: 4-6
:lineno-start: 4

The ``problem_size`` directive defines the problem size (as discussed in as discussed in :doc:`basic`), ``block_size`` specifies the thread block size, and ``grid_divisor`` specifies how the problem size should be divided to obtain the thread grid size.
The ``problem_size`` directive defines the problem size (as discussed in :doc:`basic`), ``block_size`` specifies the thread block size, and ``grid_divisor`` specifies how the problem size should be divided to obtain the thread grid size.
Alternatively, ``grid_size`` can be used to specify the grid size directly.


@@ -67,7 +67,7 @@ In this example, the tuning key is ``"vector_add_" + T``, where ``T`` is the nam
Host Code
---------

The below code shows how to call the kernel from the host in C++::
The code below shows how to call the kernel from the host in C++::

#include "kernel_launcher/pragma.h"
namespace kl = kernel_launcher;
12 changes: 6 additions & 6 deletions _sources/examples/registry.rst.txt
@@ -7,11 +7,11 @@ Kernel Registry
.. The kernel registry essentially acts like a global cache of compiled kernels.
In the previous example, we saw how to use wisdom files by creating a ``WisdomKernel`` object.
This object will compile the kernel code on the first call and the keep the kernel loaded as long as the object exists.
This object will compile the kernel code on the first call and then keep the kernel loaded as long as the object exists.
Typically, one would define the ``WisdomKernel`` object as part of a class or as a global variable.

However, in certain scenarios, it is inconvenient or impractical to store ``WisdomKernel`` objects.
In these cases, it is possible to use the ``KernelRegistry``, that essentially acts like a global table of compiled kernel instances.
In these cases, it is possible to use the ``KernelRegistry``, which essentially acts like a global table of compiled kernel instances.


Source code
@@ -36,8 +36,8 @@ Defining a kernel descriptor
:lines: 6-43
:lineno-start: 6

This part of the code defines a ``IKernelDescriptor``:
a class that encapsulate the information required to compile a kernel.
This part of the code defines an ``IKernelDescriptor``:
a class that encapsulates the information required to compile a kernel.
This class should override two methods:

- ``build`` to instantiate a ``KernelBuilder``,
@@ -64,7 +64,7 @@ kernel is only compiled once and stored in the registry.
:lineno-start: 59

Alternatively, it is possible to use the above short-hand syntax.
This syntax also make it is easy to replace the element type ``float`` to some other type such as ``int``::
This syntax also makes it easy to replace the element type ``float`` with some other type such as ``int``::

kl::launch(VectorAddDescriptor::for_type<int>(), n, dev_C, dev_A, dev_B);

@@ -75,4 +75,4 @@ It is even possible to define a templated function that passes type ``T`` on to
kl::launch(VectorAddDescriptor::for_type<T>(), n, C, A, B);
}

Instead of using the global kernel registery, it is also possible to create local registry by creating a ``KernelRegistry`` instance.
Instead of using the global kernel registry, it is also possible to create a local registry by creating a ``KernelRegistry`` instance.
3 changes: 2 additions & 1 deletion _sources/examples/wisdom.rst.txt
@@ -6,6 +6,7 @@ Wisdom Files

In the previous example, we demonstrated how to compile a kernel by providing both a ``KernelBuilder`` instance (describing the `blueprint` for the kernel) and a ``Config`` instance (describing the configuration of the tunable parameters).


However, determining the optimal configuration can often be challenging, as it depends on both the problem size and the specific type of GPU being used.
To address this problem, Kernel Launcher provides a solution in the form of **wisdom files** (terminology borrowed from `FFTW <http://www.fftw.org/>`_).

@@ -86,7 +87,7 @@ To do so, we need to run the program with the environment variable ``KERNEL_LAUN
This generates a file called ``vector_add_1000000.json`` in the directory set by ``set_global_capture_directory``.

Alternatively, it is possible to capture several kernels at once by using the wildcard ``*``.
For example, the following command export all kernels that are start with ``vector_``::
For example, the following command exports all kernels that start with ``vector_``::

$ KERNEL_LAUNCHER_CAPTURE=vector_* ./main

21 changes: 12 additions & 9 deletions _sources/index.rst.txt
@@ -19,9 +19,9 @@ Kernel Launcher

.. image:: /logo.png
:width: 670
:alt: kernel launcher
:alt: Kernel Launcher logo

**Kernel Launcher** is a C++ library that makes it easy to dynamically compile *CUDA* kernels at runtime (using `NVRTC <https://docs.nvidia.com/cuda/nvrtc/index.html>`_) and launching them in a type-safe manner using C++ magic. There are two main reasons for using runtime compilation:
**Kernel Launcher** is a C++ library designed to dynamically compile *CUDA* kernels at runtime (using `NVRTC <https://docs.nvidia.com/cuda/nvrtc/index.html>`_) and to launch them in a type-safe manner using C++ magic. Runtime compilation offers two significant advantages:

* Kernels that have tunable parameters (block size, elements per thread, loop unroll factors, etc.) where the optimal configuration depends on dynamic factors such as the GPU type and problem size.

@@ -33,12 +33,14 @@ Kernel Tuner Integration

.. image:: /kernel_tuner_integration.png
:width: 670
:alt: kernel launcher integration
:alt: Kernel Launcher and Kernel Tuner integration


Kernel Launcher's tight integration with `Kernel Tuner <https://kerneltuner.github.io/>`_ results in highly-tuned kernels, as visualized above.
Kernel Launcher **captures** kernel launches within your application, which are then **tuned** by Kernel Tuner and saved as **wisdom** files.
These files are processed by Kernel Launcher during execution to **compile** the tuned kernel at runtime.
The tight integration of **Kernel Launcher** with `Kernel Tuner <https://kerneltuner.github.io/>`_ ensures that kernels are highly optimized, as illustrated in the image above.
Kernel Launcher can **capture** kernel launches within your application at runtime.
These captured kernels can then be **tuned** by Kernel Tuner, and the tuning results are saved as **wisdom** files.
These wisdom files are used by Kernel Launcher during execution to **compile** the tuned kernel at runtime.


See :doc:`examples/wisdom` for an example of how this works in practice.

@@ -48,21 +50,22 @@
Basic Example
=============

This sections hows a basic code example. See :ref:`example` for a more advance example.
This section presents a simple code example illustrating how to use the Kernel Launcher.
For a more detailed example, refer to :ref:`example`.

Consider the following CUDA kernel for vector addition.
This kernel has a template parameter ``T`` and a tunable parameter ``ELEMENTS_PER_THREAD``.

.. literalinclude:: examples/vector_add.cu


The following C++ snippet shows how to use *Kernel Launcher* in host code:
The following C++ snippet demonstrates how to use the Kernel Launcher in the host code:

.. literalinclude:: examples/index.cpp



Indices and tables
Indices and Tables
==================

* :ref:`genindex`
10 changes: 5 additions & 5 deletions examples/basic.html
@@ -172,21 +172,21 @@ <h2>Code Explanation<a class="headerlink" href="#code-explanation" title="Permal
</pre></div>
</div>
<p>First, we need to define a <code class="docutils literal notranslate"><span class="pre">KernelBuilder</span></code> instance.
A <code class="docutils literal notranslate"><span class="pre">KernelBuilder</span></code> is essentially a <cite>blueprint</cite> that describes the information required to compile the CUDA kernel.
The constructor takes the name of the kernel function and the <cite>.cu</cite> file where the code is located.
A <code class="docutils literal notranslate"><span class="pre">KernelBuilder</span></code> is essentially a <code class="docutils literal notranslate"><span class="pre">blueprint</span></code> that describes the information required to compile the CUDA kernel.
The constructor takes the name of the kernel function and the <code class="docutils literal notranslate"><span class="pre">.cu</span></code> file where the code is located.
Optionally, we can also provide the kernel source as the third parameter.</p>
<div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="linenos">11</span><span class="w"> </span><span class="c1">// Define tunable parameters </span>
<span class="linenos">12</span><span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">threads_per_block</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">builder</span><span class="p">.</span><span class="n">tune</span><span class="p">(</span><span class="s">&quot;block_size&quot;</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="mi">32</span><span class="p">,</span><span class="w"> </span><span class="mi">64</span><span class="p">,</span><span class="w"> </span><span class="mi">128</span><span class="p">,</span><span class="w"> </span><span class="mi">256</span><span class="p">,</span><span class="w"> </span><span class="mi">512</span><span class="p">,</span><span class="w"> </span><span class="mi">1024</span><span class="p">});</span>
<span class="linenos">13</span><span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">elements_per_thread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">builder</span><span class="p">.</span><span class="n">tune</span><span class="p">(</span><span class="s">&quot;elements_per_thread&quot;</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="mi">8</span><span class="p">});</span>
</pre></div>
</div>
<p>CUDA kernels often have tunable parameters that can impact their performance, such as block size, thread granularity, register usage, and the use of shared memory.
Here, we define two tunable parameters: the number of threads per blocks and the number of elements processed per thread.</p>
Here, we define two tunable parameters: the number of threads per block and the number of elements processed per thread.</p>
<div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="linenos">15</span><span class="w"> </span><span class="c1">// Define expressions</span>
<span class="linenos">16</span><span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">elements_per_block</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">threads_per_block</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">elements_per_thread</span><span class="p">;</span>
</pre></div>
</div>
<p>The values returned by <code class="docutils literal notranslate"><span class="pre">tune</span></code> are placeholder objecs.
<p>The values returned by <code class="docutils literal notranslate"><span class="pre">tune</span></code> are placeholder objects.
These objects can be combined using C++ operators to create new expression objects.
Note that <code class="docutils literal notranslate"><span class="pre">elements_per_block</span></code> does not actually contain a specific value;
instead, it is an abstract expression that, upon kernel instantiation, is evaluated as the product of <code class="docutils literal notranslate"><span class="pre">threads_per_block</span></code> and <code class="docutils literal notranslate"><span class="pre">elements_per_thread</span></code>.</p>
@@ -206,7 +206,7 @@ <h2>Code Explanation<a class="headerlink" href="#code-explanation" title="Permal
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">problem_size</span></code>: This is an N-dimensional vector that represents the size of the problem. In this case, it is one-dimensional and <code class="docutils literal notranslate"><span class="pre">kl::arg0</span></code> means that the size is specified as the first kernel argument (<cite>argument 0</cite>).</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">block_size</span></code>: A triplet <code class="docutils literal notranslate"><span class="pre">(x,</span> <span class="pre">y,</span> <span class="pre">z)</span></code> representing the block dimensions.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">grid_divsor</span></code>: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">grid_divisor</span></code>: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">template_args</span></code>: This property specifies template arguments, which can be type names and integral values.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">define</span></code>: Define preprocessor constants.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">shared_memory</span></code>: Specify the amount of shared memory required, in bytes.</p></li>
