
Commit

Add changes for 6e27d0f
actions-user committed Apr 25, 2024
1 parent d1f76eb commit 4afc975
Showing 11 changed files with 57 additions and 51 deletions.
10 changes: 5 additions & 5 deletions _sources/examples/basic.rst.txt
Original file line number Diff line number Diff line change
@@ -30,8 +30,8 @@ Code Explanation
:lineno-start: 8

First, we need to define a ``KernelBuilder`` instance.
A ``KernelBuilder`` is essentially a `blueprint` that describes the information required to compile the CUDA kernel.
The constructor takes the name of the kernel function and the `.cu` file where the code is located.
A ``KernelBuilder`` is essentially a ``blueprint`` that describes the information required to compile the CUDA kernel.
The constructor takes the name of the kernel function and the ``.cu`` file where the code is located.
Optionally, we can also provide the kernel source as the third parameter.


@@ -40,15 +40,15 @@ Optionally, we can also provide the kernel source as the third parameter.
:lineno-start: 11

CUDA kernels often have tunable parameters that can impact their performance, such as block size, thread granularity, register usage, and the use of shared memory.
Here, we define two tunable parameters: the number of threads per blocks and the number of elements processed per thread.
Here, we define two tunable parameters: the number of threads per block and the number of elements processed per thread.



.. literalinclude:: basic.cpp
:lines: 15-16
:lineno-start: 15

The values returned by ``tune`` are placeholder objecs.
The values returned by ``tune`` are placeholder objects.
These objects can be combined using C++ operators to create new expression objects.
Note that ``elements_per_block`` does not actually contain a specific value;
instead, it is an abstract expression that, upon kernel instantiation, is evaluated as the product of ``threads_per_block`` and ``elements_per_thread``.
@@ -64,7 +64,7 @@ The following properties are supported:

* ``problem_size``: This is an N-dimensional vector that represents the size of the problem. In this case, it is one-dimensional and ``kl::arg0`` means that the size is specified as the first kernel argument (`argument 0`).
* ``block_size``: A triplet ``(x, y, z)`` representing the block dimensions.
* ``grid_divsor``: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.
* ``grid_divisor``: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.
* ``template_args``: This property specifies template arguments, which can be type names and integral values.
* ``define``: Define preprocessor constants.
* ``shared_memory``: Specify the amount of shared memory required, in bytes.
8 changes: 4 additions & 4 deletions _sources/examples/pragma.rst.txt
@@ -2,7 +2,7 @@ Pragma Kernels
===========================

In the previous examples, we demonstrated how a tunable kernel can be specified by defining a ``KernelBuilder`` instance in the host-side code.
While this API offers flexiblity, it can be cumbersome and requires keeping the kernel code in CUDA in sync with the host-side code in C++.
While this API offers flexibility, it can be cumbersome and requires keeping the kernel code in CUDA in sync with the host-side code in C++.

Kernel Launcher also provides a way to define kernel specifications directly in the CUDA code by using pragma directives to annotate the kernel code.
Although this method is less flexible than the ``KernelBuilder`` API, it is much more convenient and suitable for most CUDA kernels.
@@ -30,7 +30,7 @@ The kernel contains the following ``pragma`` directives:
:lineno-start: 1

The tune directives specify the tunable parameters: ``threads_per_block`` and ``items_per_thread``.
Since ``items_per_thread`` is also the name of the template parameter, so it is passed to the kernel as a compile-time constant via this parameter.
Since ``items_per_thread`` is also the name of the template parameter, it is passed to the kernel as a compile-time constant via this parameter.
The value of ``threads_per_block`` is not passed to the kernel but is used by subsequent pragmas.

.. literalinclude:: vector_add_annotated.cu
@@ -44,7 +44,7 @@ In this case, the constant ``items_per_block`` is defined as the product of ``th
:lines: 4-6
:lineno-start: 4

The ``problem_size`` directive defines the problem size (as discussed in as discussed in :doc:`basic`), ``block_size`` specifies the thread block size, and ``grid_divisor`` specifies how the problem size should be divided to obtain the thread grid size.
The ``problem_size`` directive defines the problem size (as discussed in :doc:`basic`), ``block_size`` specifies the thread block size, and ``grid_divisor`` specifies how the problem size should be divided to obtain the thread grid size.
Alternatively, ``grid_size`` can be used to specify the grid size directly.


@@ -67,7 +67,7 @@ In this example, the tuning key is ``"vector_add_" + T``, where ``T`` is the nam
Host Code
---------

The below code shows how to call the kernel from the host in C++::
The code below shows how to call the kernel from the host in C++::

#include "kernel_launcher/pragma.h"
namespace kl = kernel_launcher;
12 changes: 6 additions & 6 deletions _sources/examples/registry.rst.txt
@@ -7,11 +7,11 @@ Kernel Registry
.. The kernel registry essentially acts like a global cache of compiled kernels.
In the previous example, we saw how to use wisdom files by creating a ``WisdomKernel`` object.
This object will compile the kernel code on the first call and the keep the kernel loaded as long as the object exists.
This object will compile the kernel code on the first call and then keep the kernel loaded as long as the object exists.
Typically, one would define the ``WisdomKernel`` object as part of a class or as a global variable.

However, in certain scenarios, it is inconvenient or impractical to store ``WisdomKernel`` objects.
In these cases, it is possible to use the ``KernelRegistry``, that essentially acts like a global table of compiled kernel instances.
In these cases, it is possible to use the ``KernelRegistry``, which essentially acts like a global table of compiled kernel instances.


Source code
@@ -36,8 +36,8 @@ Defining a kernel descriptor
:lines: 6-43
:lineno-start: 6

This part of the code defines a ``IKernelDescriptor``:
a class that encapsulate the information required to compile a kernel.
This part of the code defines an ``IKernelDescriptor``:
a class that encapsulates the information required to compile a kernel.
This class should override two methods:

- ``build`` to instantiate a ``KernelBuilder``,
@@ -64,7 +64,7 @@ kernel is only compiled once and stored in the registry.
:lineno-start: 59

Alternatively, it is possible to use the above short-hand syntax.
This syntax also make it is easy to replace the element type ``float`` to some other type such as ``int``::
This syntax also makes it easy to replace the element type ``float`` with some other type such as ``int``::

kl::launch(VectorAddDescriptor::for_type<int>(), n, dev_C, dev_A, dev_B);

@@ -75,4 +75,4 @@ It is even possible to define a templated function that passes type ``T`` on to
kl::launch(VectorAddDescriptor::for_type<T>(), n, C, A, B);
}

Instead of using the global kernel registery, it is also possible to create local registry by creating a ``KernelRegistry`` instance.
Instead of using the global kernel registry, it is also possible to create a local registry by creating a ``KernelRegistry`` instance.
3 changes: 2 additions & 1 deletion _sources/examples/wisdom.rst.txt
@@ -6,6 +6,7 @@ Wisdom Files

In the previous example, we demonstrated how to compile a kernel by providing both a ``KernelBuilder`` instance (describing the `blueprint` for the kernel) and a ``Config`` instance (describing the configuration of the tunable parameters).


However, determining the optimal configuration can often be challenging, as it depends on both the problem size and the specific type of GPU being used.
To address this problem, Kernel Launcher provides a solution in the form of **wisdom files** (terminology borrowed from `FFTW <http://www.fftw.org/>`_).

@@ -86,7 +87,7 @@ To do so, we need to run the program with the environment variable ``KERNEL_LAUN
This generates a file called ``vector_add_1000000.json`` in the directory set by ``set_global_capture_directory``.

Alternatively, it is possible to capture several kernels at once by using the wildcard ``*``.
For example, the following command export all kernels that are start with ``vector_``::
For example, the following command exports all kernels that start with ``vector_``::

$ KERNEL_LAUNCHER_CAPTURE=vector_* ./main

21 changes: 12 additions & 9 deletions _sources/index.rst.txt
@@ -19,9 +19,9 @@ Kernel Launcher

.. image:: /logo.png
:width: 670
:alt: kernel launcher
:alt: Kernel Launcher logo

**Kernel Launcher** is a C++ library that makes it easy to dynamically compile *CUDA* kernels at runtime (using `NVRTC <https://docs.nvidia.com/cuda/nvrtc/index.html>`_) and launching them in a type-safe manner using C++ magic. There are two main reasons for using runtime compilation:
**Kernel Launcher** is a C++ library designed to dynamically compile *CUDA* kernels at runtime (using `NVRTC <https://docs.nvidia.com/cuda/nvrtc/index.html>`_) and to launch them in a type-safe manner using C++ magic. Runtime compilation offers two significant advantages:

* Kernels that have tunable parameters (block size, elements per thread, loop unroll factors, etc.) where the optimal configuration depends on dynamic factors such as the GPU type and problem size.

@@ -33,12 +33,14 @@ Kernel Tuner Integration

.. image:: /kernel_tuner_integration.png
:width: 670
:alt: kernel launcher integration
:alt: Kernel Launcher and Kernel Tuner integration


Kernel Launcher's tight integration with `Kernel Tuner <https://kerneltuner.github.io/>`_ results in highly-tuned kernels, as visualized above.
Kernel Launcher **captures** kernel launches within your application, which are then **tuned** by Kernel Tuner and saved as **wisdom** files.
These files are processed by Kernel Launcher during execution to **compile** the tuned kernel at runtime.
The tight integration of **Kernel Launcher** with `Kernel Tuner <https://kerneltuner.github.io/>`_ ensures that kernels are highly optimized, as illustrated in the image above.
Kernel Launcher can **capture** kernel launches within your application at runtime.
These captured kernels can then be **tuned** by Kernel Tuner, and the tuning results are saved as **wisdom** files.
These wisdom files are used by Kernel Launcher during execution to **compile** the tuned kernel at runtime.


See :doc:`examples/wisdom` for an example of how this works in practice.

@@ -48,21 +50,22 @@
Basic Example
=============

This sections hows a basic code example. See :ref:`example` for a more advance example.
This section presents a simple code example illustrating how to use the Kernel Launcher.
For a more detailed example, refer to :ref:`example`.

Consider the following CUDA kernel for vector addition.
This kernel has a template parameter ``T`` and a tunable parameter ``ELEMENTS_PER_THREAD``.

.. literalinclude:: examples/vector_add.cu


The following C++ snippet shows how to use *Kernel Launcher* in host code:
The following C++ snippet demonstrates how to use the Kernel Launcher in the host code:

.. literalinclude:: examples/index.cpp



Indices and tables
Indices and Tables
==================

* :ref:`genindex`
10 changes: 5 additions & 5 deletions examples/basic.html
@@ -172,21 +172,21 @@ <h2>Code Explanation<a class="headerlink" href="#code-explanation" title="Permal
</pre></div>
</div>
<p>First, we need to define a <code class="docutils literal notranslate"><span class="pre">KernelBuilder</span></code> instance.
A <code class="docutils literal notranslate"><span class="pre">KernelBuilder</span></code> is essentially a <cite>blueprint</cite> that describes the information required to compile the CUDA kernel.
The constructor takes the name of the kernel function and the <cite>.cu</cite> file where the code is located.
A <code class="docutils literal notranslate"><span class="pre">KernelBuilder</span></code> is essentially a <code class="docutils literal notranslate"><span class="pre">blueprint</span></code> that describes the information required to compile the CUDA kernel.
The constructor takes the name of the kernel function and the <code class="docutils literal notranslate"><span class="pre">.cu</span></code> file where the code is located.
Optionally, we can also provide the kernel source as the third parameter.</p>
<div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="linenos">11</span><span class="w"> </span><span class="c1">// Define tunable parameters </span>
<span class="linenos">12</span><span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">threads_per_block</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">builder</span><span class="p">.</span><span class="n">tune</span><span class="p">(</span><span class="s">&quot;block_size&quot;</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="mi">32</span><span class="p">,</span><span class="w"> </span><span class="mi">64</span><span class="p">,</span><span class="w"> </span><span class="mi">128</span><span class="p">,</span><span class="w"> </span><span class="mi">256</span><span class="p">,</span><span class="w"> </span><span class="mi">512</span><span class="p">,</span><span class="w"> </span><span class="mi">1024</span><span class="p">});</span>
<span class="linenos">13</span><span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">elements_per_thread</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">builder</span><span class="p">.</span><span class="n">tune</span><span class="p">(</span><span class="s">&quot;elements_per_thread&quot;</span><span class="p">,</span><span class="w"> </span><span class="p">{</span><span class="mi">1</span><span class="p">,</span><span class="w"> </span><span class="mi">2</span><span class="p">,</span><span class="w"> </span><span class="mi">4</span><span class="p">,</span><span class="w"> </span><span class="mi">8</span><span class="p">});</span>
</pre></div>
</div>
<p>CUDA kernels often have tunable parameters that can impact their performance, such as block size, thread granularity, register usage, and the use of shared memory.
Here, we define two tunable parameters: the number of threads per blocks and the number of elements processed per thread.</p>
Here, we define two tunable parameters: the number of threads per block and the number of elements processed per thread.</p>
<div class="highlight-c++ notranslate"><div class="highlight"><pre><span></span><span class="linenos">15</span><span class="w"> </span><span class="c1">// Define expressions</span>
<span class="linenos">16</span><span class="w"> </span><span class="k">auto</span><span class="w"> </span><span class="n">elements_per_block</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">threads_per_block</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">elements_per_thread</span><span class="p">;</span>
</pre></div>
</div>
<p>The values returned by <code class="docutils literal notranslate"><span class="pre">tune</span></code> are placeholder objecs.
<p>The values returned by <code class="docutils literal notranslate"><span class="pre">tune</span></code> are placeholder objects.
These objects can be combined using C++ operators to create new expression objects.
Note that <code class="docutils literal notranslate"><span class="pre">elements_per_block</span></code> does not actually contain a specific value;
instead, it is an abstract expression that, upon kernel instantiation, is evaluated as the product of <code class="docutils literal notranslate"><span class="pre">threads_per_block</span></code> and <code class="docutils literal notranslate"><span class="pre">elements_per_thread</span></code>.</p>
@@ -206,7 +206,7 @@ <h2>Code Explanation<a class="headerlink" href="#code-explanation" title="Permal
<ul class="simple">
<li><p><code class="docutils literal notranslate"><span class="pre">problem_size</span></code>: This is an N-dimensional vector that represents the size of the problem. In this case, it is one-dimensional and <code class="docutils literal notranslate"><span class="pre">kl::arg0</span></code> means that the size is specified as the first kernel argument (<cite>argument 0</cite>).</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">block_size</span></code>: A triplet <code class="docutils literal notranslate"><span class="pre">(x,</span> <span class="pre">y,</span> <span class="pre">z)</span></code> representing the block dimensions.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">grid_divsor</span></code>: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">grid_divisor</span></code>: This property is used to calculate the size of the grid (i.e., the number of blocks along each axis). For each kernel launch, the problem size is divided by the divisors to calculate the grid size. In other words, this property expresses the number of elements processed per thread block.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">template_args</span></code>: This property specifies template arguments, which can be type names and integral values.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">define</span></code>: Define preprocessor constants.</p></li>
<li><p><code class="docutils literal notranslate"><span class="pre">shared_memory</span></code>: Specify the amount of shared memory required, in bytes.</p></li>
