- Introduction
- OpenCL-C asynchronous copy primitives
- Step 0: Reference GPU-friendly kernel
- Step 1: Explicit tiling
- Step 2: Step-1 + `__local` on input
- Step 3: Step-2 + `__local` on output
- Step 4: Step-3 + double-buffering overlapping
- Step 5: Step-4 + manual loop unroll
- Step 6: Step-5 + Fast-inverse-square-root
- Step 7: Step-5 + Native-C kernel + `-ffast-math`
- Final results
- References
This is a sandbox project aiming to demonstrate, step by step, optimizations of stencil and image-processing algorithms on clustered many-core architectures fitted with local memories and DMA capabilities, such as the Kalray MPPA many-core processor. OpenCL-C is used as the programming language, and the 3x3 Sobel filter as the computation kernel.
This example walks from a first GPU-friendly reference kernel through increasingly advanced optimizations. The goal is to highlight key methods for obtaining high performance and to demonstrate each of them on the Kalray MPPA architecture. The guiding principles are:
- Keep it as simple as possible
- Focus on the data-transfer and processing part on Accelerator (OpenCL kernels in *.cl)
- Optimize performance as much as possible, as long as it remains easy to understand
- Insightful comments and clear coding style
- Process some real images, like a hands-on demo
├── host.c
├── images
│ ├── curious_cat.png
│ └── valve_original.png
├── Makefile
├── ocl_utils.c
├── ocl_utils.h
├── png_utils.c
├── png_utils.h
├── sobel.cl
├── sobel_compute_block.c
├── sobel_config.h
└── README.md
The most important files are `sobel.cl` and `host.c`. The OpenCL kernels in `sobel.cl` are structured in several chronological/incremental steps.
This README is a short summary of these optimization steps. The reader is strongly recommended to follow those kernels in `sobel.cl` step by step. From step 4 on, the reader should also refer to `host.c` for the execution deployment parameters (`clEnqueueNDRangeKernel()`).
Requirements:
- Kalray AccessCore Embedded (ACE) SDK >= 4.5.0
- `libpng` >= 1.6
- `imagemagick` and its provided `display` command
Example install command on Ubuntu 18.04:
sudo apt install libpng16-16 imagemagick
Run:
$ make run_hw [INPUT_IMAGE_PNG=/.../.../image.png] [OUTPUT_DISPLAY=0|1] \
[HAVE_FAST_MATH=0|1]
# INPUT_IMAGE_PNG : Path to user PNG image (default = /images/valve_original.png)
# OUTPUT_DISPLAY : Enable/Disable display of all output images (default = 1)
# HAVE_FAST_MATH : Enable/Disable -ffast-math in native C kernels (default = 1)
The Sobel operator is widely known and used, among others, as an edge-detection algorithm. For each output pixel (y, x), the horizontal and vertical derivatives G_x and G_y are computed as follows (source: Wikipedia):

          [ -1  0  +1 ]               [ -1  -2  -1 ]
    G_x = [ -2  0  +2 ] * A     G_y = [  0   0   0 ] * A
          [ -1  0  +1 ]               [ +1  +2  +1 ]

in which `*` denotes the image-processing convolution and `A` is a 3x3 sub-view of the input image around the pixel (y, x):
+------+------+------+
| | | |
+------+------+------+
A = | |(y, x)| |
+------+------+------+
| | | |
+------+------+------+
Then the output pixel is computed from the gradient magnitude:

    out(y, x) = sqrt(G_x^2 + G_y^2)

(scaled by `scale` and saturated to `uchar` in this implementation).
To keep the boundary condition and the different optimization steps simple, we implement in this project a "shifted" Sobel operator: the pixel computed at (y, x) is written to index (y-1, x-1) instead of (y, x). This produces an output image "shifted" by one pixel towards the top-left.
+------+------+------+ +---------+--------+--------+
| | | | |(y-1,x-1)| | |
+------+------+------+ write to +---------+--------+--------+
| |(y, x)| | =========> | | | |
+------+------+------+ +---------+--------+--------+
| | | | | | | |
+------+------+------+ +---------+--------+--------+
The `BORDER_REPLICATE` boundary condition is employed for out-of-bound pixels, with direct `min()` or `clamp()` on neighbor indices, without explicit padding of the input image.
The OpenCL 1.2 async primitives and more recent OpenCL 3.0 DMA extension are supported by the Kalray OpenCL driver (>= ACE-4.5.0).
Besides supporting above-mentioned OpenCL 1.2 and 3.0 asynchronous primitives, Kalray also added its own async-copy extension in order to reduce the programming effort when optimizing image processing and other stencil kernels on embedded platforms such as MPPA.
The motivation for this Kalray async-copy extension is reported in [1]. Further fundamental rationale and technical implementation details of these DMA-based primitives were also studied in [2], [3] and [4].
In short, the Kalray block-2D2D async-copy extension is illustrated in the figure below, in which a generic 2D-copy is characterized by:
- the begin addresses of the `__global` image and the `__local` buffer
- an extent block of `height x width` elements to be copied
- the read-coordinates and dimensions of the `__global` image
- the write-coordinates and dimensions of the `__local` buffer
An element may contain one or several `GENTYPE`s, in the same way that a pixel may have one (gray), three (YUV or RGB) or four (RGBA) components. The developer builds a tiling algorithm and reasons in terms of pixels, not bytes. Only when calling `async_copy_*()` is a `num_gentype_per_elem` passed to specify the pixel composition (1, 3, 4 or even 7, 9, 13, etc.).
These Kalray 2D/3D primitives always take the original data pointers as arguments (`__global` image, `__local` buffer), gather all the parameters above and automatically perform the byte-size calculations and address offsets under the hood. The developer no longer needs to carry out this error-prone process by hand.
/**
* async_work_group_copy_block_2D2D between global and local
*
* @param dst dst pointer
* @param src src pointer
* @param num_gentype_per_elem Number of gentypes in an element of the 2D block
* @param block_size Dimension in elements of the 2D block {width, height}.
* Total copied gentypes is
* (num_gentype_per_elem * block_size.x * block_size.y)
* @param local_point Local copy indexes and buffer size
* {xpos, ypos, xdim, ydim}
* @param global_point Global copy indexes and buffer size
* {xpos, ypos, xdim, ydim}
* @param event event associated to the transfer
* @return event associated to the transfer
*/
event_t async_work_group_copy_block_2D2D(__local GENTYPE *dst,
const __global GENTYPE *src,
size_t num_gentype_per_elem,
int2 block_size,
int4 local_point,
int4 global_point,
event_t event);
event_t async_work_group_copy_block_2D2D(__global GENTYPE *dst,
const __local GENTYPE *src,
size_t num_gentype_per_elem,
int2 block_size,
int4 local_point,
int4 global_point,
event_t event);
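As an illustration, below is a minimal usage sketch (not taken from `sobel.cl`; names such as `tile_x`/`tile_y` and the tile-size macros are assumed to be defined by the caller). It copies one gray-scale input tile of `(TILE_HEIGHT + HALO_SIZE) x (TILE_WIDTH + HALO_SIZE)` pixels from the `__global` image into a `__local` buffer; boundary handling at the image borders is omitted:

// Minimal usage sketch (hypothetical names): copy one gray-scale input tile,
// including its halo, from the __global image into a __local buffer.
__local uchar block_in_local[(TILE_HEIGHT + HALO_SIZE) * (TILE_WIDTH + HALO_SIZE)];

event_t event = async_work_group_copy_block_2D2D(
        block_in_local,                           // __local destination buffer
        image_in,                                 // __global source image
        1,                                        // num_gentype_per_elem: 1 uchar per gray pixel
        (int2)(TILE_WIDTH + HALO_SIZE,            // block width in elements
               TILE_HEIGHT + HALO_SIZE),          // block height in elements
        (int4)(0, 0,                              // write position in the __local buffer
               TILE_WIDTH + HALO_SIZE,            // __local buffer width
               TILE_HEIGHT + HALO_SIZE),          // __local buffer height
        (int4)(tile_x, tile_y,                    // read position in the __global image
               image_width, image_height),        // __global image dimensions
        0);                                       // 0: let the call create a new event
wait_group_events(1, &event);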
From now on, you can either open `sobel.cl` for a deep dive into the code, or keep reading until the end for a quick overview of the optimization techniques. Some details in this README have been intentionally omitted for the sake of readability. Those missing points can, of course, be found in `sobel.cl` (and `host.c`).
This first step is implemented in the well-known data-parallel paradigm, designed to run efficiently on GPUs with the SIMT execution model. This kernel serves as the reference and performance baseline on the Kalray MPPA accelerator.
Host configuration: (one work-item per output pixel)
clGetDeviceInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE, ..., &max_workgroup_size);
struct kernel_desc_s kernel_desc[] = {
{
.name = "sobel_step_0",
.globalSize = {ALIGN_MULT_UP(image_width, max_workgroup_size), image_height},
.localSize = {max_workgroup_size, 1},
},
};
kernel_desc[i].ocl_kernel = clCreateKernel(program, kernel_desc[i].name, &err);
...
clEnqueueNDRangeKernel(queue, kernel_desc[i].ocl_kernel, 2, NULL,
kernel_desc[i].globalSize, kernel_desc[i].localSize, ...);
OpenCL kernel:
// Boundary condition
#define IMAGE_INDEX(y, x) \
((min(y, image_height-1) * image_width) + min(x, image_width-1))
__kernel void sobel_step_0(__global uchar *image_in, __global uchar *image_out,
int image_width, int image_height, float scale)
{
int x = get_global_id(0);
int y = get_global_id(1);
if (x >= image_width || y >= image_height) { return; }
// load neighbors
float c0 = image_in[IMAGE_INDEX(y+0, x+0)];
float c1 = image_in[IMAGE_INDEX(y+0, x+1)];
float c2 = image_in[IMAGE_INDEX(y+0, x+2)];
float n0 = image_in[IMAGE_INDEX(y+1, x+0)];
float n2 = image_in[IMAGE_INDEX(y+1, x+2)];
float t0 = image_in[IMAGE_INDEX(y+2, x+0)];
float t1 = image_in[IMAGE_INDEX(y+2, x+1)];
float t2 = image_in[IMAGE_INDEX(y+2, x+2)];
// compute
float magx = mad(2.0f, (n2 - n0), (c2 - c0 + t2 - t0));
float magy = mad(2.0f, (t1 - c1), (t0 - c0 + t2 - c2));
float mag = hypot(magx, magy) * scale;
// store pixel
image_out[IMAGE_INDEX(y, x)] = convert_uchar_sat(mag);
}
Explicit tiling is different from so-called implicit tiling.
Implicit tiling is done by the OpenCL mapping of input pixels onto two-dimensional workgroups. This model usually maps one pixel to one work-item (cf. Step-0), and the tile size (aka the workgroup size in this case) is determined by the `global_work_size[]` and `local_work_size[]` arguments of `clEnqueueNDRangeKernel()`.
Explicit tiling, on the other hand, does not rely on the workgroup shape to perform tiling, but on user-defined `TILE_HEIGHT x TILE_WIDTH` tiles. The OpenCL kernel may need to be modified as well to handle arbitrary image sizes, regardless of the number of work-items or the workgroup topology.
// NOTE: "Shifted Sobel filter" whose the output tile is shifted back to
// top-left by one pixel, yielding the below stencil block with HALO_SIZE == 2:
//
// |<------ TILE_WIDTH ------->| HALO_SIZE (== 2)
// -----------+---------------------------+----+
// ʌ | | |
// | | | |
// | | | |
// | | | |
// TILE_HEIGHT | Tile | |
// | | | |
// | | | |
// v | | |
// -----------+---------------------------+ |
// HALO_SIZE | |
// +--------------------------------+
The Host configuration now spawns one workgroup per tile:
struct kernel_desc_s kernel_desc[] = {
...
{
.name = "sobel_step_1",
.globalSize = {(ceil(((double)image_width) / TILE_WIDTH)) * max_workgroup_size,
(ceil(((double)image_height) / TILE_HEIGHT)) * 1},
.localSize = {max_workgroup_size, 1},
},
};
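As a worked example (with purely illustrative tile sizes; see `sobel_config.h` for the actual configuration), choosing TILE_WIDTH = 512 and TILE_HEIGHT = 16 would split a 3840x2160 image into ceil(3840/512) x ceil(2160/16) = 8 x 135 = 1080 tiles, i.e. 1080 workgroups.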
For the sake of abstraction, a `sobel_compute_block()` function is introduced for each step that performs tiling. This separates (1) data-transfer optimization from (2) arithmetic computation, which are two different tracks, though both are tied to the final performance.
static void sobel_compute_block_step_1(__global uchar *block_in,
__global uchar *block_out,
int block_row_stride,
int block_width, int block_height,
int block_width_halo, int block_height_halo,
float scale)
{
// dispatch rows of block block_height x block_width on workitems
const int irow_begin = ...;
const int irow_end = ...;
...
for (int irow = irow_begin; irow < irow_end; irow++)
{
for (int icol = 0; icol < block_width; icol++)
{
// load neighbors
...
// compute
...
// store pixel
...
}
}
}
__kernel void sobel_step_1(__global uchar *image_in, __global uchar *image_out,
int image_width, int image_height, float scale)
{
const int group_idx = get_group_id(0);
const int group_idy = get_group_id(1);
// top-left pixel coordinates of the tile handled by this workgroup
const int block_idx = group_idx * TILE_WIDTH;
const int block_idy = group_idy * TILE_HEIGHT;
const ulong block_offset = ((ulong)block_idy * image_width) + block_idx;
__global uchar *block_in = image_in + block_offset;
__global uchar *block_out = image_out + block_offset;
...
sobel_compute_block_step_1(block_in, block_out, ...);
}
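The row dispatch elided in `sobel_compute_block_step_1()` above can, for instance, split the block rows evenly across the work-items of the workgroup (a possible sketch, not necessarily the exact scheme used in `sobel.cl`):

// dispatch rows of the block evenly on the work-items of the workgroup
const int num_wi      = (int)get_local_size(0);
const int wi_id       = (int)get_local_id(0);
const int rows_per_wi = (block_height + num_wi - 1) / num_wi;   // ceiling division
const int irow_begin  = min(wi_id * rows_per_wi, block_height);
const int irow_end    = min(irow_begin + rows_per_wi, block_height);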
This step introduces the first use of `__local` memory on the MPPA device, as a replacement for DRAM `__global` memory, combined with the DMA-backed asynchronous copy primitives (OpenCL 1.2 specification and Kalray extension).
__kernel void sobel_step_2(__global uchar *image_in, __global uchar *image_out,
int image_width, int image_height, float scale)
{
__local uchar block_in_local[(TILE_HEIGHT + HALO_SIZE) * (TILE_WIDTH + HALO_SIZE)];
...
// ------------------------------------------------------------
// Before computing, copy data to __local
// (block_height_halo x block_width_halo)
// ------------------------------------------------------------
event = async_work_group_copy_block_2D2D(block_in_local, image_in, ...);
wait_group_events(1, &event);
// ------------------------------------------------------------
// Compute (same as sobel_compute_block_step_1() but with
// block_in_local[])
// ------------------------------------------------------------
sobel_compute_block_step_2(block_in_local, block_out,
block_in_row_stride, block_out_row_stride,
block_width, block_height,
block_width_halo, block_height_halo,
scale);
}
Same as Step-2, with an additional `__local` buffer for storing the output tile before sending it back to `__global` memory.
__kernel void sobel_step_3(__global uchar *image_in, __global uchar *image_out,
int image_width, int image_height, float scale)
{
__local uchar block_in_local [(TILE_HEIGHT + HALO_SIZE) * (TILE_WIDTH + HALO_SIZE)];
__local uchar block_out_local[ TILE_HEIGHT * TILE_WIDTH ];
...
// ------------------------------------------------------------
// Before computing, copy data to __local
// (block_height_halo x block_width_halo)
// ------------------------------------------------------------
event = async_work_group_copy_block_2D2D(block_in_local, image_in, ...);
wait_group_events(1, &event);
// ------------------------------------------------------------
// Compute (same as sobel_compute_block_step_1() but with
// block_in_local[])
// ------------------------------------------------------------
sobel_compute_block_step_3(block_in_local, block_out_local, ...);
// ------------------------------------------------------------
// After computing, send result to __global
// (block_height x block_width)
// ------------------------------------------------------------
event = async_work_group_copy_block_2D2D(image_out, block_out_local, ...);
wait_group_events(1, &event);
}
Optimize data transfer between `__global` and `__local` with double-buffering and overlapping. This is the ultimate optimization step on data transfer.
Unlike the previous steps 1/2/3, in which each workgroup computes one tile and there are as many workgroups as tiles in the image, this step spawns only 5 workgroups (`CL_DEVICE_MAX_COMPUTE_UNITS`). Each workgroup is in charge of multiple tiles and overlaps transfers with computation through double-buffering and async-copy.
The reader is invited to refer to the state-of-the-art studies on this subject (e.g. [3], [4] and many others).
Host configuration:
clGetDeviceInfo(CL_DEVICE_MAX_COMPUTE_UNITS, ..., &max_compute_units);
struct kernel_desc_s kernel_desc[] = {
...
{
.name = "sobel_step_4",
.globalSize = {max_workgroup_size * max_compute_units, 1},
.localSize = {max_workgroup_size, 1},
},
};
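For reference, a simplified kernel-side sketch of the double-buffered loop is given below. The tile bookkeeping (`first_tile`, `last_tile`) and the elided copy parameters are purely illustrative; the complete implementation lives in `sobel.cl`:

// Hypothetical double-buffering sketch: while the current tile is being
// computed from __local memory, the DMA already fetches the next one.
__local uchar in_local [2][(TILE_HEIGHT + HALO_SIZE) * (TILE_WIDTH + HALO_SIZE)];
__local uchar out_local[2][ TILE_HEIGHT * TILE_WIDTH ];

event_t ev_in[2];

// Prefetch the first tile assigned to this workgroup
ev_in[first_tile & 1] =
    async_work_group_copy_block_2D2D(in_local[first_tile & 1], image_in,
                                     ... /* parameters of tile 'first_tile' */);

for (int tile = first_tile; tile < last_tile; tile++)
{
    const int cur = tile & 1;
    const int nxt = (tile + 1) & 1;

    // Start fetching the NEXT input tile while the CURRENT one is processed
    if (tile + 1 < last_tile) {
        ev_in[nxt] = async_work_group_copy_block_2D2D(in_local[nxt], image_in,
                                                      ... /* tile + 1 */);
    }

    // Wait for the current input tile, then compute it entirely in __local
    wait_group_events(1, &ev_in[cur]);
    sobel_compute_block(in_local[cur], out_local[cur], ...);

    // Send the result back to __global (this copy could be double-buffered too)
    event_t ev_out = async_work_group_copy_block_2D2D(image_out, out_local[cur],
                                                      ... /* tile */);
    wait_group_events(1, &ev_out);
}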
After `PAPI` instrumentation of the `compute_block()` function, you will notice that its execution time can take up to 90% of the kernel time. We can now perform some optimizations on it:
- Change the compute type from `float` to `short`
- Vectorization by manual loop-unrolling
- We can also use the Kalray C-Native extension (see the Kalray OpenCL manual) to plug an external optimized C or assembly kernel of `compute_block()` into the OpenCL steps (see Step-7).
static void sobel_compute_block_step_5(__local uchar *block_in_local,
__local uchar *block_out_local,
int block_in_row_stride, int block_out_row_stride,
int block_width, int block_height,
int block_width_halo, int block_height_halo,
float scale)
{
...
for (int irow = irow_begin; irow < irow_end; irow++)
{
int icol = 0;
for (; icol + 8 + HALO_SIZE <= block_width_halo; icol += 8)
{
// load neighbors
short8 c0 = convert_short8(vload8(...));
short8 c1 = convert_short8(vload8(...));
short8 c2 = convert_short8(vload8(...));
short8 n0 = convert_short8(vload8(...));
short8 n2 = convert_short8(vload8(...));
short8 t0 = convert_short8(vload8(...));
short8 t1 = convert_short8(vload8(...));
short8 t2 = convert_short8(vload8(...));
// compute
float8 magx = ...;
float8 magy = ...;
float8 mag = ...;
// store pixel
vstore8(...);
}
...
for (; icol < block_width; icol++)
{
// load neighbors
...
// compute
...
// store pixel
...
}
}
// sync to gather result from all WI
barrier(CLK_LOCAL_MEM_FENCE);
}
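The elided vector arithmetic in the unrolled loop above could look like the following (a sketch consistent with the Step-0 math; the exact code is in `sobel.cl`):

    // compute (gradients in 16-bit, magnitude in float)
    float8 magx = convert_float8((short8)2 * (n2 - n0) + (c2 - c0 + t2 - t0));
    float8 magy = convert_float8((short8)2 * (t1 - c1) + (t0 - c0 + t2 - c2));
    float8 mag  = hypot(magx, magy) * scale;
    // store pixel
    vstore8(convert_uchar8_sat(mag), 0,
            &block_out_local[irow * block_out_row_stride + icol]);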
After the various optimizations in Step-5, the performance is still bounded by the square-root function (or `hypot()`). In this step we apply the Fast-inverse-square-root algorithm to implement a new `fast_hypot()` function.
NOTE: In some circumstances, this new `fast_hypot()` may produce output pixels with a difference of +-1 compared to the standard IEEE-754 implementation. The correctness check is accordingly relaxed with a tolerance of +-1 (`host.c`).
static inline float8 Q_rsqrt(float8 number)
{
const float8 x2 = number * 0.5f;
const float threehalfs = 1.5f;
union {
float8 f;
uint8 i;
} conv = { .f = number };
conv.i = 0x5f3759df - ( conv.i >> 1 );
conv.f *= threehalfs - ( x2 * conv.f * conv.f );
return conv.f;
}
static inline float8 fast_hypot(float8 magx, float8 magy)
{
const float8 sum_pow_xy = (magx * magx) + (magy * magy);
const float8 rsqrt = Q_rsqrt(sum_pow_xy);
return sum_pow_xy * rsqrt;
}
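Note that `fast_hypot()` returns `(magx^2 + magy^2) * rsqrt(magx^2 + magy^2) = sqrt(magx^2 + magy^2)`, i.e. the same hypotenuse as `hypot()`, while only requiring the much cheaper approximate inverse square root.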
We now use the Native function extension to optimize the `compute_block()` function even further, in a separate C source file that is plugged into a new OpenCL kernel. We particularly deal with the square-root operations, which are still expensive even with the Fast-inverse-square-root algorithm. The square-root function is tuned in the native C version through compilation flags (typically `-ffast-math`, etc.) to enable the MPPA hardware-based inverse-square-root instruction.
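As an illustration (a hypothetical sketch; the actual, vectorized code is in `sobel_compute_block.c`), the native C hot path boils down to plain C such as the following, where the compiler may lower `sqrtf()` to the hardware (inverse-)square-root instruction when `-ffast-math` is enabled:

#include <math.h>
#include <stdint.h>

/* Hypothetical sketch of one output pixel in the native C kernel. */
static inline uint8_t sobel_pixel(const uint8_t *row0, const uint8_t *row1,
                                  const uint8_t *row2, int x, float scale)
{
    int gx = (row0[x+2] - row0[x]) + 2 * (row1[x+2] - row1[x]) + (row2[x+2] - row2[x]);
    int gy = (row2[x]   - row0[x]) + 2 * (row2[x+1] - row0[x+1]) + (row2[x+2] - row0[x+2]);
    /* With -ffast-math, sqrtf() may be mapped to the MPPA hardware
     * (inverse-)square-root instruction instead of a libm call. */
    float mag = sqrtf((float)(gx * gx + gy * gy)) * scale;
    return (mag > 255.0f) ? 255 : (uint8_t)mag;
}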
The reader is invited to refer to `sobel_compute_block.c` and the `Makefile` for how this is written, compiled and linked.
NOTE: Similar to Step-6, enabling `-ffast-math` and the hardware-based instruction may cause a +-1 difference versus the standard IEEE-754 implementation.
Below is the performance of all steps, and their speedup compared to Step-0, on a large image (`curious_cat.png`, 2160x3840). The output image of each step is also checked for correctness against the output of the reference Step-0 (with a +-1 tolerance).
$ make run_hw OUTPUT_DISPLAY=0 INPUT_IMAGE_PNG=images/curious_cat.png
...
[HOST] Kernel sobel_step_0(): Host cold 96.828 ms hot 95.695 ms - Device cold 93.104 ms hot 93.016 ms - Speedup vs. Step-0 1.00 [PASSED]
[HOST] Kernel sobel_step_1(): Host cold 39.743 ms hot 38.124 ms - Device cold 35.109 ms hot 35.123 ms - Speedup vs. Step-0 2.65 [PASSED]
[HOST] Kernel sobel_step_2(): Host cold 39.681 ms hot 37.783 ms - Device cold 35.160 ms hot 35.168 ms - Speedup vs. Step-0 2.64 [PASSED]
[HOST] Kernel sobel_step_3(): Host cold 39.383 ms hot 36.149 ms - Device cold 34.781 ms hot 34.823 ms - Speedup vs. Step-0 2.67 [PASSED]
[HOST] Kernel sobel_step_4(): Host cold 34.748 ms hot 32.905 ms - Device cold 30.306 ms hot 30.268 ms - Speedup vs. Step-0 3.07 [PASSED]
[HOST] Kernel sobel_step_4_PAPI(): Host cold 39.764 ms hot 36.826 ms - Device cold 35.679 ms hot 34.177 ms - Speedup vs. Step-0 2.72 [PASSED]
[HOST] Kernel sobel_step_5(): Host cold 30.958 ms hot 29.727 ms - Device cold 27.214 ms hot 27.168 ms - Speedup vs. Step-0 3.42 [PASSED]
[HOST] Kernel sobel_step_6_fast(): Host cold 7.288 ms hot 4.452 ms - Device cold 3.419 ms hot 3.400 ms - Speedup vs. Step-0 27.36 [PASSED]
[HOST] Kernel sobel_step_7_native(): Host cold 4.521 ms hot 4.538 ms - Device cold 2.100 ms hot 2.098 ms - Speedup vs. Step-0 44.34 [PASSED] (HAVE_FAST_MATH = 1)
The performance gain is more significant on large images than on small ones, since the double-buffering tiling algorithm needs a certain number of tiles (a dozen or so) per workgroup to become efficient.
The user can disable `-ffast-math` on the C-native kernel to get an idea of the potential performance gain brought by the hardware arithmetic features (Step-7 drops from a x44 speedup to x4).
$ make run_hw OUTPUT_DISPLAY=0 INPUT_IMAGE_PNG=images/curious_cat.png HAVE_FAST_MATH=0
...
[HOST] Kernel sobel_step_0(): Host cold 97.299 ms hot 95.973 ms - Device cold 93.067 ms hot 93.074 ms - Speedup vs. Step-0 1.00 [PASSED]
[HOST] Kernel sobel_step_1(): Host cold 39.292 ms hot 36.222 ms - Device cold 35.095 ms hot 35.088 ms - Speedup vs. Step-0 2.65 [PASSED]
[HOST] Kernel sobel_step_2(): Host cold 39.468 ms hot 37.794 ms - Device cold 35.158 ms hot 35.193 ms - Speedup vs. Step-0 2.64 [PASSED]
[HOST] Kernel sobel_step_3(): Host cold 38.928 ms hot 37.745 ms - Device cold 34.826 ms hot 34.815 ms - Speedup vs. Step-0 2.67 [PASSED]
[HOST] Kernel sobel_step_4(): Host cold 34.626 ms hot 32.880 ms - Device cold 30.310 ms hot 30.284 ms - Speedup vs. Step-0 3.07 [PASSED]
[HOST] Kernel sobel_step_4_PAPI(): Host cold 39.913 ms hot 35.383 ms - Device cold 35.453 ms hot 34.273 ms - Speedup vs. Step-0 2.72 [PASSED]
[HOST] Kernel sobel_step_5(): Host cold 31.536 ms hot 29.855 ms - Device cold 27.204 ms hot 27.107 ms - Speedup vs. Step-0 3.43 [PASSED]
[HOST] Kernel sobel_step_6_fast(): Host cold 7.997 ms hot 6.068 ms - Device cold 3.425 ms hot 3.415 ms - Speedup vs. Step-0 27.25 [PASSED]
[HOST] Kernel sobel_step_7_native(): Host cold 25.981 ms hot 26.213 ms - Device cold 23.464 ms hot 23.483 ms - Speedup vs. Step-0 3.96 [PASSED] (HAVE_FAST_MATH = 0)
Depending on the application and computation kernel, the bread-and-butter gains may come from either compute-bound or memory-bound aspects (or both):
- Mathematical functions: trigonometric (`sin`, `cos`, `atan`, `hypot`, etc.) or power and logarithmic (`pow`, `exp`, `log`)
- Floating-point and/or integer operations (widening, narrowing, mixed-precision, etc.)
- Vectorization or "matrix-fication" of processing (and how to exploit all available hardware capacity)
- Compilers and build flags (`-O2`, `-O3`, `-ffast-math`, `-ffp-contract=*`, `-flto -fipa-*`, etc.)
- Memory traffic (access pattern, cache, DMA, prefetch, etc.)
- ...
[1] KhronosGroup/OpenCL-Docs#573
[2] Hascoët, Julien, et al. "Asynchronous one-sided communications and synchronizations for a clustered manycore processor." Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia. 2017.
[3] Julien Hascoet, Benoît Dupont de Dinechin, Karol Desnos, Jean-Francois Nezan. A Distributed Framework for Low-Latency OpenVX over the RDMA NoC of a Clustered Manycore. IEEE High Performance Extreme Computing Conference (HPEC 2018), Sep 2018, Waltham, MA, United States. ⟨10.1109/hpec.2018.8547736⟩. ⟨hal-02049414⟩
[4] Minh-Quan Ho, Christian Obrecht, Bernard Tourancheau, Benoît Dupont de Dinechin, Julien Hascoet. Improving 3D Lattice Boltzmann Method stencil with asynchronous transfers on many-core processors. 36th IEEE International Performance Computing and Communications Conference (IPCCC 2017), Dec 2017, San Diego, United States. ⟨hal-01652614⟩