- Introduction
- OpenCL-C asynchronous copy primitives
- Step 0: Reference GPU-friendly kernel
- Step 1: Explicit tiling
- Step 2: Step-1 + `__local` on input
- Step 3: Step-2 + `__local` on output
- Step 4: Step-3 + double-buffering overlapping
- Step 5: Step-4 + manual loop unroll
- Step 6: Step-5 + Fast-inverse-square-root
- Step 7: Step-5 + Native-C kernel + `-ffast-math`
- Final results
- References
This is a sandbox project aiming to demonstrate, step by step, optimizations of stencil and image-processing algorithms on clustered many-core architectures fitted with local memories and DMA capabilities, such as the Kalray MPPA many-core processor. OpenCL-C is used as the programming language, and the 3x3 Sobel filter as the computation kernel.
This example walks from a first GPU-friendly reference kernel through increasingly advanced optimizations. The goal is to highlight key methods for obtaining high performance and to demonstrate each of them on the Kalray MPPA architecture. The guiding principles are:
- Keep it as simple as possible
- Focus on the data-transfer and processing part on Accelerator (OpenCL kernels in *.cl)
- Optimize performance as much as possible, as long as it remains easy to understand
- Insightful comments and clear coding style
- Process some real images, like a hands-on demo
├── host.c
├── images
│ ├── curious_cat.png
│ └── valve_original.png
├── Makefile
├── ocl_utils.c
├── ocl_utils.h
├── png_utils.c
├── png_utils.h
├── sobel.cl
├── sobel_compute_block.c
├── sobel_config.h
└── README.md
The most important files are `sobel.cl` and `host.c`. The OpenCL kernels in `sobel.cl` are structured in several chronological/incremental steps.
This README is a short summary of these optimization steps. The reader is strongly recommended to follow those kernels in `sobel.cl` step by step. From step 4 on, the reader should also refer to `host.c` for the execution deployment parameters (`clEnqueueNDRangeKernel()`).
Requirements:
- Kalray AccessCore Embedded (ACE) SDK >= 4.5.0
- `libpng` >= 1.6
- `imagemagick` and its provided `display` command
Example install command on Ubuntu 18.04:
sudo apt install libpng16-16 imagemagick
Run:
$ make run_hw [INPUT_IMAGE_PNG=/.../.../image.png] [OUTPUT_DISPLAY=0|1] \
[HAVE_FAST_MATH=0|1]
# INPUT_IMAGE_PNG : Path to user PNG image (default = /images/valve_original.png)
# OUTPUT_DISPLAY : Enable/Disable display of all output images (default = 1)
# HAVE_FAST_MATH : Enable/Disable -ffast-math in native C kernels (default = 1)
The Sobel operator is widely known and used, among others, as an edge-detection algorithm. For each output pixel (y, x), the horizontal and vertical derivatives G_x and G_y are computed as follows (source: Wikipedia):

          [ -1  0  +1 ]               [ -1  -2  -1 ]
    G_x = [ -2  0  +2 ] * A     G_y = [  0   0   0 ] * A
          [ -1  0  +1 ]               [ +1  +2  +1 ]

in which `*` denotes the image-processing convolution and `A` is a 3x3 sub-view of the input image around the pixel (y, x):
+------+------+------+
| | | |
+------+------+------+
A = | |(y, x)| |
+------+------+------+
| | | |
+------+------+------+
Then the output pixel is computed from the gradient magnitude:

    out(y, x) = sqrt(G_x^2 + G_y^2)

(scaled by `scale` and saturated to `uchar` in this implementation).
To keep the boundary condition and the different optimization steps simple, we implement in this project a "shifted" Sobel operator: the pixel computed at (y, x) is written to index (y-1, x-1) instead of (y, x). This produces an output image "shifted" by one pixel towards the top-left.
+------+------+------+ +---------+--------+--------+
| | | | |(y-1,x-1)| | |
+------+------+------+ write to +---------+--------+--------+
| |(y, x)| | =========> | | | |
+------+------+------+ +---------+--------+--------+
| | | | | | | |
+------+------+------+ +---------+--------+--------+
The `BORDER_REPLICATE` boundary condition is employed for out-of-bound pixels, with direct `min()` or `clamp()` on neighbor indices, without explicit padding of the input image.
The OpenCL 1.2 async primitives and more recent OpenCL 3.0 DMA extension are supported by the Kalray OpenCL driver (>= ACE-4.5.0).
Besides supporting above-mentioned OpenCL 1.2 and 3.0 asynchronous primitives, Kalray also added its own async-copy extension in order to reduce the programming effort when optimizing image processing and other stencil kernels on embedded platforms such as MPPA.
The motivation for this Kalray async-copy extension is reported in [1]. Further fundamental rationale and technical implementation details of these DMA-based primitives were also studied in [2], [3] and [4].
In short, the Kalray block-2D2D async-copy extension is illustrated in the figure below, in which a generic 2D-copy is characterized by:
- the begin addresses of the `__global` image and the `__local` buffer
- an extent block of `height x width` elements to be copied
- the read-coordinates and dimensions of the `__global` image
- the write-coordinates and dimensions of the `__local` buffer
An element may contain one or several `GENTYPE`s, in the same way that a pixel may have one (gray), three (YUV or RGB) or four (RGBA) components. The developer builds a tiling algorithm and reasons in terms of pixels, not bytes. Only when calling `async_copy_*()` is a `num_gentype_per_elem` passed to specify the pixel composition (1, 3, 4 or even 7, 9, 13, etc.).
These Kalray 2D/3D primitives always take the original data pointers as arguments (`__global` image, `__local` buffer), gather all the parameters above and automatically perform the byte-size calculations and address offsets under the hood. The developer no longer needs to carry out this error-prone process by hand.
/**
* async_work_group_copy_block_2D2D between global and local
*
* @param dst dst pointer
* @param src src pointer
* @param num_gentype_per_elem Number of gentypes in an element of the 2D block
* @param block_size Dimension in elements of the 2D block {width, height}.
* Total copied gentypes is
* (num_gentype_per_elem * block_size.x * block_size.y)
* @param local_point Local copy indexes and buffer size
* {xpos, ypos, xdim, ydim}
* @param global_point Global copy indexes and buffer size
* {xpos, ypos, xdim, ydim}
* @param event event associated to the transfer
* @return event associated to the transfer
*/
event_t async_work_group_copy_block_2D2D(__local GENTYPE *dst,
const __global GENTYPE *src,
size_t num_gentype_per_elem,
int2 block_size,
int4 local_point,
int4 global_point,
event_t event);
event_t async_work_group_copy_block_2D2D(__global GENTYPE *dst,
const __local GENTYPE *src,
size_t num_gentype_per_elem,
int2 block_size,
int4 local_point,
int4 global_point,
event_t event);
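As an illustration, below is a minimal usage sketch (not taken from `sobel.cl`; names such as `tile_x`/`tile_y` and the tile-size macros are assumed to be defined by the caller). It copies one gray-scale input tile of `(TILE_HEIGHT + HALO_SIZE) x (TILE_WIDTH + HALO_SIZE)` pixels from the `__global` image into a `__local` buffer; boundary handling at the image borders is omitted:

// Minimal usage sketch (hypothetical names): copy one gray-scale input tile,
// including its halo, from the __global image into a __local buffer.
__local uchar block_in_local[(TILE_HEIGHT + HALO_SIZE) * (TILE_WIDTH + HALO_SIZE)];

event_t event = async_work_group_copy_block_2D2D(
        block_in_local,                           // __local destination buffer
        image_in,                                 // __global source image
        1,                                        // num_gentype_per_elem: 1 uchar per gray pixel
        (int2)(TILE_WIDTH + HALO_SIZE,            // block width in elements
               TILE_HEIGHT + HALO_SIZE),          // block height in elements
        (int4)(0, 0,                              // write position in the __local buffer
               TILE_WIDTH + HALO_SIZE,            // __local buffer width
               TILE_HEIGHT + HALO_SIZE),          // __local buffer height
        (int4)(tile_x, tile_y,                    // read position in the __global image
               image_width, image_height),        // __global image dimensions
        0);                                       // 0: let the call create a new event
wait_group_events(1, &event);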
From now on, you can either open `sobel.cl` for a deep dive into the code, or keep reading until the end for a quick overview of the optimization techniques. Some details in this README have been intentionally omitted for the sake of readability. Those missing points can, of course, be found in `sobel.cl` (and `host.c`).
This first step is implemented in the well-known data-parallel paradigm, designed to run efficiently on GPUs with the SIMT execution model. This kernel serves as the reference and performance baseline on the Kalray MPPA accelerator.
Host configuration: (one work-item per output pixel)
clGetDeviceInfo(CL_DEVICE_MAX_WORK_GROUP_SIZE, ..., &max_workgroup_size);
struct kernel_desc_s kernel_desc[] = {
{
.name = "sobel_step_0",
.globalSize = {ALIGN_MULT_UP(image_width, max_workgroup_size), image_height},
.localSize = {max_workgroup_size, 1},
},
};
kernel_desc[i].ocl_kernel = clCreateKernel(program, kernel_desc[i].name, &err);
...
clEnqueueNDRangeKernel(queue, kernel_desc[i].ocl_kernel, 2, NULL,
kernel_desc[i].globalSize, kernel_desc[i].localSize, ...);
OpenCL kernel:
// Boundary condition
#define IMAGE_INDEX(y, x) \
((min(y, image_height-1) * image_width) + min(x, image_width-1))
__kernel void sobel_step_0(__global uchar *image_in, __global uchar *image_out,
int image_width, int image_height, float scale)
{
int x = get_global_id(0);
int y = get_global_id(1);
if (x >= image_width || y >= image_height) { return; }
// load neighbors
float c0 = image_in[IMAGE_INDEX(y+0, x+0)];
float c1 = image_in[IMAGE_INDEX(y+0, x+1)];
float c2 = image_in[IMAGE_INDEX(y+0, x+2)];
float n0 = image_in[IMAGE_INDEX(y+1, x+0)];
float n2 = image_in[IMAGE_INDEX(y+1, x+2)];
float t0 = image_in[IMAGE_INDEX(y+2, x+0)];
float t1 = image_in[IMAGE_INDEX(y+2, x+1)];
float t2 = image_in[IMAGE_INDEX(y+2, x+2)];
// compute
float magx = mad(2.0f, (n2 - n0), (c2 - c0 + t2 - t0));
float magy = mad(2.0f, (t1 - c1), (t0 - c0 + t2 - c2));
float mag = hypot(magx, magy) * scale;
// store pixel
image_out[IMAGE_INDEX(y, x)] = convert_uchar_sat(mag);
}
Explicit tiling is different from so-called implicit tiling.
Implicit tiling is done by the OpenCL mapping of input pixels onto two-dimensional workgroups. This model usually maps one pixel to one work-item (cf. Step-0), and the tile size (aka the workgroup size in this case) is determined by the `global_work_size[]` and `local_work_size[]` arguments of `clEnqueueNDRangeKernel()`.
Explicit tiling, on the other hand, does not rely on the workgroup shape to perform tiling, but on user-defined `TILE_HEIGHT x TILE_WIDTH` tiles. The OpenCL kernel may need to be modified as well to handle arbitrary image sizes, regardless of the number of work-items or the workgroup topology.
// NOTE: "Shifted Sobel filter" whose the output tile is shifted back to
// top-left by one pixel, yielding the below stencil block with HALO_SIZE == 2:
//
// |<------ TILE_WIDTH ------->| HALO_SIZE (== 2)
// -----------+---------------------------+----+
// ʌ | | |
// | | | |
// | | | |
// | | | |
// TILE_HEIGHT | Tile | |
// | | | |
// | | | |
// v | | |
// -----------+---------------------------+ |
// HALO_SIZE | |
// +--------------------------------+
The Host configuration now spawns one workgroup per tile:
struct kernel_desc_s kernel_desc[] = {
...
{
.name = "sobel_step_1",
.globalSize = {(ceil(((double)image_width) / TILE_WIDTH)) * max_workgroup_size,
(ceil(((double)image_height) / TILE_HEIGHT)) * 1},
.localSize = {max_workgroup_size, 1},
},
};
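As a worked example (with purely illustrative tile sizes; see `sobel_config.h` for the actual configuration), choosing TILE_WIDTH = 512 and TILE_HEIGHT = 16 would split a 3840x2160 image into ceil(3840/512) x ceil(2160/16) = 8 x 135 = 1080 tiles, i.e. 1080 workgroups.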
For the sake of abstraction, a `sobel_compute_block()` function is introduced for each step that performs tiling. This separates (1) data-transfer optimization from (2) arithmetic computation, which are two different tracks, though both are tied to the final performance.
static void sobel_compute_block_step_1(__global uchar *block_in,
__global uchar *block_out,
int block_row_stride,
int block_width, int block_height,
int block_width_halo, int block_height_halo,
float scale)
{
// dispatch rows of block block_height x block_width on workitems
const int irow_begin = ...;
const int irow_end = ...;
...
for (int irow = irow_begin; irow < irow_end; irow++)
{
for (int icol = 0; icol < block_width; icol++)
{
// load neighbors
...
// compute
...
// store pixel
...
}
}
}
__kernel void sobel_step_1(__global uchar *image_in, __global uchar *image_out,
int image_width, int image_height, float scale)
{
const int group_idx = get_group_id(0);
const int group_idy = get_group_id(1);
// top-left pixel coordinates of the tile handled by this workgroup
const int block_idx = group_idx * TILE_WIDTH;
const int block_idy = group_idy * TILE_HEIGHT;
const ulong block_offset = ((ulong)block_idy * image_width) + block_idx;
__global uchar *block_in = image_in + block_offset;
__global uchar *block_out = image_out + block_offset;
...
sobel_compute_block_step_1(block_in, block_out, ...);
}
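The row dispatch elided in `sobel_compute_block_step_1()` above can, for instance, split the block rows evenly across the work-items of the workgroup (a possible sketch, not necessarily the exact scheme used in `sobel.cl`):

// dispatch rows of the block evenly on the work-items of the workgroup
const int num_wi      = (int)get_local_size(0);
const int wi_id       = (int)get_local_id(0);
const int rows_per_wi = (block_height + num_wi - 1) / num_wi;   // ceiling division
const int irow_begin  = min(wi_id * rows_per_wi, block_height);
const int irow_end    = min(irow_begin + rows_per_wi, block_height);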
This step introduces the first use of `__local` memory on the MPPA device, as a replacement for DRAM `__global` memory, combined with the DMA-backed asynchronous copy primitives (OpenCL 1.2 specification and Kalray extension).
__kernel void sobel_step_2(__global uchar *image_in, __global uchar *image_out,
int image_width, int image_height, float scale)
{
__local uchar block_in_local[(TILE_HEIGHT + HALO_SIZE) * (TILE_WIDTH + HALO_SIZE)];
...
// ------------------------------------------------------------
// Before computing, copy data to __local
// (block_height_halo x block_width_halo)
// ------------------------------------------------------------
event = async_work_group_copy_block_2D2D(block_in_local, image_in, ...);
wait_group_events(1, &event);
// ------------------------------------------------------------
// Compute (same as sobel_compute_block_step_1() but with
// block_in_local[])
// ------------------------------------------------------------
sobel_compute_block_step_2(block_in_local, block_out,
block_in_row_stride, block_out_row_stride,
block_width, block_height,
block_width_halo, block_height_halo,
scale);
}
Same as Step-2, with an additional `__local` buffer for storing the output tile before sending it back to `__global` memory.
__kernel void sobel_step_3(__global uchar *image_in, __global uchar *image_out,
int image_width, int image_height, float scale)
{
__local uchar block_in_local [(TILE_HEIGHT + HALO_SIZE) * (TILE_WIDTH + HALO_SIZE)];
__local uchar block_out_local[ TILE_HEIGHT * TILE_WIDTH ];
...
// ------------------------------------------------------------
// Before computing, copy data to __local
// (block_height_halo x block_width_halo)
// ------------------------------------------------------------
event = async_work_group_copy_block_2D2D(block_in_local, image_in, ...);
wait_group_events(1, &event);
// ------------------------------------------------------------
// Compute (same as sobel_compute_block_step_1() but with
// block_in_local[])
// ------------------------------------------------------------
sobel_compute_block_step_3(block_in_local, block_out_local, ...);
// ------------------------------------------------------------
// After computing, send result to __global
// (block_height x block_width)
// ------------------------------------------------------------
event = async_work_group_copy_block_2D2D(image_out, block_out_local, ...);
wait_group_events(1, &event);
}
Optimize data transfer between `__global` and `__local` with double-buffering and overlapping. This is the ultimate optimization step on data transfer.
Unlike the previous steps 1/2/3, in which each workgroup computes one tile and there are as many workgroups as tiles in the image, this step spawns only 5 workgroups (`CL_DEVICE_MAX_COMPUTE_UNITS`). Each workgroup is in charge of multiple tiles and overlaps transfers with computation through double-buffering and async-copy.
The reader is invited to refer to the state-of-the-art studies on this subject (e.g. [3], [4] and many others).
Host configuration:
clGetDeviceInfo(CL_DEVICE_MAX_COMPUTE_UNITS, ..., &max_compute_units);
struct kernel_desc_s kernel_desc[] = {
...
{
.name = "sobel_step_4",
.globalSize = {max_workgroup_size * max_compute_units, 1},
.localSize = {max_workgroup_size, 1},
},
};
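For reference, a simplified kernel-side sketch of the double-buffered loop is given below. The tile bookkeeping (`first_tile`, `last_tile`) and the elided copy parameters are purely illustrative; the complete implementation lives in `sobel.cl`:

// Hypothetical double-buffering sketch: while the current tile is being
// computed from __local memory, the DMA already fetches the next one.
__local uchar in_local [2][(TILE_HEIGHT + HALO_SIZE) * (TILE_WIDTH + HALO_SIZE)];
__local uchar out_local[2][ TILE_HEIGHT * TILE_WIDTH ];

event_t ev_in[2];

// Prefetch the first tile assigned to this workgroup
ev_in[first_tile & 1] =
    async_work_group_copy_block_2D2D(in_local[first_tile & 1], image_in,
                                     ... /* parameters of tile 'first_tile' */);

for (int tile = first_tile; tile < last_tile; tile++)
{
    const int cur = tile & 1;
    const int nxt = (tile + 1) & 1;

    // Start fetching the NEXT input tile while the CURRENT one is processed
    if (tile + 1 < last_tile) {
        ev_in[nxt] = async_work_group_copy_block_2D2D(in_local[nxt], image_in,
                                                      ... /* tile + 1 */);
    }

    // Wait for the current input tile, then compute it entirely in __local
    wait_group_events(1, &ev_in[cur]);
    sobel_compute_block(in_local[cur], out_local[cur], ...);

    // Send the result back to __global (this copy could be double-buffered too)
    event_t ev_out = async_work_group_copy_block_2D2D(image_out, out_local[cur],
                                                      ... /* tile */);
    wait_group_events(1, &ev_out);
}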
After `PAPI` instrumentation of the `compute_block()` function, you will notice that its execution time can take up to 90% of the kernel time. We can now perform some optimizations on it:
- Change the compute type from `float` to `short`
- Vectorization by manual loop-unrolling
- We can also use the Kalray C-Native extension (see the Kalray OpenCL manual) to plug an external optimized C or assembly kernel of `compute_block()` into the OpenCL steps (see Step-7).
static void sobel_compute_block_step_5(__local uchar *block_in_local,
__local uchar *block_out_local,
int block_in_row_stride, int block_out_row_stride,
int block_width, int block_height,
int block_width_halo, int block_height_halo,
float scale)
{
...
for (int irow = irow_begin; irow < irow_end; irow++)
{
int icol = 0;
for (; icol + 8 + HALO_SIZE <= block_width_halo; icol += 8)
{
// load neighbors
short8 c0 = convert_short8(vload8(...));
short8 c1 = convert_short8(vload8(...));
short8 c2 = convert_short8(vload8(...));
short8 n0 = convert_short8(vload8(...));
short8 n2 = convert_short8(vload8(...));
short8 t0 = convert_short8(vload8(...));
short8 t1 = convert_short8(vload8(...));
short8 t2 = convert_short8(vload8(...));
// compute
float8 magx = ...;
float8 magy = ...;
float8 mag = ...;
// store pixel
vstore8(...);
}
...
for (; icol < block_width; icol++)
{
// load neighbors
...
// compute
...
// store pixel
...
}
}
// sync to gather result from all WI
barrier(CLK_LOCAL_MEM_FENCE);
}
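The elided vector arithmetic in the unrolled loop above could look like the following (a sketch consistent with the Step-0 math; the exact code is in `sobel.cl`):

    // compute (gradients in 16-bit, magnitude in float)
    float8 magx = convert_float8((short8)2 * (n2 - n0) + (c2 - c0 + t2 - t0));
    float8 magy = convert_float8((short8)2 * (t1 - c1) + (t0 - c0 + t2 - c2));
    float8 mag  = hypot(magx, magy) * scale;
    // store pixel
    vstore8(convert_uchar8_sat(mag), 0,
            &block_out_local[irow * block_out_row_stride + icol]);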
After the various optimizations in Step-5, the performance is still bounded by the square-root function (or `hypot()`). In this step we apply the Fast-inverse-square-root algorithm to implement a new `fast_hypot()` function.
NOTE: In some circumstances, this new `fast_hypot()` may produce output pixels with a difference of +-1 compared to the standard IEEE-754 implementation. The correctness check is accordingly relaxed with a tolerance of +-1 (`host.c`).
static inline float8 Q_rsqrt(float8 number)
{
const float8 x2 = number * 0.5f;
const float threehalfs = 1.5f;
union {
float8 f;
uint8 i;
} conv = { .f = number };
conv.i = 0x5f3759df - ( conv.i >> 1 );
conv.f *= threehalfs - ( x2 * conv.f * conv.f );
return conv.f;
}
static inline float8 fast_hypot(float8 magx, float8 magy)
{
const float8 sum_pow_xy = (magx * magx) + (magy * magy);
const float8 rsqrt = Q_rsqrt(sum_pow_xy);
return sum_pow_xy * rsqrt;
}
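Note that `fast_hypot()` returns `(magx^2 + magy^2) * rsqrt(magx^2 + magy^2) = sqrt(magx^2 + magy^2)`, i.e. the same hypotenuse as `hypot()`, while only requiring the much cheaper approximate inverse square root.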
We now use the Native function extension to optimize the `compute_block()` function even further, in a separate C source file that is plugged into a new OpenCL kernel. We particularly deal with the square-root operations, which are still expensive even with the Fast-inverse-square-root algorithm. The square-root function is tuned in the native C version through compilation flags (typically `-ffast-math`, etc.) to enable the MPPA hardware-based inverse-square-root instruction.
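As an illustration (a hypothetical sketch; the actual, vectorized code is in `sobel_compute_block.c`), the native C hot path boils down to plain C such as the following, where the compiler may lower `sqrtf()` to the hardware (inverse-)square-root instruction when `-ffast-math` is enabled:

#include <math.h>
#include <stdint.h>

/* Hypothetical sketch of one output pixel in the native C kernel. */
static inline uint8_t sobel_pixel(const uint8_t *row0, const uint8_t *row1,
                                  const uint8_t *row2, int x, float scale)
{
    int gx = (row0[x+2] - row0[x]) + 2 * (row1[x+2] - row1[x]) + (row2[x+2] - row2[x]);
    int gy = (row2[x]   - row0[x]) + 2 * (row2[x+1] - row0[x+1]) + (row2[x+2] - row0[x+2]);
    /* With -ffast-math, sqrtf() may be mapped to the MPPA hardware
     * (inverse-)square-root instruction instead of a libm call. */
    float mag = sqrtf((float)(gx * gx + gy * gy)) * scale;
    return (mag > 255.0f) ? 255 : (uint8_t)mag;
}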
The reader is invited to refer to `sobel_compute_block.c` and the `Makefile` for how this is written, compiled and linked.
NOTE: Similar to Step-6, enabling `-ffast-math` and the hardware-based instruction may cause a +-1 difference versus the standard IEEE-754 implementation.
Below is the performance of all steps, and their speedup compared to Step-0, on a large image (`curious_cat.png`, 2160x3840). The output image of each step is also checked for correctness against the output of the reference Step-0 (with a +-1 tolerance).
$ make run_hw OUTPUT_DISPLAY=0 INPUT_IMAGE_PNG=images/curious_cat.png
...
[HOST] Kernel sobel_step_0(): Host cold 96.828 ms hot 95.695 ms - Device cold 93.104 ms hot 93.016 ms - Speedup vs. Step-0 1.00 [PASSED]
[HOST] Kernel sobel_step_1(): Host cold 39.743 ms hot 38.124 ms - Device cold 35.109 ms hot 35.123 ms - Speedup vs. Step-0 2.65 [PASSED]
[HOST] Kernel sobel_step_2(): Host cold 39.681 ms hot 37.783 ms - Device cold 35.160 ms hot 35.168 ms - Speedup vs. Step-0 2.64 [PASSED]
[HOST] Kernel sobel_step_3(): Host cold 39.383 ms hot 36.149 ms - Device cold 34.781 ms hot 34.823 ms - Speedup vs. Step-0 2.67 [PASSED]
[HOST] Kernel sobel_step_4(): Host cold 34.748 ms hot 32.905 ms - Device cold 30.306 ms hot 30.268 ms - Speedup vs. Step-0 3.07 [PASSED]
[HOST] Kernel sobel_step_4_PAPI(): Host cold 39.764 ms hot 36.826 ms - Device cold 35.679 ms hot 34.177 ms - Speedup vs. Step-0 2.72 [PASSED]
[HOST] Kernel sobel_step_5(): Host cold 30.958 ms hot 29.727 ms - Device cold 27.214 ms hot 27.168 ms - Speedup vs. Step-0 3.42 [PASSED]
[HOST] Kernel sobel_step_6_fast(): Host cold 7.288 ms hot 4.452 ms - Device cold 3.419 ms hot 3.400 ms - Speedup vs. Step-0 27.36 [PASSED]
[HOST] Kernel sobel_step_7_native(): Host cold 4.521 ms hot 4.538 ms - Device cold 2.100 ms hot 2.098 ms - Speedup vs. Step-0 44.34 [PASSED] (HAVE_FAST_MATH = 1)
The performance gain is more significant on large images than on small ones, since the double-buffering tiling algorithm needs a certain number of tiles (a dozen or so) per workgroup to become efficient.
The user can disable `-ffast-math` on the C-native kernel to get an idea of the potential performance gain brought by the hardware arithmetic features (Step-7 drops from a x44 speedup to x4).
$ make run_hw OUTPUT_DISPLAY=0 INPUT_IMAGE_PNG=images/curious_cat.png HAVE_FAST_MATH=0
...
[HOST] Kernel sobel_step_0(): Host cold 97.299 ms hot 95.973 ms - Device cold 93.067 ms hot 93.074 ms - Speedup vs. Step-0 1.00 [PASSED]
[HOST] Kernel sobel_step_1(): Host cold 39.292 ms hot 36.222 ms - Device cold 35.095 ms hot 35.088 ms - Speedup vs. Step-0 2.65 [PASSED]
[HOST] Kernel sobel_step_2(): Host cold 39.468 ms hot 37.794 ms - Device cold 35.158 ms hot 35.193 ms - Speedup vs. Step-0 2.64 [PASSED]
[HOST] Kernel sobel_step_3(): Host cold 38.928 ms hot 37.745 ms - Device cold 34.826 ms hot 34.815 ms - Speedup vs. Step-0 2.67 [PASSED]
[HOST] Kernel sobel_step_4(): Host cold 34.626 ms hot 32.880 ms - Device cold 30.310 ms hot 30.284 ms - Speedup vs. Step-0 3.07 [PASSED]
[HOST] Kernel sobel_step_4_PAPI(): Host cold 39.913 ms hot 35.383 ms - Device cold 35.453 ms hot 34.273 ms - Speedup vs. Step-0 2.72 [PASSED]
[HOST] Kernel sobel_step_5(): Host cold 31.536 ms hot 29.855 ms - Device cold 27.204 ms hot 27.107 ms - Speedup vs. Step-0 3.43 [PASSED]
[HOST] Kernel sobel_step_6_fast(): Host cold 7.997 ms hot 6.068 ms - Device cold 3.425 ms hot 3.415 ms - Speedup vs. Step-0 27.25 [PASSED]
[HOST] Kernel sobel_step_7_native(): Host cold 25.981 ms hot 26.213 ms - Device cold 23.464 ms hot 23.483 ms - Speedup vs. Step-0 3.96 [PASSED] (HAVE_FAST_MATH = 0)
Depending on the application and computation kernel, the bread-and-butter gains may come from either compute-bound or memory-bound aspects (or both):
- Mathematical functions: trigonometric (`sin`, `cos`, `atan`, `hypot`, etc.) or power and logarithmic (`pow`, `exp`, `log`)
- Floating-point and/or integer operations (widening, narrowing, mixed-precision, etc.)
- Vectorization or "matrix-fication" of processing (and how to exploit all available hardware capacity)
- Compilers and build flags (`-O2`, `-O3`, `-ffast-math`, `-ffp-contract=*`, `-flto -fipa-*`, etc.)
- Memory traffic (access pattern, cache, DMA, prefetch, etc.)
- ...
[1] KhronosGroup/OpenCL-Docs#573
[2] Hascoët, Julien, et al. "Asynchronous one-sided communications and synchronizations for a clustered manycore processor." Proceedings of the 15th IEEE/ACM Symposium on Embedded Systems for Real-Time Multimedia. 2017.
[3] Julien Hascoet, Benoît Dupont de Dinechin, Karol Desnos, Jean-Francois Nezan. A Distributed Framework for Low-Latency OpenVX over the RDMA NoC of a Clustered Manycore. IEEE High Performance Extreme Computing Conference (HPEC 2018), Sep 2018, Waltham, MA, United States. ⟨10.1109/hpec.2018.8547736⟩. ⟨hal-02049414⟩
[4] Minh-Quan Ho, Christian Obrecht, Bernard Tourancheau, Benoît Dupont de Dinechin, Julien Hascoet. Improving 3D Lattice Boltzmann Method stencil with asynchronous transfers on many-core processors. 36th IEEE International Performance Computing and Communications Conference (IPCCC 2017), Dec 2017, San Diego, United States. ⟨hal-01652614⟩