This repository contains sample code that is helpful in learning how to program with OpenMP and CUDA. Below, we go through each directory and explain the purpose of each file. We first focus on parallelization using OpenMP, then parallelize large computations on the GPU using CUDA. In the end, a sample project for calculating the determinant of large matrices is developed using both OpenMP and CUDA.
In the OpenMP section, there is a sample program in parallel_for_loop.cpp which, as the name suggests, demonstrates simple for-loop parallelization.
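As a minimal sketch of the idea (the loop body and variable names below are illustrative, not the actual contents of parallel_for_loop.cpp), a parallel for-loop with a reduction looks like this:

```cpp
// Minimal OpenMP parallel for-loop sketch (illustrative only).
#include <cstdio>
#include <omp.h>

int main() {
    const int N = 1000000;
    double sum = 0.0;

    // Each thread works on a share of the iterations; the reduction clause
    // combines the per-thread partial sums at the end of the loop.
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; ++i)
        sum += i * 0.5;

    printf("sum = %f, threads available = %d\n", sum, omp_get_max_threads());
    return 0;
}
```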
In the matrix_add.cpp code, we have three 2D matrices, A, B, and C, where we want to calculate C = A + B. We do this in two ways: i) row-wise parallelization using a single parallel for-loop and ii) parallelized nested for-loops using the collapse() clause. The elements of the A and B matrices are initialized randomly between 0 and 100. The schedule({static, dynamic}, chunk_size) clause specifies how loop iterations are assigned to threads. With a static schedule, chunks of chunk_size successive iterations are assigned to the threads before the loop starts. With a dynamic schedule, each thread requests the next chunk of chunk_size iterations at run time whenever it finishes its current chunk.
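A rough sketch of the two strategies (the matrix size, chunk sizes, and schedules below are assumptions for illustration, not the exact matrix_add.cpp):

```cpp
// Illustrative sketch of row-wise vs. collapsed parallelization of C = A + B.
#include <cstdlib>
#include <omp.h>

#define N 1024

static double A[N][N], B[N][N], C[N][N];

int main() {
    // Random initialization in [0, 100), as described above.
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            A[i][j] = rand() % 100;
            B[i][j] = rand() % 100;
        }

    // i) Row-wise parallelization: each thread receives whole rows.
    #pragma omp parallel for schedule(static, 8)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C[i][j] = A[i][j] + B[i][j];

    // ii) collapse(2) merges the two loops into one N*N iteration space,
    //     here distributed dynamically in chunks of 256 iterations.
    #pragma omp parallel for collapse(2) schedule(dynamic, 256)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            C[i][j] = A[i][j] + B[i][j];

    return 0;
}
```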
In the Mandelbrot section, we parallelize the generation of the Mandelbrot set. The serial.cpp file contains the serial code for generating the set, and we parallelize it using OpenMP. The first step is to parallelize the main for-loop, which is done in For.cpp. The second technique is multi-tasking, which generates multiple OpenMP tasks to carry out the computation.
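A sketch of the task-based idea (this is not the repository's code; the image size, escape-time function, and mapping to the complex plane are illustrative):

```cpp
// Illustrative task-based Mandelbrot sketch: one task per image row.
#include <complex>
#include <vector>
#include <omp.h>

int mandel(std::complex<double> c, int max_iter) {
    std::complex<double> z = 0;
    int it = 0;
    while (std::abs(z) <= 2.0 && it < max_iter) {
        z = z * z + c;
        ++it;
    }
    return it;
}

int main() {
    const int W = 800, H = 600, MAX_ITER = 1000;
    std::vector<int> image(W * H);

    #pragma omp parallel
    #pragma omp single
    for (int y = 0; y < H; ++y) {
        // The single thread creates one task per row; the rest of the team
        // picks the tasks up and executes them concurrently.
        #pragma omp task firstprivate(y) shared(image)
        for (int x = 0; x < W; ++x) {
            std::complex<double> c(-2.0 + 3.0 * x / W, -1.0 + 2.0 * y / H);
            image[y * W + x] = mandel(c, MAX_ITER);
        }
    }
    return 0;
}
```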
One major bottleneck that reduces the program's speed is writing to file. We can use the pipeline technique (as we did in Pipeline.cpp) to parallelize both i) set generation and ii) writing to file.
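One way to express such a pipeline, shown here only as an illustration of the idea rather than the actual Pipeline.cpp, is with OpenMP task dependencies, so that the computation of one chunk overlaps with the file output of the previous chunk:

```cpp
// Illustrative two-stage pipeline: compute chunk k while writing chunk k-1.
#include <cstdio>
#include <omp.h>

#define CHUNKS 8
#define CHUNK_SIZE 4096

static int buf[CHUNKS][CHUNK_SIZE];

int main() {
    FILE *out = fopen("output.txt", "w");
    int write_token = 0;   // dummy variable used only to serialize the write tasks

    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < CHUNKS; ++k) {
        // Stage 1: compute chunk k; different chunks can run concurrently.
        // buf[k][0] stands in for the whole chunk in the dependence clauses.
        #pragma omp task depend(out: buf[k][0]) firstprivate(k)
        for (int i = 0; i < CHUNK_SIZE; ++i)
            buf[k][i] = (k * CHUNK_SIZE + i) % 256;   // placeholder for the real work

        // Stage 2: write chunk k; it waits for its own computation and for the
        // previous write, so the output stays ordered while later chunks are
        // still being computed in parallel.
        #pragma omp task depend(in: buf[k][0]) depend(inout: write_token) firstprivate(k)
        for (int i = 0; i < CHUNK_SIZE; ++i)
            fprintf(out, "%d\n", buf[k][i]);
    }
    // The implicit barrier at the end of the parallel region waits for all tasks.
    fclose(out);
    return 0;
}
```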
In the PrefixSum section, we parallelize the prefix-sum problem. The serial.cpp file contains the serial code for this problem. Naively parallelizing the program introduces race conditions, and handling them can reduce the efficiency of the parallel version. You can see the first parallelization in parallel_simple.cpp. Another approach to parallelizing this problem is the Hillis and Steele algorithm. Although this algorithm might be slower than the serial version on a CPU, it is faster when programmed and run on a GPU.
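A sketch of the Hillis-Steele inclusive scan, double-buffered so each pass only reads the previous buffer and thus avoids the read/write race the naive version has (the function and variable names are illustrative, not taken from the repository files):

```cpp
// Illustrative Hillis-Steele inclusive scan with OpenMP.
#include <cstdio>
#include <utility>
#include <vector>

std::vector<int> hillis_steele_scan(std::vector<int> in) {
    const long n = (long)in.size();
    std::vector<int> out(in.size());
    for (long offset = 1; offset < n; offset *= 2) {
        // Every element can be updated independently because each pass reads
        // from `in` and writes to `out`; the buffers are swapped afterwards.
        #pragma omp parallel for
        for (long i = 0; i < n; ++i)
            out[i] = (i >= offset) ? in[i] + in[i - offset] : in[i];
        std::swap(in, out);
    }
    return in;   // after the final swap, `in` holds the completed scan
}

int main() {
    std::vector<int> v = {3, 1, 7, 0, 4, 1, 6, 3};
    std::vector<int> scan = hillis_steele_scan(v);
    for (int x : scan) printf("%d ", x);   // 3 4 11 11 15 16 22 25
    printf("\n");
    return 0;
}
```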
We start the CUDA section with the test program generated by Visual Studio; its purpose is to make sure that CUDA works. This program is under the VectorAdd directory, where we provide the serial code in serial.cpp, the OpenMP-parallelized code in parallel_omp.cpp, and finally the GPU-parallel code in parallel_cuda.cu.
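The GPU version boils down to a pattern like the following sketch (kernel and variable names are illustrative, not necessarily those used in parallel_cuda.cu):

```cpp
// Illustrative CUDA vector addition: copy in, launch kernel, copy out.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard against the last, partially filled block
        c[i] = a[i] + b[i];
}

int main() {
    const int N = 1 << 20;
    size_t bytes = N * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < N; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (N + threads - 1) / threads;   // enough blocks to cover N elements
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, N);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[12345] = %f\n", h_c[12345]);      // expect 3 * 12345
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```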
In the OpenMP section, we had a matrix_add.cpp program in which the elements of two matrices A and B are summed and stored in a result matrix C. The CUDA section has the same program in matrix_add.cu, where we do the same computation, except that the matrices are very large and the parallelization is done on the GPU.
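The main difference from the vector-add example is the use of a 2D grid of 2D blocks, so that each thread maps to one matrix element. A compact sketch (matrix sizes and names are assumptions, not the exact matrix_add.cu):

```cpp
// Illustrative CUDA matrix addition with a 2D grid of 2D blocks.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void matAdd(const float *A, const float *B, float *C, int rows, int cols) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
        C[r * cols + c] = A[r * cols + c] + B[r * cols + c];
}

int main() {
    const int rows = 4096, cols = 4096;
    size_t bytes = (size_t)rows * cols * sizeof(float);

    float *h_A = (float *)malloc(bytes), *h_B = (float *)malloc(bytes), *h_C = (float *)malloc(bytes);
    for (size_t i = 0; i < (size_t)rows * cols; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                                   // 256 threads per block
    dim3 grid((cols + block.x - 1) / block.x,             // enough blocks to cover
              (rows + block.y - 1) / block.y);            // the whole matrix
    matAdd<<<grid, block>>>(d_A, d_B, d_C, rows, cols);
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %f\n", h_C[0]);                        // expect 3.0
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```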
Under the MatrixMult section, we parallelize the matrix multiplication operation. The serial program is in serial.cpp. First, we have a naive parallelization in single_block.cu, where we use only a single block of GPU threads to multiply the two matrices. The number of threads in a single block is limited to 1024, so for matrices with more than 1024 elements we should either use the tiling method or use multiple blocks. It is also inefficient to use a single block of threads while multiple blocks are sitting idle. Therefore, the multiple_blocks.cu program performs the operation using multiple blocks (without tiling). We combine the tiling method with multiple blocks in multiple_blocks_tiling.cu.
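A sketch of what the combined approach looks like, where each block computes one TILE x TILE patch of C and stages the needed sub-tiles of A and B in shared memory (the tile size, names, and the assumption that n is a multiple of TILE are illustrative, not the exact multiple_blocks_tiling.cu):

```cpp
// Illustrative shared-memory tiled matrix multiplication with multiple blocks.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk across the tiles of A's row and B's column.
    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of the current A-tile and B-tile into
        // shared memory, so the block reuses each loaded value TILE times.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 1024;                     // assumed to be a multiple of TILE
    size_t bytes = (size_t)n * n * sizeof(float);

    float *h_A = (float *)malloc(bytes), *h_B = (float *)malloc(bytes), *h_C = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { h_A[i] = 1.0f; h_B[i] = 1.0f; }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    matMulTiled<<<grid, block>>>(d_A, d_B, d_C, n);
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %f (expected %d)\n", h_C[0], n);   // all-ones inputs give n
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```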
Also, an example of the matrix multiplication operation using the cuBLAS library is given in matmul_cuBLAS.cu.
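Calling cuBLAS for the same operation looks roughly like this sketch (single precision, square matrices, column-major storage as cuBLAS expects; the sizes and names are assumptions, not the exact matmul_cuBLAS.cu, and the file must be linked with -lcublas):

```cpp
// Illustrative cuBLAS SGEMM call computing C = A * B.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 512;
    size_t bytes = (size_t)n * n * sizeof(float);

    float *h_A = (float *)malloc(bytes), *h_B = (float *)malloc(bytes), *h_C = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { h_A[i] = 1.0f; h_B[i] = 1.0f; }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes); cudaMalloc(&d_B, bytes); cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C; cuBLAS assumes column-major storage.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected %d)\n", h_C[0], n);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```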
Another section is dedicated to streaming multiple kernels and the concept of pageable and pinned memory. For information about pageable and pinned memory, please visit the following useful link. Also, for streaming, see link. We use the Histogram problem as the example for this section. In the histogram problem, the goal is to count the number of occurrences of the different elements in a large array. In pageable.cu we use pageable memory allocated with the malloc() function. In pinned.cu, the memory is allocated using cudaMallocHost() and is page-locked, which makes transfers between host and device memory faster. In order to use streaming and concurrent kernels, the host memory must be pinned. In streaming.cu we run multiple kernels in parallel using page-locked memory.
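A sketch of the streaming pattern, using pinned host memory plus one stream per chunk so that copies and kernels from different chunks can overlap (the bin count, chunk split, and names are assumptions, not the repository's exact streaming.cu):

```cpp
// Illustrative pinned-memory + multi-stream histogram.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // 256 bins, one per byte value
}

int main() {
    const int N = 1 << 24, NSTREAMS = 4, CHUNK = N / NSTREAMS;

    unsigned char *h_data;
    cudaMallocHost((void **)&h_data, N);        // pinned (page-locked) host memory
    for (int i = 0; i < N; ++i) h_data[i] = i % 256;

    unsigned char *d_data;
    unsigned int *d_bins;
    cudaMalloc((void **)&d_data, N);
    cudaMalloc((void **)&d_bins, 256 * sizeof(unsigned int));
    cudaMemset(d_bins, 0, 256 * sizeof(unsigned int));

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < NSTREAMS; ++s) {
        int offset = s * CHUNK;
        // Async copies require pinned host memory; each stream handles one chunk,
        // so the copy of one chunk can overlap with the kernel of another.
        cudaMemcpyAsync(d_data + offset, h_data + offset, CHUNK,
                        cudaMemcpyHostToDevice, streams[s]);
        histogram<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d_data + offset, CHUNK, d_bins);
    }
    cudaDeviceSynchronize();

    unsigned int h_bins[256];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
    printf("bin[0] = %u (expected %d)\n", h_bins[0], N / 256);

    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data); cudaFree(d_bins);
    return 0;
}
```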
The Determinant directory contains the serial and parallel code for calculating the determinant of large matrices. The method we used is LU decomposition, a simple and fast method for determinant calculation. Sample matrices for testing the code are given in data_in.zip; you should extract this archive into the same directory before running the code. The results will be stored in a data_out directory with file names similar to the input files, except that outputs_ is prepended to the name of each input file. For the parallelization part, reading from and writing to files are done in parallel using OpenMP, while the determinant calculation is done on the GPU.
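For reference, the mathematical idea is that after the decomposition the determinant is the product of the diagonal of U, with a sign flip for every row swap. A compact serial C++ sketch of that idea (Gaussian elimination with partial pivoting, purely illustrative and not the repository's GPU implementation):

```cpp
// Illustrative LU-based determinant: eliminate, track row swaps, multiply the diagonal.
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

double determinant_lu(std::vector<std::vector<double>> a) {
    const int n = (int)a.size();
    double det = 1.0;
    for (int k = 0; k < n; ++k) {
        // Partial pivoting: bring the largest remaining element into a[k][k].
        int pivot = k;
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(a[i][k]) > std::fabs(a[pivot][k])) pivot = i;
        if (a[pivot][k] == 0.0) return 0.0;                 // singular matrix
        if (pivot != k) { std::swap(a[pivot], a[k]); det = -det; }

        det *= a[k][k];                                     // diagonal of U
        for (int i = k + 1; i < n; ++i) {
            double factor = a[i][k] / a[k][k];
            for (int j = k; j < n; ++j)
                a[i][j] -= factor * a[k][j];
        }
    }
    return det;
}

int main() {
    std::vector<std::vector<double>> m = {{4, 3}, {6, 3}};
    printf("det = %f\n", determinant_lu(m));   // expect -6
    return 0;
}
```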