Parallel Programming on CPU and GPU

This repository contains sample code for learning how to program with OpenMP and CUDA. Below, we go through each directory and explain the purpose of each file. We first focus on CPU parallelization using OpenMP, then parallelize large computations on the GPU using CUDA. In the end, a sample project for calculating the determinant of large matrices is developed using both OpenMP and CUDA.

OpenMP

In the OpenMP section, there is a sample program, parallel_for_loop.cpp, which, as the name suggests, demonstrates simple for-loop parallelization.
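
As a reference point, a minimal OpenMP parallel for-loop looks roughly like the sketch below; the loop bound and printed message are illustrative and not taken from parallel_for_loop.cpp. Compile with an OpenMP flag such as -fopenmp (GCC/Clang) or /openmp (MSVC).

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    const int n = 16;
    // Iterations are divided among the available threads.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        printf("iteration %d handled by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}
```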

In the matrix_add.cpp code, we have three 2D matrices, A, B, and C, and we want to compute C = A + B. We do this in two ways: i) row-wise parallelization using a single parallel for-loop, and ii) parallelization of the nested for-loops using the collapse() clause. The elements of A and B are initialized randomly between 0 and 100. The schedule({static, dynamic}, chunk_size) clause specifies how iterations are assigned to threads: with a static schedule, chunks of chunk_size consecutive iterations are assigned to threads up front; with a dynamic schedule, each thread grabs the next available chunk at run time as it becomes free.
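
The following is a minimal sketch of the two variants described above; the matrix size, chunk sizes, and schedules are illustrative, not the values used in matrix_add.cpp.

```cpp
#include <cstdlib>
#include <vector>

int main() {
    const int n = 1000;
    std::vector<std::vector<int>> A(n, std::vector<int>(n)),
                                  B(n, std::vector<int>(n)),
                                  C(n, std::vector<int>(n));

    // Random initialization in [0, 100].
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            A[i][j] = rand() % 101;
            B[i][j] = rand() % 101;
        }

    // i) Row-wise: only the outer loop is split across threads.
    #pragma omp parallel for schedule(static, 10)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[i][j] = A[i][j] + B[i][j];

    // ii) collapse(2): both loops are fused into one iteration space.
    #pragma omp parallel for collapse(2) schedule(dynamic, 100)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            C[i][j] = A[i][j] + B[i][j];

    return 0;
}
```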

In the Mandelbrot section, we parallelize generation of the Mandelbrot set. The serial.cpp file contains the serial code for generating the set, and we parallelize it with OpenMP in several ways. The first step is to parallelize the main for-loop, as done in For.cpp (see the sketch below). The second technique is multi-tasking, which spawns multiple tasks to share the work. One major factor that slows the program down is writing the results to file; the pipeline technique (as in Pipeline.cpp) lets us overlap i) set generation and ii) writing to file.
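
A minimal sketch of the parallel-for approach is shown below; the image size, coordinate mapping, and iteration limit are illustrative assumptions, not the values used in For.cpp.

```cpp
#include <complex>
#include <vector>

// Escape-time iteration count for one point c of the complex plane.
int mandelbrot(std::complex<double> c, int max_iter = 1000) {
    std::complex<double> z = 0;
    int iter = 0;
    while (std::abs(z) <= 2.0 && iter < max_iter) {
        z = z * z + c;
        ++iter;
    }
    return iter;
}

int main() {
    const int width = 1024, height = 1024;
    std::vector<int> image(width * height);

    // Every pixel is independent, so rows can be distributed across threads;
    // dynamic scheduling helps because rows take unequal time to compute.
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            std::complex<double> c(-2.0 + 3.0 * x / width,
                                   -1.5 + 3.0 * y / height);
            image[y * width + x] = mandelbrot(c);
        }
    return 0;
}
```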

In the PrefixSum section, we parallelize the prefix-sum problem. serial.cpp contains the serial code. A naive parallelization runs into a race condition, and dealing with it can erase much of the benefit of parallelism; you can see a first parallel version in parallel_simple.cpp. Another approach is the Hillis and Steele algorithm (sketched below). Although this algorithm can be slower than the serial version on a CPU, it is fast when implemented and run on a GPU.
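
The following is a minimal sketch of the Hillis and Steele inclusive scan, written here with OpenMP for readability; the function name and sample data are illustrative.

```cpp
#include <vector>

// Hillis-Steele inclusive scan: log2(n) passes, each pass fully parallel.
std::vector<int> hillis_steele_scan(std::vector<int> x) {
    const int n = static_cast<int>(x.size());
    std::vector<int> tmp(n);
    for (int offset = 1; offset < n; offset *= 2) {
        // Within one pass every element is updated independently,
        // which is why this formulation maps well onto GPU threads.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            tmp[i] = (i >= offset) ? x[i] + x[i - offset] : x[i];
        x.swap(tmp);
    }
    return x;
}

int main() {
    std::vector<int> data = {3, 1, 7, 0, 4, 1, 6, 3};
    std::vector<int> prefix = hillis_steele_scan(data);
    // prefix is now {3, 4, 11, 11, 15, 16, 22, 25}
    return 0;
}
```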

CUDA

We start the CUDA section with the test program that Visual Studio generates for a new CUDA project; its purpose is simply to verify that CUDA works. This program lives in the VectorAdd directory, where we provide the serial code in serial.cpp, the OpenMP-parallelized code in parallel_omp.cpp, and finally the GPU-parallel code in parallel_cuda.cu.
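
For orientation, a minimal CUDA vector addition looks roughly like the sketch below; it uses managed (unified) memory to stay short, whereas parallel_cuda.cu may use explicit cudaMalloc/cudaMemcpy instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: the global index is built from block and thread IDs.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // managed memory, accessible from host and device
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);    // expected 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```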

In the OpenMP section, we had a matrix_add.cpp program in which the elements of two matrices A and B are summed and stored in a result matrix C. The CUDA section has the same program in matrix_add.cu, where we perform the same computation, except that the matrices are very large and the parallelization is done on the GPU.
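
A natural layout for this kind of kernel is a 2D grid of 2D thread blocks, one thread per element; the sketch below illustrates the indexing, with matrix dimensions and block shape chosen arbitrarily rather than copied from matrix_add.cu.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per matrix element, using a 2D grid of 2D thread blocks.
__global__ void matrix_add(const float *A, const float *B, float *C,
                           int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols) {
        int idx = row * cols + col;   // row-major layout
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    const int rows = 4096, cols = 4096;
    size_t bytes = (size_t)rows * cols * sizeof(float);

    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (size_t i = 0; i < (size_t)rows * cols; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(16, 16);
    dim3 grid((cols + block.x - 1) / block.x, (rows + block.y - 1) / block.y);
    matrix_add<<<grid, block>>>(A, B, C, rows, cols);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);   // expected 3.0
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```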

Under the MatrixMult directory, we parallelize the matrix multiplication operation. The serial program is in serial.cpp. First, we have a naive parallelization in single_block.cu, which uses only a single block of GPU threads to multiply the two matrices. The number of threads in a single block is limited to 1024, so for matrices with more than 1024 elements we must either use the tiling method or use multiple blocks; it is also inefficient to use a single block while the other blocks sit idle. Therefore, multiple_blocks.cu performs the operation using multiple blocks (without tiling), and multiple_blocks_tiling.cu combines the tiling method with multiple blocks (the tiling idea is sketched below). Also, an example of matrix multiplication using the cuBLAS library is given in matmul_cuBLAS.cu.
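
The sketch below shows the shared-memory tiling idea for square n x n matrices; the tile size, matrix size, and use of managed memory are illustrative assumptions and may differ from multiple_blocks_tiling.cu.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16

// Shared-memory tiling: each block computes a TILE x TILE patch of C = A * B,
// loading matching tiles of A and B into shared memory one step at a time.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float tileA[TILE][TILE];
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros so edge blocks stay correct.
        tileA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();
    }

    if (row < n && col < n)
        C[row * n + col] = sum;
}

int main() {
    const int n = 1024;
    size_t bytes = (size_t)n * n * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();
    printf("C[0] = %f\n", C[0]);   // expected n = 1024
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```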

Another section is dedicated to streaming multiple kernels and to the concept of pageable and pinned memory. For information about pageable and pinned memory, please visit the following useful link; also, for streaming, see link. We use the histogram problem as the example for this section: the goal is to count the occurrences of the different elements in a large array. In pageable.cu we use pageable host memory allocated with the malloc() function. In pinned.cu, the memory is allocated with cudaMallocHost() and is page-locked, which makes transfers between host and device memory faster. In order to use streaming and concurrent kernels, the host memory must be pinned; in streaming.cu we run multiple kernels in parallel using page-locked memory (the pattern is sketched below).
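
A minimal sketch of the pinned-memory streaming pattern follows; the kernel body is a placeholder rather than the actual histogram computation, and the array size, chunking, and stream count are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void process_chunk(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // placeholder for the real per-element work
}

int main() {
    const int n = 1 << 22, n_streams = 4, chunk = n / n_streams;

    int *h_in, *h_out;
    // Page-locked host memory: required for cudaMemcpyAsync to truly overlap
    // transfers with kernel execution.
    cudaMallocHost(&h_in, n * sizeof(int));
    cudaMallocHost(&h_out, n * sizeof(int));
    for (int i = 0; i < n; ++i) h_in[i] = i % 256;

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));

    cudaStream_t streams[n_streams];
    for (int s = 0; s < n_streams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel, and copies back,
    // so transfers in one stream can overlap with compute in another.
    for (int s = 0; s < n_streams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(d_in + offset, h_in + offset, chunk * sizeof(int),
                        cudaMemcpyHostToDevice, streams[s]);
        process_chunk<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
            d_in + offset, d_out + offset, chunk);
        cudaMemcpyAsync(h_out + offset, d_out + offset, chunk * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < n_streams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```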

Sample Project (Determinant)

The Determinant directory contains the serial and parallel code for calculating the determinant of large matrices. The method we use is LU decomposition, a simple and fast way to compute a determinant (sketched below). Sample matrices for testing the code are given in data_in.zip; extract this archive into the same directory before running the code. The results are stored in a data_out directory, with file names matching the input files except that outputs_ is prepended to each name.
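
As a reminder of why LU decomposition gives the determinant: once A = LU with a unit diagonal in L, det(A) is simply the product of the diagonal of U. The serial sketch below illustrates this; it omits pivoting for brevity (so it assumes no zero pivots are encountered), and the repository's code may handle that differently.

```cpp
#include <vector>

// Determinant via LU decomposition (Doolittle-style elimination, no pivoting):
// after A = L * U, det(A) is the product of the diagonal entries of U.
double determinant_lu(std::vector<std::vector<double>> a) {
    const int n = static_cast<int>(a.size());
    for (int k = 0; k < n; ++k) {
        // Eliminate column k below the diagonal.
        for (int i = k + 1; i < n; ++i) {
            double factor = a[i][k] / a[k][k];
            for (int j = k; j < n; ++j)
                a[i][j] -= factor * a[k][j];
        }
    }
    double det = 1.0;
    for (int i = 0; i < n; ++i)
        det *= a[i][i];   // diagonal of U
    return det;
}

int main() {
    std::vector<std::vector<double>> m = {{4, 3}, {6, 3}};
    double d = determinant_lu(m);   // 4*3 - 3*6 = -6
    return 0;
}
```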

For the parallelization, reading from and writing to files is done in parallel using OpenMP, while the determinant calculation itself is performed on the GPU.