This repository contains the software which has been written by myself during my master thesis in physics at the Institute of Computer Engineering of the University of Heidelberg (ZITI).
This software embodies the static part of project Mekong, which intends to simplify multi GPU programming by simulating a one GPU environment to the programmer. For a detailed introduction to the project please read my master thesis contained in the top level directory of this repository.
- CUDA 7.5 Toolkit
- LLVM 3.9.0. Even better with git hash 314a1b6403bc971a9d68d7eed3a0ce1359d03656
- Clang compiler of version 3.9.0. Even better with git hash f847bad18bcc46b5b47d32274a9d53a50209e555
- libclc to enable Clang's OpenCL frontend: http://libclc.llvm.org
- Polly library of version 3.9. Even better with git hash 48731f9f119b03d7dc8b73aa7dd24720c4859645
- ISL library of version 0.16.
Building the software consists of three parts: (1) building the LLVM custom passes, (2) building the test case and its LLVM IR, and (3) building Mekong's runtime library.
In the top level directory do:
mkdir build
cd build
cmake ..
make
If this does not work for you, because e.g. you did not install the
excact listed requirements, you might want to take a look into the
CMakeLists.txt
file located in the top level directory.
Three test cases are located in ./samples
, which you can copy into
./test_orig
to speed up your working progress with the scripts
given in ./sh
. Each test case consists of a host code written in CUDA
Driver API, and a device code written in OpenCL. If the test case is
used as a single GPU application, the device code must be compiled to
the PTX format deploying Clang's OpenCL frontend and its PTX backend.
The PTX device code can be included in the host code at application
runtime with NVIDIA's Just-In-Time (JIT) PTX compiler. You can take a
look into the code of the test cases to understand how this can be
implemented. Once you copied the test case into ./test_orig
you can
compile it with the provided Makefile if you have set your environment
variables:
CLC_INC
: include directory of libclc intallationCLC_LIB
:$CLC_LIB/built_libs/
must containnvptx64--nvidiacl.bc
CUDA_LIB
: must containlibcuda.so
CUDA_INC
: must containcuda.h
Afterwards do
cd ./test_orig/
mkdir llvm ptx bin
make
Before building the runtime library you can adjust your configuration
in ./CONFIG.txt
. Moreover the information of the analysis pass is
linked statically into the runtime library, which means you have to run
the analysis pass first, before compiling the runtime library. You
should build the runtime library like:
mkdir ./runtime/build
cd ./runtime/build
cmake ..
make
I recommend to test the functionality of the software on your system
with one of the provided test cases in ./samples
. For further
development of the software I suggest to use the bash scripts
contained in ./sh
:
bsp_analysis.sh
executes the device code analysis pass of./test_orig/llvm/test_device.ll
. The result will be saved in a file withdashdb
format (see./dashdb
for the database implementation). The content file is located in./bsp_analysis/dbs
.bsp_transform.sh
executes the device code transformation and outputs into./test_trans/llvm
and./test_trans/ptx
. The device code transformation requires the database file produced by the analysis pass.host_transform.sh
executes the host code transformation pass, which basically consists of function call replacements und function declaration insertions (seehost_transform/src
for the host code pass implementation)trans_and_link_on.sh
can be used to send the transferred device and host code withscp
onto a remote machine. With the also transferred runtime library the object files will be linked against the runtime library and the CUDA library on the remote machine to produce a executable. You have to adust the paths within this script to fit to your installations.
After you got the scripts running you should be able to produce a running binary.
You can build a doxygen documentation of most parts of the software. To
understand the runtime library I recommend the wrapLaunchKernel
function in ./runtime/src/mekong-wrapping.cc
as a starting point.
This function replaces the cuLaunchKernel
function after the host
transformation pass has been run.
The key performance measurements are contained in the master thesis.
All benchmarks, which have been done, are located in ipython notebook
files in directory ./measurements
. To understand the measured times
in detail you should read the master thesis, or at least take a look at
the bare measured data:
- Matrix Multiplication: https://gist.github.com/codecircuit/d8d58eab0a0a591b5c625b0b5eae7459
- Stencil Code: https://gist.github.com/codecircuit/d1067257e453f2500be68c24c06799ac
- N Body Code: https://gist.github.com/codecircuit/ea60dccff3ee2643fcddfdf889ce7ac7
The software is in a experimental state and can be used for a few test cases. It represents a modular written base for Mekong's static part. If you want to introduce a new test case, you should read chapter Usage in the master thesis.
The provided software is published under the terms of the GNU General Public License.