Miniball Cuda

CUDA port of the Miniball code. The original Miniball code was written by Bernd Gaertner and can be downloaded on the ETHZ Website. The algorithm is described in his Paper.

Note that I did not parallelize the Miniball algorithm itself. Instead this implementation provides Miniball as a device function. This way many balls can be computed at the same time. Depending on your system you will only see an improvement when the number of independent spheres gets large (>1000). There may also be limitations in terms of flexibility and memory usage with this modified version (see usage).

There is a usage example in example.cu together with a minimal CMakeList file to build it. There is also a performance comparison available between CPU and GPU in timing_test.cu. Note however that the modified code is slightly slower when run on the CPU then the original implementation.

building

include MiniballCuda.h in your code
enable the cuda compiler option "--expt-relaxed-constexpr"
compile as usual

usage

Usage is similar to the original Miniball implementation. However you can call it from host and device code. This way every cuda thread can calculate one sphere. That requires the dimension and maximum number of points per sphere to be known at compile time. They are passed to the algorithm as template parameters.

For more information see the example code in example.cu.

There is also an option (template parameter) to switch back to the recursive implementation. That can be faster but leads to a stack overflow when using many points per sphere (GPU's have limited stack).

changes from original Miniball code

The following changes were made to the original miniball code to allow it to run with cuda.

make dimension a compile time constant,
that allows to remove dynamic memory allocation
make max amount of points per sphere a compile time constant
add list class that works without dynamic memory
add iterative implementation for mtf_mp iterative
remove internal timing code

note on performance

The algorithm is not very well suited for the GPU and my implementation far from optimal. On my GPU it can only archive about 18% compute utilization (measured with nsight compute). If you fork this repository and put some time in optimizing the code I would be very interested in the results.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE.txt		LICENSE.txt
MiniballCuda.hpp		MiniballCuda.hpp
dummy.cpp		dummy.cpp
example.cu		example.cu
readme.md		readme.md
timing_test.cu		timing_test.cu

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Miniball Cuda

building

usage

changes from original Miniball code

note on performance

About

Releases

Packages

Languages

License

hschwane/MiniballCuda

Folders and files

Latest commit

History

Repository files navigation

Miniball Cuda

building

usage

changes from original Miniball code

note on performance

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages