Release Header-only runtime API wrappers, split NVTX wrappers, etc. · eyalroz/cuda-api-wrappers

Main changes since 0.3.3:

The runtime API wrappers are now a header-only library.
Split the NVTX wrappers and the Runtime API wrappers into two separate libraries.
Added several fundamental types which were implicit in previous versions: cuda::size_t, cuda::dimensionality_t.

Minor API tweaks:

Renamed launch -> enqueue_launch
Can now schedule managed memory region attachment on streams
Now wrapping cudaMemAdvise() advice.
Array copying uses typed pointers
Added: A cuda::managed::device_side_pointer_for() standalone function
Added: A container facade for the sequence of all devices, so you can now write for (auto device : cuda::devices() ) { }.
De-templatized: device setter RAII class
Added: a freestanding cuda::synchronize() function instead of some wrapper methods
Moved some type definitions from inside device_t to the device:: namespace
Added: A subclass of memory::region_t for managed memory
Using memory::region_t in more API functions
Dropped cuda::kernel::maximum_dynamic_shared_memory_per_block().
Centralized the definitions of take_ownership and do_not_take_ownership
Made stream_t& parameters into const stream_t&, almost universally.
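
To illustrate the "container facade" idea behind cuda::devices(), here is a minimal, self-contained sketch of a range-for-compatible facade over device indices. It is not the library's implementation: the sketch namespace, the device struct, devices_facade and sum_of_device_ids are all hypothetical names, and the device count is passed in as a parameter rather than queried from the CUDA runtime.

```cpp
#include <cstddef>
#include <iterator>

namespace sketch {

// Hypothetical stand-in for a device wrapper; it only remembers its index.
struct device {
    int id;
};

// A facade over the half-open index range [0, count). It stores no devices;
// each `device` value is created on demand while iterating.
class devices_facade {
public:
    class iterator {
    public:
        using iterator_category = std::input_iterator_tag;
        using value_type        = device;
        using difference_type   = std::ptrdiff_t;
        using pointer           = const device*;
        using reference         = device;

        explicit iterator(int index) : index_(index) {}

        device operator*() const { return device{index_}; }
        iterator& operator++() { ++index_; return *this; }
        bool operator==(const iterator& other) const { return index_ == other.index_; }
        bool operator!=(const iterator& other) const { return index_ != other.index_; }

    private:
        int index_;
    };

    explicit devices_facade(int count) : count_(count) {}

    iterator begin() const { return iterator(0); }
    iterator end() const { return iterator(count_); }
    int size() const { return count_; }

private:
    int count_;
};

// Assumption: a real facade would obtain the count from the CUDA runtime.
// Here the caller supplies it, so the sketch runs without a GPU.
inline devices_facade devices(int count) { return devices_facade(count); }

// Example use of the facade in a range-based for loop.
inline int sum_of_device_ids(int count) {
    int sum = 0;
    for (auto d : devices(count)) {
        sum += d.id;
    }
    return sum;
}

} // namespace sketch
```

Because the facade only holds a count, constructing it is cheap and the loop body receives each device by value, which matches the for (auto device : ...) form quoted above.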

Bug fixes:

Build, compatibility, usability:

Fix support for CMake 3.8 (CMakeLists.txt was using some post-3.8 features)
Clang-related:
- Skipping examples which clang++ doesn't support yet (need
- Only enabling separable compilation and CUDA
- const-cast'ing const void * kernel function pointers before reinterpretation - clang won't allow it otherwise
- GNU extension dropped when compiling examples with CUDA (clang doesn't support this)
- Fixed std::max() call issue
CMake targets depending on the wrappers should now have a C++11 language standard requirement for compilation
The wrappers now assert that C++11 or later is used, instead of letting compilation fail at some arbitrary point.
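
A language-standard check of this kind is commonly written with the preprocessor, so that it also works on compilers too old to understand static_assert. The following is a minimal sketch of that general technique, not the wrappers' actual code: the SKETCH_CPLUSPLUS macro and the sketch::language_standard() helper are hypothetical, and the _MSVC_LANG branch is there on the assumption that MSVC may report an outdated __cplusplus value unless a specific compiler switch is used.

```cpp
// Pick the macro that reports the language standard actually in effect.
#if defined(_MSVC_LANG)
#define SKETCH_CPLUSPLUS _MSVC_LANG
#else
#define SKETCH_CPLUSPLUS __cplusplus
#endif

// Fail early, with a readable message, when the standard is older than C++11.
#if SKETCH_CPLUSPLUS < 201103L
#error "This header requires compiling with C++11 or a later language standard"
#endif

namespace sketch {

// Hypothetical helper exposing the detected standard, e.g. for diagnostics.
inline long language_standard() { return SKETCH_CPLUSPLUS; }

} // namespace sketch
```

Placing such a check at the top of a header means the user sees one clear error rather than a cascade of unrelated syntax errors further down.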
