Releases: eyalroz/cuda-api-wrappers
Version 0.4.6: Minor bug fixes, wrap() fully supported
(v0.4.5 was discarded due to an invalid version string; this is essentially the same as v0.4.5 but with the version string fixed.)
Changes since v0.4.4:
API changes
- #298: The `wrap()` methods - which take raw CUDA handles for events, devices, streams etc. and wrap them in, well, the library's wrapper objects (as opposed to otherwise getting/creating wrapper objects directly, with no raw handles) - are now out of the `detail_::` namespace and part of the library's proper API.
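With `wrap()` now public, code that receives a raw CUDA handle from elsewhere (a third-party library, a callback, legacy code) can adopt it into a wrapper object without reaching into `detail_::`. A minimal sketch - the exact `wrap()` parameter list and the ownership tag are assumptions for illustration, not taken from these notes:

```cpp
#include <cuda/api_wrappers.hpp> // header name assumed for the 0.4.x line
#include <cuda_runtime_api.h>

int main()
{
    // A raw stream handle, e.g. one handed to us by another library
    cudaStream_t raw_stream;
    cudaStreamCreate(&raw_stream);

    auto device = cuda::device::current::get();

    // Adopt the raw handle into a stream_t wrapper; with ownership taken,
    // the wrapper destroys the underlying stream on destruction
    // (signature assumed: device id, raw handle, ownership flag)
    auto stream = cuda::stream::wrap(device.id(), raw_stream, cuda::stream::take_ownership);
    stream.synchronize();
}
```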
Bug fixes
- #300: Was hiding some CUDA 11 stream-related features due to faulty runtime API version check.
- #299: Now correctly copying stream properties.
- #296: (Probably) fixed a Win64-to-Win32 cross-build compilation issue with callback function signatures.
Note: Users' help is kindly requested in preparing for the next major release, which will cover both the runtime and the driver API, and NVRTC as well. See this branch and contact me / open relevant issues.
Version 0.4.4: MSVC compilation cleanup
Changes since v0.4.3:
Bug fixes
- Device-properties-related functions using baked-in data corrected for some compute capabilities.
New functionality
- #284: Introduced a grid-and-block-dimensions structure, `grid::complete_dimensions_t`.
- Additional variants of `cuda::memory::set()`, so that you may use either regions or plain pointers. `device_t::global_memory_t` now has an `associated_device()` member.
- #223, #272: Support for CUDA 11.0 stream attributes.
- Added `device_t::supports_block_cooperation()`.
- Additional variants of `cuda::memory::copy()` for convenience.
- #292: Device-properties-related functions requiring baked-in data now support Ampere GPUs (CC 8.0, 8.6).
- #293: Some methods of `compute_architecture_t` are now available only for `compute_capability_t`, as it is no longer reasonable to rely on microarchitecture-generation default values (e.g. amount of shared memory per block, number of in-flight threads per multiprocessor, etc.).
Changes to existing functionality
- #280 Events and streams now have "handles" rather than "ids".
- Partial revamp of the CUDA array wrapper classes (e.g. no templatization).
- #258: Block "cooperativity" is now part of the launch configuration, so fewer launch variants are necessary.
- #250: Now offering `const` variants for both regular and mapped memory.
- #269: Renamed `cuda::device::resource_limit_t` -> `cuda::device::limit_t`.
- Support for GitHub workflows.
- #267: The NVTX library now depends on `CUDA::nvToolsExt` (which it should).
- #268: Now exporting the requirement for the CUDAToolkit dependency.
- `cuda::runtime_error` can now be constructed using an r-value string reference, not just a constant l-value reference.
- Removed some unnecessary explicit namespace specifications in `error.hpp`.
- Now using a uniform parameter name in allocation functions.
- Renamed: `array_t::associated_device()` -> `array_t::device()`.
- #285, #289: Now using the `wrap()` idiom for constructing `device_t`'s.
- #273: Added device-setter RAII objects to some asynchronous stream methods.
- Rework of (global-memory) symbol handling: no more `symbol_t` type; functionality moved from `cuda::memory::` to `cuda::symbol::`; and now willing to locate an argument of any type.
Build mechanism
- Avoid always re-determining CUDA architectures by minding the cache.
- Fixed the `CompileWithWarnings.cmake` module to pass the appropriate flags to the appropriate executables (NVCC front-end vs. actual compiler; MSVC vs. GCC/clang).
Other changes
- Multiple cosmetic changes to avoid MSVC compilation warnings, e.g. explicit narrowing casts.
- Example program changes, including utility headers.
- Added a modified version of the CUDA sample program `binaryPartitionCG`.
- Some internal changes to wrapper classes with no external interface change.
- NVTX exception `what()` message fix.
- #283: Some wrapper identification string generator functions in `detail_` subnamespaces.
This version is known to work with CUDA versions up to 11.5; pre-11.0 CUDA versions are supported, but not tested routinely.
Version 0.4.3: New features, compatibility changes
Changes since v0.4.2:
New functionality
- Support for working with CUDA symbols.
- Support for asynchronous memory allocation.
- Classes for all memory regions - both managed and regular, both constant and non-constant memory (we used to have some of these only).
Changes to existing functionality
- `launch_configuration_t` is now constexpr.
- Arguably better interface for the partially-existing managed memory region classes.
- Pervasive use of regions as parameters to API functions involving memory: copying, allocating, modifying attributes, etc.
- Renamed: `no_shared_memory` -> `no_dynamic_shared_memory`.
Other changes
- CMake-based build mechanism changes to rely on CMake 3.17 changes to CUDA support (no effect on the use of the library).
- Replaced the internal `detail` namespaces with `detail_`, for `libcu++` compatibility.
- Dropped the `FindCUDAAPIWrappers.cmake` module.
This version is known to work with CUDA versions up to 11.4 (but old CUDA versions are not routinely tested).
0.4.2: Bug fixes, compatibility improvements, range-for over devices
This is a minor release, with mostly bug fixes and compatibility improvements. Other than in its version number, it is identical to 0.4.1, which was retracted due to a version numbering issue.
Changes since 0.4:
- Can now access all devices as a range: `for(auto device : cuda::devices()) { /* etc. etc. */ }`.
- Wrapper classes (specifically, events and streams) now have non-owning copy constructors.
- A stream priority range is now its own class.
Bug fixes:
- Dropped invalid stream-priority-related constant.
- The device management test was getting the direction of priority ranges backwards.
- The `p2pBandwidthLatencyTest` example program was failing with cross-device event wait attempts, due to calling `wait()` and `record()` on the wrong stream.
- Removed a spurious template specifier in `device.hpp`.
- Can now construct `cuda::launch_configuration_t` from two integers with C++14 and later.
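Since `cuda::launch_configuration_t` can now (with C++14 and later) be built from two plain integers, a launch site can be written quite tersely. A hedged sketch - the kernel is a placeholder, and the exact shape of the `cuda::enqueue_launch()` call is an assumption based on names appearing elsewhere in these notes:

```cpp
#include <cuda/api_wrappers.hpp> // header name assumed

__global__ void scale(float* data, float factor)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] *= factor;
}

void scale_on_device(float* device_data)
{
    auto device = cuda::device::current::get();
    auto stream = device.default_stream();
    // Two plain integers: 128 blocks of 256 threads each
    cuda::launch_configuration_t config { 128, 256 };
    cuda::enqueue_launch(scale, stream, config, device_data, 2.0f);
    stream.synchronize();
}
```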
Build, compatibility, usability:
- CMake 3.18 and later no longer complain about the lack of a `CUDA_ARCHITECTURES` value.
- Should now be compatible with MSVC 16.8 on Windows.
Header-only runtime API wrappers, split NVTX wrappers, etc.
Main changes since 0.3.3:
- The runtime API wrappers are now a header-only library.
- Split the NVTX wrappers and the Runtime API wrappers into two separate libraries.
- Added several fundamental types which were implicit in previous versions: `cuda::size_t`, `cuda::dimensionality_t`.
Minor API tweaks:
- Renamed: `launch` -> `enqueue_launch`.
- Can now schedule managed memory region attachment on streams.
- Now wrapping `cudaMemAdvise()` advice.
- Array copying uses typed pointers.
- Added: a `cuda::managed::device_side_pointer_for()` standalone function.
- Added: a container facade for the sequence of all devices, so you can now write `for (auto device : cuda::devices()) { }`.
- De-templatized: the device setter RAII class.
- Added: a freestanding `cuda::synchronize()` function, instead of some wrapper methods.
- Moved some type definitions from inside `device_t` to the `device::` namespace.
- Added: a subclass of `memory::region_t` for managed memory.
- Using `memory::region_t` in more API functions.
- Dropped: `cuda::kernel::maximum_dynamic_shared_memory_per_block()`.
- Centralized the definitions of `take_ownership` and `do_not_take_ownership`.
- Made `stream_t&` parameters into `const stream_t&`, almost universally.
Bug fixes:
- Cross-device waiting on events
- Error message fixes
- Not assuming the `uintNN_t` types are in the default namespace.
Build, compatibility, usability:
- Fixed support for CMake 3.8 (`CMakeLists.txt` was using some post-3.8 features).
- Clang-related:
  - Skipping examples which clang++ doesn't support yet.
  - Only enabling separable compilation and CUDA
  - const-cast'ing `const void *` kernel function pointers before reinterpretation - clang won't let it.
  - GNU extensions dropped when compiling examples with CUDA (clang doesn't support these).
- Fixed a `std::max()` call issue.
- CMake targets depending on the wrappers should now have a C++11 language standard requirement for compilation
- The wrappers now assert C++11 or later is used, instead of letting you just fail somewhere.
De-templatization, no numeric handles etc.
This release includes both significant additions to the coverage by the wrappers, as well as major changes to the existing wrappers API.
Main changes since 0.2.0:
- Forget about numeric handles! The wrapper classes no longer take numeric handles as parameters in methods exposed to the user. You'll be dealing with `device_t`'s, `event_t`'s, `stream_t`'s etc. - not `device::id_t`, `device::stream_t` and `device::event_t`'s.
- Wrapper classes are no longer templated. On one hand, you no longer have to worry about the template argument of "do we assume the wrapper's device is the current one?"; on the other hand, every use of the wrapper will set the current device (even if it's already the right one). A lot of code was simplified or even removed thanks to this change.
- `device_function_t` is now named `kernel_t`, as only kernels are acceptable by the CUDA Runtime API calls mentioning "device functions". Also, a `kernel_t` is now a pair of (kernel, device), as the settings which can be made for a kernel are mostly/entirely device-specific.
- The examples' `CMakeLists.txt` has been split off from the main `CMakeLists.txt` and moved into a subdirectory, removing any dependencies it may have.
- Kernel launching now uses perfect forwarding of all parameters.
- The library is now almost completely header-only. The single exception to this rule is profiling-related code; if you don't use it, the library is header-only for you.
- Changed my email address in the code...
Main additions since 0.2.0:
- 2D and 3D Array support.
- 2D and 3D texture support.
- A single `set()` and `get()` for all memory spaces.
Plus a few bug fixes, and another example program from the CUDA samples.
Changes from 0.3.0:
- Fixed: Self-recursion in one of the memory allocation functions.
- Fixed: Added missing `inline` specifiers to some functions.
- Whitespace tweaks.
Initial versioned release
This repository has not really needed "releases" so far:
- We're gradually wrapping an API, with the underlying API changing occasionally - so breaking changes are made frequently.
- The master branch is always the most stable and rounded-out version of the code one can use.
However, with other code potentially starting to depend on this repository, and with the CMake scripts maturing somewhat (thanks goes to @codecircuit for the latter) - named/versioned releases start to make more sense, if only for referential convenience.
Of course, there's the question of a versioning scheme. If we go with semantic versioning, we're going to be switching major version numbers all the time.
For now, versions will be numbered as follows: `A.B.C` or `A.B.C-string`.
- `A` is the major version number. It will increase with major changes to the library's overall functionality relative to the previous major version. What counts as major? If a whole lot of your host-side code has to change for it to work, then the library change is major.
- `B` is the minor version number. It will increase with changes to the library's functionality - including its API; and unlike SemVer, this change is not necessarily an addition. The change may be rather big in terms of code, but not in terms of the fundamental use patterns.
- `C` is a "patch" version number. These changes are for bugfixes and minor tweaks. They often don't affect the API at all - but they might, in some small, subtle way.
Finally, why 0.2.0? Well, it's somewhat arbitrary; but the library's "core" functionality has been pretty stable for a while now, with quite a few users; so 0.1.0 feels a bit "premature", which this isn't. On the other hand, 1.0.0 would be too presumptuous, since:
- We don't have decent feature-test coverage of most of the library (though the examples cover a lot);
- We don't have full, nor even effectively-full, support of CUDA 9.x;
- We don't have good enough unit test coverage.
So 1.0.0 is a while off; enjoy 0.2.0 for now.