Releases: eyalroz/cuda-api-wrappers

Version 0.4.6: Minor bug fixes, wrap() fully supported

09 Mar 18:41

(v0.4.5 was discarded due to an invalid version string; this is essentially the same as v0.4.5 but with the version string fixed.)

Changes since v0.4.4:

API changes

  • #298: The wrap() methods, which take raw CUDA handles for events, devices, streams, etc. and wrap them in the library's wrapper objects (as opposed to getting or creating wrapper objects directly, without raw handles), have been moved out of the detail_:: namespace and are now part of the library's proper API.
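
As an illustration, wrapping a raw stream handle obtained elsewhere might look like the following sketch. The exact wrap() signature (in particular the ownership parameter and whether a device or context must be named) is an assumption and may differ between library versions:

```cuda
#include <cuda/runtime_api.hpp>

int main()
{
    // A raw handle obtained from code using the Runtime API directly
    cudaStream_t raw_stream;
    cudaStreamCreate(&raw_stream);

    auto device = cuda::device::current::get();

    // Previously detail_::wrap(); now part of the public API. The library
    // does not take ownership here, so the raw handle must still be
    // destroyed by whoever created it.
    auto stream = cuda::stream::wrap(device.id(), raw_stream, cuda::do_not_take_ownership);
    stream.synchronize();

    cudaStreamDestroy(raw_stream);
}
```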

Bug fixes

  • #300: Was hiding some CUDA 11 stream-related features due to faulty runtime API version check.
  • #299: Now correctly copying stream properties.
  • #296: (Probably) fixed a Win64-to-Win32 cross-build compilation issue with callback function signatures.

Note: Users' help is kindly requested in preparing for the next major release, which will cover the runtime API, the driver API, and NVRTC as well. See this branch and contact me / open relevant issues.

Version 0.4.4: MSVC compilation cleanup

23 Dec 22:53

Changes since v0.4.3:

Bug fixes

  • Device-properties-related functions using baked-in data corrected for some compute capabilities.

New functionality

  • #284 Introduced a grid-and-block-dimensions structure, grid::complete_dimensions_t
  • Additional variants of cuda::memory::set(), so that you may use either regions or plain pointers.
  • device_t::global_memory_t now has an associated_device() member.
  • #223, #272 Support for CUDA 11.0 stream attributes.
  • Added device_t::supports_block_cooperation().
  • Additional variants of cuda::memory::copy() for convenience.
  • #292: Device-properties-related functions requiring baked-in data now support Ampere GPUs (CC 8.0, 8.6).
  • #293: Some methods of compute_architecture_t are now available only for compute_capability_t, as it is no longer reasonable to rely on microarchitecture-generation-default values (e.g. amount of shared memory per block, number of in-flight threads per multiprocessor etc.)
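
For example, the region-based and plain-pointer variants of cuda::memory::set() might be used as in the following sketch; the region accessor names (start(), size()) are assumptions and may vary by version:

```cuda
#include <cuda/runtime_api.hpp>

int main()
{
    auto device = cuda::device::current::get();
    auto region = cuda::memory::device::allocate(device, 1024);

    // Region variant: the extent is implied by the region itself
    cuda::memory::set(region, 0);

    // Plain-pointer variant: pass an address and an explicit byte count
    cuda::memory::set(region.start(), 0, region.size());

    cuda::memory::device::free(region);
}
```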

Changes to existing functionality

  • #280 Events and streams now have "handles" rather than "ids".
  • Partial revamp of the CUDA array wrapper classes (e.g. no templatization).
  • #258 Block "cooperativity" is now part of the launch configuration, so fewer launch variants are necessary.
  • #250 Now offering const variants for both regular and mapped memory.
  • #269 Renamed cuda::device::resource_limit_t -> cuda::device::limit_t.
  • Support for GitHub workflows
  • #267 the NVTX library now depends on CUDA::nvToolsExt (which it should).
  • #268 Now exporting the requirement for the CUDAToolkit dependency.
  • cuda::runtime_error can now be constructed also using an r-value string reference, not just a constant l-value reference.
  • Removed some unnecessary explicit namespace specification in error.hpp.
  • Now using a uniform parameter name in allocation functions
  • Renamed: array_t::associated_device() -> array_t::device().
  • #285, #289 Now using the wrap() idiom for constructing device_t's
  • #273: Added device-setter RAII objects to some asynchronous stream methods.
  • Rework of (global-memory) symbol handling: No more symbol_t type; functionality moved from cuda::memory:: to cuda::symbol::; and symbols of any type can now be located.

Build mechanism

  • Avoid always re-determining CUDA architectures by minding the cache.
  • Fixed the CompileWithWarnings.cmake module to pass the appropriate flags to the appropriate executables (NVCC front-end vs. actual compiler, MSVC vs. GCC/clang)

Other changes

  • Multiple cosmetic changes to avoid MSVC compilation warnings, e.g. explicit narrowing casts.
  • Example program changes, including utility headers.
  • Added a modified version of the CUDA sample program binaryPartitionCG.
  • Some internal changes to wrapper classes with no external interface change.
  • NVTX exception what() message fix.
  • #283 : Some wrapper identification string generator functions in detail_ subnamespaces.

This version is known to work with CUDA versions up to 11.5; pre-11.0 CUDA versions are supported, but not tested routinely.

Version 0.4.3: New features, compatibility changes

20 Aug 07:23

Changes since v0.4.2:

New functionality

  • Support for working with CUDA symbols.
  • Support for asynchronous memory allocation.
  • Classes for all memory regions: both managed and regular, both constant and non-constant memory (previously only some of these existed).

Changes to existing functionality

  • launch_configuration_t is now constexpr.
  • Arguably better interface for the partially-existing managed memory region classes.
  • Pervasive use of regions as parameters to API functions involving memory: Copying, allocating, modifying attributes etc.
  • Renamed: no_shared_memory -> no_dynamic_shared_memory.

Other changes

  • CMake-based build mechanism changes to rely on CMake 3.17 changes to CUDA support (no effect on the use of the library).
  • Replaced the internal detail namespaces with detail_, for libcu++ compatibility.
  • Dropped the FindCUDAAPIWrappers.cmake module.

This version is known to work with CUDA versions up to 11.4 (but old CUDA versions are not routinely tested).

0.4.2: Bug fixes, compatibility improvements, range-for over devices

24 Feb 13:20

This is a minor release, with mostly bug fixes and compatibility improvements. Other than in its version number, it is identical to 0.4.1, which was retracted due to a version numbering issue.

Changes since 0.4:

  • Can now access all devices as a range: for(auto device : cuda::devices()) { /* etc. etc. */ }.
  • Wrapper classes (specifically, events and streams) now have non-owning copy constructors.
  • A stream priority range is now its own class.
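
The device range can be used as in the following sketch (assuming device_t exposes id() and name() accessors; those names may differ by version):

```cuda
#include <cuda/runtime_api.hpp>
#include <iostream>

int main()
{
    // cuda::devices() is a container facade over all CUDA devices,
    // so standard range-based iteration applies
    for (auto device : cuda::devices()) {
        std::cout << "Device " << device.id() << ": " << device.name() << '\n';
    }
}
```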

Bug fixes:

  • Dropped invalid stream-priority-related constant.
  • The device management test was getting the direction of priority ranges backwards.
  • The p2pBandwidthLatencyTest example program was failing with cross-device event wait attempts, due to calling wait() and record() on the wrong stream.
  • Removed a spurious template specifier in device.hpp
  • Can now construct cuda::launch_configuration_t from two integers with C++14 and later.

Build, compatibility, usability:

  • CMake 3.18 and later no longer complain about the lack of a CUDA_ARCHITECTURES value.
  • Should now be compatible with MSVC 16.8 on Windows.

Header-only runtime API wrappers, split NVTX wrappers, etc.

14 Oct 14:14

Main changes since 0.3.3:

  • The runtime API wrappers are now a header-only library.
  • Split the NVTX wrappers and the Runtime API wrappers into two separate libraries.
  • Added several fundamental types which were implicit in previous versions: cuda::size_t, cuda::dimensionality_t.

Minor API tweaks:

  • Renamed launch -> enqueue_launch
  • Can now schedule managed memory region attachment on streams
  • Now wrapping cudaMemAdvise() advice.
  • Array copying uses typed pointers
  • Added: A cuda::managed::device_side_pointer_for() standalone function
  • Added: A container facade for the sequence of all devices, so you can now write for (auto device : cuda::devices() ) { }.
  • De-templatized: device setter RAII class
  • Added: a freestanding cuda::synchronize() function instead of some wrapper methods
  • Moved some type definitions from inside device_t to the device:: namespace
  • Added: A subclass of memory::region_t for managed memory
  • Using memory::region_t in more API functions
  • Dropped cuda::kernel::maximum_dynamic_shared_memory_per_block().
  • Centralized the definitions of take_ownership and do_not_take_ownership
  • Made stream_t& parameters into const stream_t&, almost universally.
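
Putting a few of these tweaks together, a launch through the renamed API might look like the following sketch; the exact forms of make_launch_config() and of the device-memory allocation helper are assumptions that may not match every version:

```cuda
#include <cuda/runtime_api.hpp>

__global__ void scale(float* data, float factor, size_t n)
{
    auto i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { data[i] *= factor; }
}

int main()
{
    constexpr size_t n = 1024;
    auto device = cuda::device::current::get();
    auto buffer = cuda::memory::device::make_unique<float[]>(device, n);
    auto stream = device.create_stream(cuda::stream::async);

    // Formerly cuda::launch(); kernel arguments are forwarded as-is
    auto config = cuda::make_launch_config(cuda::grid::dimensions_t{4}, 256);
    cuda::enqueue_launch(scale, stream, config, buffer.get(), 2.0f, n);
    stream.synchronize();
}
```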

Bug fixes:

  • Cross-device waiting on events
  • Error message fixes
  • Not assuming the uintNN_t types are in the default namespace

Build, compatibility, usability:

  • Fix support for CMake 3.8 (CMakeLists.txt was using some post-3.8 features)
  • Clang-related:
    • Skipping examples which clang++ doesn't support yet
    • Only enabling separable compilation and CUDA
    • const-cast'ing const void * kernel function pointers before reinterpretation, since clang won't allow it otherwise
    • Dropped GNU extensions when compiling examples with CUDA (clang doesn't support them)
    • Fixed std::max() call issue
  • CMake targets depending on the wrappers should now have a C++11 language standard requirement for compilation
  • The wrappers now assert that C++11 or later is used, instead of letting compilation fail somewhere obscure.

De-templatization, no numeric handles etc.

20 Jul 19:49

This release includes both significant additions to the coverage by the wrappers, as well as major changes to the existing wrappers API.

Main changes since 0.2.0:

  • Forget about numeric handles! The wrapper classes no longer take numeric handles as parameters in methods exposed to the user. You'll be dealing with device_t's, event_t's, stream_t's, etc. - not device::id_t, stream::id_t, or event::id_t.
  • Wrapper classes are no longer templated. On one hand, you don't have to worry about the template argument of "do we assume the wrapper's device is the current one?"; on the other hand, every use of the wrapper will set the current device (even if it's already the right one). A lot of code was simplified or even removed thanks to this change.
  • device_function_t is now named kernel_t, as only kernels are acceptable by the CUDA Runtime API calls mentioning "device functions". Also, kernel_t's are now a pair of (kernel, device), as the settings which can be made for a kernel are mostly/entirely device-specific.
  • The examples CMakeLists.txt has been split off from the main CMakeLists.txt and moved into a subdirectory, removing any dependencies it may have.
  • Kernel launching now uses perfect forwarding of all parameters.
  • The library is now almost completely header-only. The single exception to this rule is profiling-related code. If you don't use it - the library is header-only for you.
  • Changed my email address in the code...
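
With numeric handles gone and the wrappers de-templatized, typical user code deals only in wrapper objects. A sketch (method names such as create_event() and record() are assumptions about the API of this version):

```cuda
#include <cuda/runtime_api.hpp>

int main()
{
    // No raw cudaStream_t / cudaEvent_t handles anywhere in user code
    auto device = cuda::device::current::get();
    auto stream = device.create_stream(cuda::stream::async);
    auto event  = device.create_event();

    event.record(stream);    // enqueue the event on the stream
    event.synchronize();     // block until the recorded point is reached
}
```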

Main additions since 0.2.0:

  • 2D and 3D Array support.
  • 2D and 3D texture support.
  • A single set() and get() for all memory spaces.

Plus a few bug fixes, and another example program from the CUDA samples.

Changes from 0.3.0:

  • Fixed: Self-recursion in one of the memory allocation functions.
  • Fixed: Added missing inline specifiers to some functions
  • White space tweaks

Initial versioned release

20 Jan 15:57
f983473

This repository has not really needed "releases" so far:

  • We're gradually wrapping an API, with the underlying API changing occasionally - so breaking changes are made frequently.
  • The master branch is always the most stable and rounded-out version of the code one can use.

However, with other code potentially starting to depend on this repository, and with the CMake scripts maturing somewhat (thanks goes to @codecircuit for the latter) - named/versioned releases start to make more sense, if only for referential convenience.

Of course, there's the question of a versioning scheme. If we go with semantic versioning, we're going to be switching major version numbers all the time.

For now, versions will be numbered as follows: A.B.C or A.B.C-string.

  • A is the major version number. It will increase with major changes to the library's overall functionality relative to the previous major version. What counts as major? If a whole lot of your host-side code has to change for it to work, then the library change is major.
  • B is the minor version number. It will increase with changes to the library's functionality, including its API; unlike SemVer, such a change is not necessarily an addition. The change may be rather big in terms of code, but not in terms of the fundamental use patterns.
  • C is a "patch" version number. These changes are for bugfixes and minor tweaks. They often don't affect the API at all - but they might in some small subtle way.

Finally, why 0.2.0? Well, it's somewhat arbitrary; but the library's "core" functionality has been pretty stable for a while now, with quite a few users; so 0.1.0 feels a bit "premature", which this isn't. On the other hand, 1.0.0 would be too presumptuous, since:

  • We don't have decent feature-test coverage of most of the library (though the examples cover a lot);
  • We don't have full, or even effectively full, support of CUDA 9.x;
  • We don't have good enough unit test coverage.

So 1.0.0 is a while off; enjoy 0.2.0 for now.