Releases: eyalroz/cuda-api-wrappers

Version 0.6.1: Minor bug fixes

07 Dec 22:50

Changes since v0.6:

Bug fixes

  • #442 Changed a no-longer-valid use of link::input_type_t in link.hpp which was triggering an error when building with C++17.
  • #438 Corrected the make_cuda_host_alloc_flags() function, which was bitwise-AND-ing instead of bitwise-OR-ing.
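The #438 fix concerns how boolean options are combined into a single flags word. The sketch below shows the correct, OR-based approach. It uses hypothetical names (the host_alloc_* constants and make_host_alloc_flags()) rather than the library's actual make_cuda_host_alloc_flags(). The constant values are chosen to match the CUDA runtime's cudaHostAlloc* flags, but no CUDA header is involved.

```cpp
// Hypothetical flag constants; values chosen to match cudaHostAllocDefault,
// cudaHostAllocPortable, cudaHostAllocMapped and cudaHostAllocWriteCombined,
// but no CUDA header is included here.
enum : unsigned {
    host_alloc_default        = 0x00,
    host_alloc_portable       = 0x01,
    host_alloc_mapped         = 0x02,
    host_alloc_write_combined = 0x04,
};

// Combines independent boolean options into one flags word. Each enabled
// option contributes its own bit via bitwise OR. Using AND instead would
// always yield 0, because the individual flag values share no bits.
constexpr unsigned make_host_alloc_flags(bool portable, bool mapped, bool write_combined) noexcept
{
    return host_alloc_default
        | (portable       ? host_alloc_portable       : 0u)
        | (mapped         ? host_alloc_mapped         : 0u)
        | (write_combined ? host_alloc_write_combined : 0u);
}
```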

Other changes

  • #441 kernel_t::context() now uses wrap() and is noexcept
  • #436, #437 Now respecting the CUDA_NO_HALF preprocessor define: when it is defined, the library neither includes the CUDA half-precision header nor defines any half-precision-related code.

Version 0.6: PTX compilation library support

08 Oct 17:15

Changes since v0.5.6:

PTX Compilation library

This version introduces a single major change:

Note: The CUDA driver already supports compilation of PTX code, but it has limited support for various compilation options. It also requires the driver to be loaded, i.e. it requires kernel involvement and a GPU on your system. This library does not.

Value-vs-reference issues

  • #430 : Now passing kernel-like objects by reference rather than by value where relevant in the kernel launch wrapper functions.
  • #433 : Now passing program name by value rather than by reference.
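To illustrate what #430 avoids, here is a sketch with a hypothetical copy-counting type standing in for a kernel wrapper. It is not the library's launch code. It only demonstrates that a by-value parameter copies its argument on every call, while a const-reference parameter does not.

```cpp
#include <cstddef>

// Hypothetical stand-in for a kernel-like wrapper object; it counts how many
// times it has been copied, so the cost of pass-by-value becomes visible.
struct kernel_like {
    static std::size_t copies;
    kernel_like() = default;
    kernel_like(const kernel_like&) { ++copies; }
    kernel_like& operator=(const kernel_like&) = default;
};
std::size_t kernel_like::copies = 0;

// Passing by value copy-constructs the parameter on every call...
template <typename Kernel>
void launch_by_value(Kernel) { }

// ...while passing by const reference binds to the caller's object.
template <typename Kernel>
void launch_by_reference(const Kernel&) { }
```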

Other changes

  • #431 : The NVTX wrappers no longer depend on a thread support library
  • #436 : The wrapper library now respects CUDA_NO_HALF, for when you want to prevent CUDA from defining the half type
  • #432 : Replaced some std:: namespace qualifications, which had recently snuck into the codebase, with ::std:: (plain std:: causes trouble with NVIDIA's cuda::std namespace).
  • #435 : Updated static data tables for the Ampere/Lovelace (8.x) and Hopper architectures.
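The trouble mentioned in #432 comes from name lookup. Inside namespace cuda, std:: is looked up starting from the enclosing namespace, so it finds cuda::std before the global ::std. The following is a self-contained sketch. The cuda::std::string below is a deliberately fake stand-in, not NVIDIA's actual declaration.

```cpp
#include <string>
#include <type_traits>

namespace cuda {

// Hypothetical stand-in for an entity in NVIDIA's cuda::std namespace;
// it is NOT the real declaration.
namespace std {
struct string { };
} // namespace std

// Unqualified std:: is looked up from namespace cuda outwards,
// so it finds cuda::std first.
using unqualified_string = std::string;   // refers to cuda::std::string
// A leading :: anchors the lookup at the global namespace.
using qualified_string   = ::std::string; // refers to the standard library's std::string

} // namespace cuda
```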

Version 0.5.6: Compatibility and partial-inclusion fixes

08 Oct 17:01

Changes since v0.5.5:

New functionality

  • #423: Added an implementation of the surface and texture reference getters for modules (these return raw references, not corresponding wrapper classes, which this library does not currently offer)

C++14-and-later compatibility fixes

  • #415: Resolved incompatibility of std::optional/std::experimental::optional with the internal poor_mans_optional
  • #416: Corrected placement of inclusion of std::experimental::optional
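One way to reconcile a home-grown optional with the standard one, as #415 requires, is to alias a single name to std::optional whenever C++17 is available and to the fallback otherwise. This is a minimal sketch under that assumption. The library's actual resolution may differ, and std::experimental::optional is not considered here. Both example::optional and this poor_mans_optional are hypothetical.

```cpp
#include <stdexcept>
#include <utility>

#if __cplusplus >= 201703L
#include <optional>
#endif

namespace example {

// Hypothetical minimal fallback for pre-C++17 builds. Simplification: unlike
// std::optional, it requires T to be default-constructible.
template <typename T>
class poor_mans_optional {
public:
    poor_mans_optional() = default;
    poor_mans_optional(T value) : has_value_(true), value_(std::move(value)) { }
    bool has_value() const noexcept { return has_value_; }
    explicit operator bool() const noexcept { return has_value_; }
    const T& value() const
    {
        if (!has_value_) { throw std::logic_error("empty optional"); }
        return value_;
    }
    T value_or(T fallback) const { return has_value_ ? value_ : fallback; }
private:
    bool has_value_ = false;
    T value_{};
};

// Use the standard type when it exists, so code mixing library and user
// optionals sees one consistent type.
#if __cplusplus >= 201703L
template <typename T> using optional = std::optional<T>;
#else
template <typename T> using optional = poor_mans_optional<T>;
#endif

} // namespace example
```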

Other changes

  • #428, #429 : Minor fixes and tweaks to CUDA array code (via the cuda::array_t class template)
  • #427, #406 : Stream and Event wrapper class instances are now non-copyable (you need to either move them or pass references/pointers to them)
  • #425, #426: Error and exception handling improvements (with a slight performance benefit)
  • #424 : Link options now passed by const-reference, not by value
  • #411: Added the :: prefix to occurrences of std:: (which snuck in again in recent versions; these can potentially clash with NVIDIA's standard library constructs)
  • #413: Added missing intra-library #include directives which were masked when including all APIs, but not when including individual headers. Also, removed inappropriate inline decorators from declaration-only lines
  • #420: Internal renaming
  • #417, #417: Internal placement of functionality in header files (files in cuda/api/ vs in cuda/api/multi_wrapper_impls).
  • #412: bandwidthtest now includes <iostream> on its own
  • #409: Moved pci_id_impl.hpp into the detail/ subfolder (and renamed it)
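Making stream and event wrappers non-copyable (#427, #406) is the usual RAII treatment for objects that own a handle. Here is a sketch with a hypothetical stream_like class that owns a placeholder int handle. No CUDA calls are made.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical RAII wrapper owning a handle; the int handle is a placeholder,
// not a CUDA stream handle. -1 marks "owns nothing".
class stream_like {
public:
    explicit stream_like(int handle) noexcept : handle_(handle) { }
    ~stream_like() { /* a real wrapper would destroy handle_ here if owned */ }

    // Copying would leave two objects believing they own the same handle.
    stream_like(const stream_like&) = delete;
    stream_like& operator=(const stream_like&) = delete;

    // Moving transfers ownership and leaves the source empty.
    stream_like(stream_like&& other) noexcept
        : handle_(std::exchange(other.handle_, -1)) { }
    stream_like& operator=(stream_like&& other) noexcept
    {
        std::swap(handle_, other.handle_);
        return *this;
    }

    int handle() const noexcept { return handle_; }

private:
    int handle_;
};
```

Code that previously copied such objects would instead pass them by reference, or move them when ownership should change hands.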

Version 0.5.5: Minor changes

10 Sep 18:43

Changes since v0.5.4:

Run-time compilation functionality

  • #397 : The NVRTC compilation options class now supports passing extra options to PTXAS, and also supports --dopt
  • #403 : The program builder class can now accept named header additions using std::string objects for the name and/or header source (rather than only C-style const char* strings).

Bug fixes

  • #396 : scoped_existence_ensurer_t, the gadget for ensuring there is some current context (regardless of which) will now make sure the driver has been initialized.
  • #395 : Can now start profiling with our nvtx component even if the driver has not yet been initialized.

Other changes

  • #400 : Added an alias for waiting/synchronizing on an event: You can now execute cuda::wait(my_event), not just cuda::synchronize(my_event).
  • #399 : time_elapsed_between() can now accept a std::pair of events.
  • #398 : Added another example program, the CUDA sample bandwidthtest
  • #401 : Made all stream enqueuing methods const (so you can now enqueue on a stream passed by const-reference).
  • #404 : Can now construct grid::overall_dimensions_t from a dim3 object, so that they're more interoperable with CUDA-related values you obtained elsewhere.
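The #399 change amounts to a convenience overload that unpacks a pair and forwards to the two-argument form. Here is a sketch using a hypothetical event_like type that merely stores a timestamp. Real CUDA events are recorded on streams, and the library's actual return type may well be a duration type rather than a float.

```cpp
#include <utility>

// Hypothetical event type holding a recorded timestamp in milliseconds;
// recording and querying real CUDA events is not modeled.
struct event_like {
    float recorded_at_ms;
};

// Two-argument form: elapsed time from start to end.
inline float time_elapsed_between(const event_like& start, const event_like& end)
{
    return end.recorded_at_ms - start.recorded_at_ms;
}

// Convenience overload: accept the two events as a pair and forward.
inline float time_elapsed_between(const std::pair<event_like, event_like>& events)
{
    return time_elapsed_between(events.first, events.second);
}
```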

v0.5.4: Minor build issue fixes

19 Aug 19:06

Changes since v0.5.3:

Build-related fixes

  • #392 Made the NVTX and NVRTC wrappers usable in multiple translation units within the same executable
  • #393 Made the NVTX dependency on libdl (on Linux) explicit

Other changes

  • #394 Avoiding redundant cuInit() call when getting a device's name

v0.5.3: Asynch memory ops, NVRTC compilation improvements

26 Jul 07:55

Changes since v0.5.2:

Runtime program compilation (NVRTC) improvements

  • #379: Can get the compilation log, PTX, cubin or NVVM in a user-provided rather than self-allocated buffer
  • #388: A builder interface for NVRTC programs
  • #386: Add support for nvrtcGetSupportedArchs()
  • #375: Support adding arbitrary options when dynamically compiling a CUDA program
  • #265: Support for diag-suppress/error/warn compilation options
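For readers unfamiliar with the builder idea in #388, here is a minimal sketch with hypothetical program_builder and program_description types. It is not the library's interface. A real implementation would eventually pass the collected name, source and headers to nvrtcCreateProgram(). That function takes the source, the program name, and parallel arrays of header sources and include names.

```cpp
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical description of a program to be compiled at run time.
struct program_description {
    std::string name;
    std::string source;
    std::vector<std::pair<std::string, std::string>> headers; // (include name, header source)
};

// Hypothetical builder: each setter returns *this so calls can be chained;
// build() checks that the required pieces were supplied.
class program_builder {
public:
    program_builder& name(std::string n) { desc_.name = std::move(n); return *this; }
    program_builder& source(std::string s) { desc_.source = std::move(s); return *this; }
    program_builder& add_header(std::string header_name, std::string header_source)
    {
        desc_.headers.emplace_back(std::move(header_name), std::move(header_source));
        return *this;
    }
    program_description build() const
    {
        if (desc_.name.empty() || desc_.source.empty()) {
            throw std::invalid_argument("program name and source are required");
        }
        return desc_;
    }
private:
    program_description desc_;
};
```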

Runtime-compilation-related Bug fixes

  • #391: Fix for a CUDA 10.0 support regression
  • #384: Make nvrtc depend on runtime-and-driver
  • #376: When rendering compilation options to a string, we get an extra space
  • #378: Compilation log vector contains trailing '\0'
  • #387: nvrtc.h included in wrong file
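Two of these fixes concern string handling. Here is a sketch of both, with hypothetical helper names. The first joins options with single separators and no stray leading or trailing space (#376). The second drops the terminating '\0' that is counted in the log size NVRTC reports, so it does not end up inside a std::string (#378).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Joins options with exactly one space between consecutive entries and no
// separator before the first or after the last.
inline std::string render_options(const std::vector<std::string>& options)
{
    std::string rendered;
    for (std::size_t i = 0; i < options.size(); ++i) {
        if (i > 0) { rendered += ' '; }
        rendered += options[i];
    }
    return rendered;
}

// A log buffer sized by NVRTC's reported log size includes the trailing
// '\0'; strip such terminators before exposing the log as a std::string.
inline std::string log_to_string(const std::vector<char>& raw_log)
{
    std::string log(raw_log.begin(), raw_log.end());
    while (!log.empty() && log.back() == '\0') { log.pop_back(); }
    return log;
}
```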

Other changes

  • #390: Avoiding a memory leak when getting a CUDA device's name
  • #248: Support asynchronous memory allocation (in v0.5.2 we only had allocation, no freeing)

Caveats

Continuous build testing on Windows is failing on GitHub Actions due to trouble with CMake detecting the NVTX path. Assistance from users on this matter would be appreciated.

Version 0.5.2: Windows compatibility, less redundant API calls

18 Jun 21:08

Changes since v0.5.1:

Full MS Windows support is restored in this version (AFAICT). Also worked out some kinks and polished a few interfaces.

Bug fixes

  • #330, #369, #372 Corrected some launch_config_builder logic bugs.
  • #368 Fixed an accidental primary context deactivation in p2pBandwidthLatencyTest
  • #360 Was missing an implementation of context_t::create_event()
  • #357 All assignment operators updated to appropriately handle primary context reference unit propagation
  • #351 Fixed a typo in Windows-target-only code
  • #335 Redundant 0x in error messages
  • #329 marshalled_options.hpp errors with C++17
  • #324 marshalled_options.hpp needs cuda::span, but doesn't see it
  • #325 nvrtc/compilation_options.hpp needs to know about device_t

Windows compatibility

  • #345 Avoid non-portable assumptions regarding thread handles in vectorAdd_profiled
  • #344 Workaround for an MSVC SFINAE error with std::iterator_traits<Iter>
  • #343 std::experimental::filesystem not properly supported on Windows
  • #342 Don't try to use mkstemp on Windows
  • #341 Avoid size_t <-> unsigned overload clash on Windows
  • #340 Apply the CUDA_CB decoration to shared memory size-determiner function - it's actually necessary on Windows
  • #339 Avoid some MSVC compiler warnings
  • #338 Added missing inclusions to have Windows NT HANDLE defined
  • #337 Support for MSVC's standard-incompliant __cplusplus value
  • #347 Using ::std:: rather than std::, to avoid clashes with NVIDIA's libcu++, which is included by default by CUDA 11.7's nvcc.
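Regarding #337: MSVC reports __cplusplus as 199711L unless the /Zc:__cplusplus switch is used, while _MSVC_LANG holds the actual language version. Here is a sketch of a portable check, using a hypothetical EXAMPLE_CPLUSPLUS macro. The library's own macro names and approach may differ.

```cpp
// Hypothetical macro: the effective C++ language version on any compiler.
// On MSVC, _MSVC_LANG reflects the selected /std: level even when
// __cplusplus is stuck at 199711L.
#if defined(_MSVC_LANG)
#define EXAMPLE_CPLUSPLUS _MSVC_LANG
#else
#define EXAMPLE_CPLUSPLUS __cplusplus
#endif

// Feature checks should then compare against the effective version.
#if EXAMPLE_CPLUSPLUS >= 201703L
#define EXAMPLE_HAS_CPP17 1
#else
#define EXAMPLE_HAS_CPP17 0
#endif

constexpr long effective_cplusplus() { return EXAMPLE_CPLUSPLUS; }
```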

Interface tweaks

RTC compilation options

  • #364 marshal() and render() are now stand-alone functions.
  • #363 Can now render compilation options to an ::std::string (in case you want to save/print them)
  • #362 Add a clear_language_dialect() to rtc::compilation_options_t
  • #361 If an rtc::compilation_options_t is asked to set the language dialect to an empty or null string - unset it instead
  • #355 Support taking the C++ language dialect as an ::std::string, not just a C-style string.
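The behavior described in #355, #361 and #362 can be sketched with a hypothetical compilation_options class (not rtc::compilation_options_t itself). It stores the dialect and renders it as an NVRTC --std= option. Setting an empty or null dialect clears it instead of producing a bare --std=.

```cpp
#include <string>

// Hypothetical options class; an empty dialect_ means "unset".
class compilation_options {
public:
    compilation_options& set_language_dialect(const char* dialect)
    {
        // Null or empty input unsets the dialect rather than storing "".
        if (dialect == nullptr || dialect[0] == '\0') { return clear_language_dialect(); }
        dialect_ = dialect;
        return *this;
    }
    // Overload accepting a std::string, forwarding to the C-string version.
    compilation_options& set_language_dialect(const std::string& dialect)
    {
        return set_language_dialect(dialect.c_str());
    }
    compilation_options& clear_language_dialect() { dialect_.clear(); return *this; }

    // Renders the dialect option, or nothing at all if it is unset.
    std::string render() const { return dialect_.empty() ? std::string{} : "--std=" + dialect_; }
private:
    std::string dialect_;
};
```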

Other classes

  • #365 module::get_kernel() can now take an ::std::string
  • #359 Now exposing the interface for enqueuing kernels with type-erased arguments, passed via an array of void* (previously, you could only enqueue a kernel when passing its arguments with their parameter types).
  • #356 (Almost) all proxy classes are now move-assignable and move-constructible, but not copy-assignable or copy-constructible. Move them or use const references.
  • #358 link_t should have a device_id()
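Type-erased kernel arguments (#359) follow the convention of the driver's cuLaunchKernel(), whose kernelParams parameter is a void** pointing to one pointer per argument. Here is a sketch of building such an array, plus a hypothetical callee that knows the parameter types and reads them back. No actual launch takes place.

```cpp
#include <array>
#include <cstring>

// Packs a pointer to each argument into an array of void*, the shape that
// cuLaunchKernel's kernelParams expects. The arguments must outlive the array.
template <typename... Args>
std::array<void*, sizeof...(Args)> make_argument_pointers(Args&... args)
{
    return { { static_cast<void*>(&args)... } };
}

// Hypothetical callee that, like a compiled kernel, knows its parameter
// types and reads each argument back from its type-erased pointer.
inline float example_consume(void* const* params)
{
    int n;
    float scale;
    std::memcpy(&n, params[0], sizeof n);
    std::memcpy(&scale, params[1], sizeof scale);
    return static_cast<float>(n) * scale;
}
```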

Miscellaneous and internal issues

  • #367 Avoiding a redundant scoped context setting when enqueuing a kernel
  • #366 Spruced up CUDA_DEVICE_OR_THIS_SCOPE() and CUDA_CONTEXT_FOR_THIS_SCOPE()
  • #353 Added missing PCI function initializer to the PCI location wrappers class.
  • #352 Simplified the options marshalling code
  • #349 Prefix CMake options with CAW_, for use as a subproject (e.g. FetchContent)
  • #346 Fix CUDA installation in GitHub action scripts
  • #326 Drop redundant inclusions and make include order more "challenging" in vectorAdd examples
  • #328 Reduce gratuitous API calls in current_device::detail::set()
  • #331 Can now load a module from file into any context, not just the current context
  • #334 Reduce the number of redundant informative API calls
  • #333 Don't treat freeing in a destroyed context as an error
  • #303 Use CUDA_VERSION instead of CUDART_VERSION
  • #370 cuda::context::current::exists() now returns false, rather than throwing, if the CUDA driver has not been initialized
  • #373 In Debug builds, now validating launch configuration grid dimensions before enqueuing/launching anything (as CUDA tends to fail silently, e.g. for empty grids)
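The debug-only validation of #373 can be sketched as follows. A hypothetical dims3 struct replaces CUDA's dim3, and an NDEBUG check decides whether validation runs at all. The library's actual checks may cover more than zero dimensions.

```cpp
#include <stdexcept>

// Hypothetical dimension triple standing in for CUDA's dim3.
struct dims3 { unsigned x, y, z; };

// Throws if any dimension is zero, which would describe an empty grid or block.
inline void validate_launch_dimensions(dims3 grid, dims3 block)
{
    auto has_zero = [](dims3 d) { return d.x == 0 || d.y == 0 || d.z == 0; };
    if (has_zero(grid))  { throw std::invalid_argument("grid has a zero dimension"); }
    if (has_zero(block)) { throw std::invalid_argument("block has a zero dimension"); }
}

// Only pay for validation in debug builds, i.e. when NDEBUG is not defined.
inline void maybe_validate(dims3 grid, dims3 block)
{
#ifndef NDEBUG
    validate_launch_dimensions(grid, block);
#else
    (void) grid;
    (void) block;
#endif
}
```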

Caveats

Continuous build testing on Windows is failing on GitHub Actions due to trouble with CMake detecting the NVTX path. Assistance from users on this matter would be appreciated.

Version 0.5.1: Fully header-only, launch config builder

09 May 21:35

Changes since v0.5.0:

Build mechanism

  • #307 The library is now entirely header-only (the NVTX wrappers, which used to be compiled, are now all within headers).

New supported features

  • #308 Supporting both narrow/regular and wide character inputs for NVRTC compilation.
  • #309 Support for naming streams, devices and events with NVTX

Concepts/facilities introduced

Compatibility

  • #304 : Now compatible with all CUDA versions between 9.0 and 11.6

Bug fixes

  • #320 No longer getting an error message about module::create() when including only runtime_api.hpp.
  • #317 No longer "leaking" references to device primary contexts which made them never be destroyed after some point. Fixing this exposed a few other latent issues involving non-existence of primary contexts: #316.
  • #314 No longer failing to enqueue events when there is no current context.
  • #305, #306 :
    • Added missing named errors to cuda::status
    • Now using driver error codes wherever applicable (they only started to coincide with Runtime API error codes in a recent CUDA version)
    • Renamed mis-named error: cuda::status::not_ready -> cuda::status::async_operations_not_yet_completed.
  • #315 In one of the example programs, we were launching a kernel on the current device rather than the one the user had chosen.

Miscellaneous and internal issues

  • #310 NVTX wrapper now uses driver-API-style
  • #303 Using CUDA_VERSION instead of CUDART_VERSION where relevant.
  • #320 Added an example program only explicitly including runtime-API-related headers.
  • #321 Weakened the requirement on kernel parameter types from TriviallyCopyable to just being trivially copy-constructible.
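The distinction in #321 matters for types like the hypothetical kernel_params below. A user-provided copy assignment operator makes it not TriviallyCopyable, yet its copy constructor remains trivial. The static_asserts check exactly that. The name is_acceptable_kernel_parameter() is hypothetical, for a check in the spirit of the relaxed requirement.

```cpp
#include <type_traits>

// Hypothetical kernel-parameter struct with a user-provided copy assignment.
struct kernel_params {
    int n;
    float* data;
    kernel_params& operator=(const kernel_params& other)
    {
        n = other.n;
        data = other.data;
        return *this;
    }
};

static_assert(!std::is_trivially_copyable<kernel_params>::value,
              "a user-provided copy assignment makes the type non-trivially-copyable");
static_assert(std::is_trivially_copy_constructible<kernel_params>::value,
              "but its implicit copy constructor is still trivial");

// A check in the style of the weakened requirement (hypothetical name).
template <typename T>
constexpr bool is_acceptable_kernel_parameter()
{
    return std::is_trivially_copy_constructible<T>::value;
}
```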

Caveats

Windows support is partially broken in this version.

v0.5.0: Rewrite, Driver+Runtime API+NVRTC coverage

19 Feb 18:01

This is a near-complete under-the-hood rewrite of the API wrappers library, while maintaining its existing API almost entirely: The library now primarily relies on CUDA Driver API calls, with Runtime API calls used only where the driver does not straightforwardly provide the same functionality.

If you are only interested in the Runtime API, you may wish to use the latest 0.4.x release. At the moment, that is 0.4.7.

Fundamental feature set additions

  • #9 Driver API support
  • #228, #262 : NVRTC support

Wrapper classes introduced

  • Contexts: context_t.
  • Dynamically vs. statically compiled kernels: kernel_t and apriori_compiled_kernel_t
  • Device primary contexts: device::primary_context_t
  • link_t: Linking together compiled code to satisfy symbol definition requirements and complete executables.
  • link_options_t defining options for linking.
  • Virtual memory: physical_allocation_t, address_range_reservation_t and mapping_t between pairs of the former.
  • Modules: module_t, made up of compiled binary/PTX code - functions, global symbols etc - which may be loaded into contexts

and via NVRTC support:

  • Programs: rtc::program_t, made up of CUDA or PTX source code.
  • Compilation options, rtc::compilation_options_t defining options for compiling programs.

(All of the classes above are under the cuda:: namespace)

Concepts/facilities introduced

  • Treatment of the primary context as a context and its creation or destruction
  • The context stack
  • The current context
  • Waiting on the value of a scalar in global device memory
  • Access by specific contexts to specific contexts of peer devices
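The context stack and the current context are easier to picture with a toy model. In this sketch, contexts are plain ints held in a hypothetical thread-local stack. A scoped RAII helper pushes a context and pops it at scope exit, roughly as the driver's cuCtxPushCurrent() and cuCtxPopCurrent() do for real contexts.

```cpp
#include <stdexcept>
#include <vector>

namespace example {

// Hypothetical per-thread context stack; the current context is the top.
thread_local std::vector<int> context_stack;

inline bool current_context_exists() { return !context_stack.empty(); }

inline int current_context()
{
    if (context_stack.empty()) { throw std::logic_error("no current context"); }
    return context_stack.back();
}

// RAII helper: pushes a context on construction and pops it on destruction,
// so the previously-current context is restored when the scope ends.
class scoped_context_override {
public:
    explicit scoped_context_override(int context) { context_stack.push_back(context); }
    ~scoped_context_override() { context_stack.pop_back(); }
    scoped_context_override(const scoped_context_override&) = delete;
    scoped_context_override& operator=(const scoped_context_override&) = delete;
};

} // namespace example
```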

Caveats

Windows support is partially broken in this version.

Version 0.4.7: Minor changes

12 Mar 10:34

This version has very few changes relative to 0.4.6. These are:

Bug fixes

  • #301 : Now ensuring launch configurations can be assigned to each other.
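The notes do not say what prevented assignment in #301. One common cause is a const data member, which implicitly deletes the copy assignment operator. The hypothetical structs below illustrate this purely as an example.

```cpp
#include <type_traits>

// Hypothetical launch configuration with a const member: the implicitly
// declared copy assignment operator is defined as deleted.
struct launch_config_const_member {
    const unsigned grid_size;
    unsigned block_size;
};

// With plain (non-const) members the type stays copy-assignable.
struct launch_config {
    unsigned grid_size;
    unsigned block_size;
};
```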

Note: Users' help is kindly requested in preparing for the next major release, which will cover the runtime API, the driver API and NVRTC. See this branch and contact me / open relevant issues.