Releases: eyalroz/cuda-api-wrappers
Version 0.6.1: Minor bug fixes
Version 0.6: PTX compilation library support
Changes since v0.5.6:
PTX Compilation library
This version introduces a single major change:
- #385 : Support for NVIDIA's PTX compilation library.
Note: The CUDA driver already supports compilation of PTX code, but it has limited support for various compilation options; also, it requires a driver to be loaded, i.e. requires kernel involvement and a GPU on your system. This library does not.
Value-vs-reference issues
- #430 : Now passing kernel-like objects by reference rather than by value where relevant in the kernel launch wrapper functions.
- #433 : Now passing program name by value rather than by reference.
Other changes
- #431 : The NVTX wrappers no longer depend on a thread support library
- #436 : The wrapper library now respects `CUDA_NO_HALF`, when you want to avoid CUDA defining the `half` type
- #432 : Removed some `std::` rather than `::std::` namespace qualifications which had snuck into the codebase recently (which cause trouble with NVIDIA's `cuda::std` namespace).
- #435 : Updated static data tables for the Ampere/Lovelace (8.x) and Hopper architectures.
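The clash behind #432 can be reproduced without CUDA. The following is a minimal sketch, not code from this library: the nested `cuda::std` namespace and its `pair` are toy stand-ins for NVIDIA's library, and `wrappers` is a hypothetical namespace chosen for illustration.

```cpp
#include <type_traits>
#include <utility>

namespace cuda {

// Toy stand-in for NVIDIA's nested namespace; this `pair` is NOT the real one.
namespace std {
template <typename A, typename B>
struct pair { static constexpr bool from_cuda_std = true; };
} // namespace std

namespace wrappers { // hypothetical namespace for code living inside cuda::

// `std::` is looked up from the innermost scope outwards, so inside `cuda`
// it finds cuda::std before the global ::std.
using unqualified_pair = std::pair<int, int>;

// The leading `::` pins the lookup to the global namespace.
using qualified_pair = ::std::pair<int, int>;

} // namespace wrappers
} // namespace cuda

static_assert(cuda::wrappers::unqualified_pair::from_cuda_std,
    "unqualified std:: resolved to cuda::std");
static_assert(std::is_same<cuda::wrappers::qualified_pair, std::pair<int, int>>::value,
    "::std:: resolved to the standard library");
```

Both assertions hold at compile time, which is why a stray `std::` inside the `cuda` namespace can silently pick up the wrong type.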
Version 0.5.6: Compatibility and partial-inclusion fixes
Changes since v0.5.5:
New functionality
- #423: Add an implementation of the surface and texture reference getters for modules (getting raw references, not corresponding wrapper classes for these objects, which this library does not currently offer)
C++14-and-later compatibility fixes
- #415: Resolved incompatibility of `std::optional`/`std::experimental::optional` with the internal `poor_mans_optional`
- #416: Corrected placement of inclusion of `std::experimental::optional`
Other changes
- #428, #429 : Minor fixes and tweaks to CUDA array code (via the `cuda::array_t` class template)
- #427, #406 : Stream and Event wrapper class instances are now non-copyable (you need to either move them or pass references/pointers to them)
- #425, #426: Error and exception handling improvements (with a slight performance benefit)
- #424 : Link options now passed by const-reference, not by value
- #411: Add `::` prefix to occurrences of `std::` (which snuck in again in recent versions; these potentially clash with NVIDIA's standard library constructs)
- #413: Added missing intra-library `#include` directives which were masked when including all APIs, but not when including individual headers. Also, removed inappropriate `inline` decorators from declaration-only lines
- #420: Internal renaming
- #417, #417: Internal placement of functionality in header files (files in `cuda/api/` vs in `cuda/api/multi_wrapper_impls`).
- #412: `bandwidthtest` now includes `<iostream>` on its own
- #409: Moved `pci_id_impl.hpp` into the `detail/` subfolder (and renamed it)
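To illustrate what non-copyable wrapper instances (#427, #406) mean for calling code, here is a minimal sketch of a move-only class. `owning_wrapper`, `inspect()` and `take_over()` are hypothetical names, and the `int` handle merely stands in for a raw stream or event handle; none of this is the library's actual implementation.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a wrapper owning a raw handle (an int here).
class owning_wrapper {
public:
    explicit owning_wrapper(int raw_handle) noexcept : handle_(raw_handle) {}

    // Copying is disabled: two objects must never own the same handle.
    owning_wrapper(const owning_wrapper&) = delete;
    owning_wrapper& operator=(const owning_wrapper&) = delete;

    // Moving transfers ownership and leaves the source without a handle.
    owning_wrapper(owning_wrapper&& other) noexcept : handle_(other.handle_) {
        other.handle_ = no_handle;
    }
    owning_wrapper& operator=(owning_wrapper&& other) noexcept {
        if (this != &other) {
            handle_ = other.handle_; // a real wrapper would first release its own handle
            other.handle_ = no_handle;
        }
        return *this;
    }

    bool owns_handle() const noexcept { return handle_ != no_handle; }
    int raw_handle() const noexcept { return handle_; }

private:
    static constexpr int no_handle = -1;
    int handle_;
};

// Callers that only use the object take it by (const) reference...
inline int inspect(const owning_wrapper& wrapper) { return wrapper.raw_handle(); }

// ...and callers that take over ownership receive it by move.
inline owning_wrapper take_over(owning_wrapper&& wrapper) { return std::move(wrapper); }

inline void example_usage() {
    owning_wrapper first(7);
    assert(inspect(first) == 7);
    owning_wrapper second = take_over(std::move(first));
    assert(second.owns_handle() && !first.owns_handle());
}
```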
Version 0.5.5: Minor changes
Changes since v0.5.4:
Run-time compilation functionality
- #397 : The NVRTC compilation options class now supports passing extra options to PTXAS, and also supports `--dopt`
- #403 : The program builder class can now accept named header additions using `std::string`s for the name and/or header source (rather than only C-style `const char*` strings).
Bug fixes
- #396 : `scoped_existence_ensurer_t`, the gadget for ensuring there is some current context (regardless of which), will now make sure the driver has been initialized.
- #395 : Can now start profiling with our nvtx component even if the driver has not yet been initialized.
Other changes
- #400 : Added an alias for waiting/synchronizing on an event: You can now execute `cuda::wait(my_event)`, not just `cuda::synchronize(my_event)`.
- #399 : `time_elapsed_between()` can now accept `std::pair`s of events.
- #398 : Added another example program, the CUDA sample `bandwidthtest`
- #401 : Made all stream enqueuing methods `const` (so you can now enqueue on a stream passed by const-reference).
- #404 : Can now construct `grid::overall_dimensions_t` from a `dim3` object, so that they're more interoperable with CUDA-related values you obtained elsewhere.
v0.5.4: Minor build issue fixes
v0.5.3: Asynch memory ops, NVRTC compilation improvements
Changes since v0.5.2:
Runtime program compilation (NVRTC) improvements
- #379: Can get the compilation log, PTX, cubin or NVVM in a user-provided rather than self-allocated buffer
- #388: A builder interface for NVRTC programs
- #386: Add support for nvrtcGetSupportedArchs()
- #375: Support adding arbitrary options when dynamically compiling a CUDA program
- #265: Support for diag-suppress/error/warn compilation options
Runtime-compilation-related Bug fixes
- #391: Fix for a CUDA 10.0 support regression
- #384: Make nvrtc depend on runtime-and-driver
- #376: When rendering compilation options to a string, we get an extra space
- #378: Compilation log vector contains trailing '\0'
- #387: `nvrtc.h` included in wrong file
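The two string-handling fixes above (#376 and #378) concern easy-to-miss details. NVRTC reports the program log size including the trailing NUL, so a buffer of that size ends in `'\0'`. The sketch below shows one way to handle both cases; `render_options()` and `log_as_string()` are hypothetical helper names, not the library's functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Joins options with single spaces, emitting no leading or trailing separator.
inline std::string render_options(const std::vector<std::string>& options) {
    std::string rendered;
    bool first = true;
    for (const auto& option : options) {
        if (!first) { rendered += ' '; }
        rendered += option;
        first = false;
    }
    return rendered;
}

// Converts a raw log buffer to a string, dropping one trailing '\0' if present.
inline std::string log_as_string(const std::vector<char>& buffer) {
    std::size_t length = buffer.size();
    if (length > 0 && buffer[length - 1] == '\0') { --length; }
    return std::string(buffer.begin(), buffer.begin() + static_cast<std::ptrdiff_t>(length));
}

inline void example_usage() {
    assert(render_options({"--std=c++17", "-G"}) == "--std=c++17 -G");
    assert(log_as_string({'o', 'k', '\0'}) == "ok");
}
```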
Other changes
- #390: Avoiding a memory leak when getting a CUDA device's name
- #248: Support asynchronous memory allocation (in v0.5.2 we only had allocation, no freeing)
Caveats
Continuous build testing on Windows is failing on GitHub Actions due to trouble with CMake detecting the NVTX path. Assistance from users on this matter would be appreciated.
Version 0.5.2: Windows compatibility, less redundant API calls
Changes since v0.5.1:
Full MS Windows support is restored in this version (AFAICT). Also worked out some kinks and polished a few interfaces.
Bug fixes
- #330, #369, #372 Corrected some `launch_config_builder` logic bugs.
- #368 Fixed an accidental primary context deactivation in `p2pBandwidthLatencyTest`
- #360 Was missing an implementation of `context_t::create_event()`
- #357 All assignment operators updated to appropriately handle primary context reference unit propagation
- #351 Fixed a typo in Windows-target-only code
- #335 Redundant `0x` in error messages
- #329 `marshalled_options.hpp` errors with C++17
- #324 `marshalled_options.hpp` needs `cuda::span`, but doesn't see it
- #325 `nvrtc/compilation_options.hpp` needs to know about `device_t`
Windows compatibility
- #345 Avoid non-portable assumptions regarding thread handles in `vectorAdd_profiled`
- #344 Workaround for an MSVC SFINAE error with `std::iterator_traits<Iter>`
- #343 `std::experimental::filesystem` not properly supported on Windows
- #342 Don't try to use `mkstemp` on Windows
- #341 Avoid `size_t` <-> `unsigned` overload clash on Windows
- #340 Apply the CUDA_CB decoration to shared memory size-determiner function - it's actually necessary on Windows
- #339 Avoid some MSVC compiler warnings
- #338 Added missing inclusions to have Windows NT `HANDLE` defined
- #337 Support for MSVC's standard-incompliant `__cplusplus` value
- #347 Using `::std::` rather than `std::`, to avoid clashes with NVIDIA's libcustd - that is included by default by CUDA 11.7's nvcc.
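For context on #337: MSVC keeps `__cplusplus` at `199711L` unless `/Zc:__cplusplus` is passed, and reports the actual language level in `_MSVC_LANG`. A common workaround, sketched here with a hypothetical macro name (`EXAMPLE_CPLUSPLUS`) and not necessarily the one this library uses, is to pick whichever value is larger:

```cpp
// Use _MSVC_LANG where it exists and is more informative than __cplusplus.
#if defined(_MSVC_LANG) && _MSVC_LANG > __cplusplus
#define EXAMPLE_CPLUSPLUS _MSVC_LANG
#else
#define EXAMPLE_CPLUSPLUS __cplusplus
#endif

// Hypothetical feature switches keyed on the corrected value.
constexpr bool example_has_cpp14 = (EXAMPLE_CPLUSPLUS >= 201402L);
constexpr bool example_has_cpp17 = (EXAMPLE_CPLUSPLUS >= 201703L);

static_assert(EXAMPLE_CPLUSPLUS >= 201103L, "this sketch assumes at least C++11");
```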
Interface tweaks
RTC compilation options
- #364 `marshal()` and `render()` are now stand-alone functions.
- #363 Can now render compilation options to an `::std::string` (in case you want to save/print them)
- #362 Add a `clear_language_dialect()` to `rtc::compilation_options_t`
- #361 If an `rtc::compilation_options_t` is asked to set the language dialect to an empty or null string - unset it instead
- #355 Support taking the C++ language dialect as an `::std::string`, not just a C-style string.
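The dialect-related behavior in #355, #361 and #362 can be pictured with a small stand-in class. This is a sketch under assumptions: `example_options` is hypothetical and is not `rtc::compilation_options_t`; only the name `clear_language_dialect()` comes from the notes above, while `set_language_dialect()`, `has_language_dialect()` and the `--std=` rendering are illustrative choices.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in; not the library's rtc::compilation_options_t.
class example_options {
public:
    // An empty string unsets the dialect rather than storing "".
    example_options& set_language_dialect(const std::string& dialect) {
        dialect_ = dialect;
        return *this;
    }
    // A null pointer is treated like an empty string.
    example_options& set_language_dialect(const char* dialect) {
        return set_language_dialect(dialect ? std::string(dialect) : std::string());
    }
    example_options& clear_language_dialect() {
        dialect_.clear();
        return *this;
    }
    bool has_language_dialect() const { return !dialect_.empty(); }

    // Renders to a string, e.g. for printing or saving.
    std::string render() const {
        return has_language_dialect() ? "--std=" + dialect_ : std::string();
    }

private:
    std::string dialect_; // empty means "not set"
};

inline void example_usage() {
    example_options options;
    options.set_language_dialect("c++17");
    assert(options.render() == "--std=c++17");
    options.set_language_dialect(static_cast<const char*>(nullptr));
    assert(!options.has_language_dialect());
}
```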
Other classes
- #365 `module::get_kernel()` can now take an `::std::string`
- #359 Now exposing the interface for enqueuing kernels with type-erased arguments, passed via an array of `void*` (so far, you could only enqueue when you passed the parameter types).
- #356 (Almost) all proxy classes are now move-assignable and move-constructible, but not copy-assignable or copy-constructible. Move them or use const-refs.
- #358 `link_t` should have a `device_id()`
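The type-erased argument passing mentioned in #359 follows the convention of the driver's `cuLaunchKernel()`, which takes an array with one `void*` per kernel parameter, each pointing at the argument's value. The sketch below shows that marshalling on the host only; `marshal_arguments()` and `scaled_sum()` are hypothetical names, and `scaled_sum()` merely plays the role of the code that reads the slots back.

```cpp
#include <array>
#include <cassert>

// Collects the addresses of the arguments into an array of void*.
// The arguments must outlive the returned array.
template <typename... Ts>
std::array<void*, sizeof...(Ts)> marshal_arguments(Ts&... arguments) {
    return {{ static_cast<void*>(&arguments)... }};
}

// Hypothetical stand-in for a consumer that only sees void* slots and
// must know the real parameter types from elsewhere.
inline float scaled_sum(void** slots) {
    const int    count = *static_cast<int*>(slots[0]);
    const float  scale = *static_cast<float*>(slots[1]);
    const float* input = *static_cast<const float**>(slots[2]);
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) { sum += input[i]; }
    return scale * sum;
}

inline void example_usage() {
    int count = 3;
    float scale = 2.0f;
    float data[] = { 1.0f, 2.0f, 3.0f };
    const float* input = data;
    auto slots = marshal_arguments(count, scale, input);
    assert(slots.size() == 3);
    assert(scaled_sum(slots.data()) == 12.0f);
}
```

Since the types are erased, nothing checks that the slots match what the consumer expects; that safety is what the typed enqueue interface provides.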
Miscellaneous and internal issues
- #367 Avoiding a redundant scoped context setting when enqueuing a kernel
- #366 Spruced up `CUDA_DEVICE_FOR_THIS_SCOPE()` and `CUDA_CONTEXT_FOR_THIS_SCOPE()`
- #353 Added missing PCI function initializer to the PCI location wrappers class.
- #352 Simplified the options marshalling code
- #349 Prefix CMake options with `CAW_`, for use as a subproject (e.g. FetchContent)
- #346 Fix CUDA installation in GitHub action scripts
- #326 Drop redundant inclusions and make include order more "challenging" in `vectorAdd` examples
- #328 Reduce gratuitous API calls in `current_device::detail::set()`
- #331 Can now load a module from file into any context, not just the current context
- #334 Reduce the number of redundant informative API calls
- #333 Don't treat freeing in a destroyed context as an error
- #303 Use `CUDA_VERSION` instead of `CUDART_VERSION`
- #370 `cuda::context::current::exists()` now returns false, rather than throwing, if the CUDA driver has not been initialized
- #373 In Debug builds, now validating launch configuration grid dimensions before enqueueing/launching anything (as CUDA tends to fail silently, e.g. for empty grids)
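The debug-only validation described in #373 could look roughly like the following. This is a sketch under assumptions: the types and function names are hypothetical, and reporting the problem by throwing `std::invalid_argument` is an illustrative choice rather than the library's documented behavior.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins; not the library's own types.
struct example_dimensions { unsigned x, y, z; };
struct example_launch_config { example_dimensions grid, block; };

// A grid or block with a zero extent in any dimension contains no threads.
constexpr bool is_empty(example_dimensions d) {
    return d.x == 0 || d.y == 0 || d.z == 0;
}

inline void validate(const example_launch_config& config) {
    if (is_empty(config.grid))  { throw std::invalid_argument("empty grid"); }
    if (is_empty(config.block)) { throw std::invalid_argument("empty block"); }
}

// Validation is compiled in only for debug builds (NDEBUG not defined).
inline void enqueue_launch(const example_launch_config& config) {
#ifndef NDEBUG
    validate(config);
#endif
    (void) config; // a real implementation would enqueue the kernel here
}

inline bool rejects(const example_launch_config& config) {
    try { validate(config); }
    catch (const std::invalid_argument&) { return true; }
    return false;
}

inline void example_usage() {
    assert(!rejects(example_launch_config{{2, 1, 1}, {256, 1, 1}}));
    assert(rejects(example_launch_config{{0, 1, 1}, {256, 1, 1}}));
}
```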
Caveats
Continuous build testing on Windows is failing on GitHub Actions due to trouble with CMake detecting the NVTX path. Assistance from users on this matter would be appreciated.
Version 0.5.1: Fully header-only, launch config builder
Changes since v0.5.0:
Build mechanism
- #307 The library is now entirely header-only (the NVTX wrappers, which used to be compiled, are now all within headers).
New supported features
- #308 Supporting both narrow/regular and wide character inputs for NVRTC compilation.
- #309 Support for naming streams, devices and events with NVTX
Concepts/facilities introduced
- #311 A Builder-pattern class for building launch configurations more easily.
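To show the idea of the Builder pattern from #311 without depending on CUDA, here is a minimal one-dimensional sketch. `sketch_launch_config` and `sketch_config_builder` are hypothetical stand-ins, and their method names are illustrative assumptions rather than the library's documented interface.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in; not the library's launch configuration type.
struct sketch_launch_config {
    std::size_t grid_size;             // number of blocks
    std::size_t block_size;            // threads per block
    std::size_t dynamic_shared_memory; // bytes
};

class sketch_config_builder {
public:
    // Each setter returns the builder, so calls can be chained.
    sketch_config_builder& overall_size(std::size_t n) { overall_size_ = n; return *this; }
    sketch_config_builder& block_size(std::size_t n) { block_size_ = n; return *this; }
    sketch_config_builder& dynamic_shared_memory(std::size_t bytes) { shared_ = bytes; return *this; }

    sketch_launch_config build() const {
        if (overall_size_ == 0 || block_size_ == 0) {
            throw std::logic_error("overall size and block size must both be set");
        }
        // Round up so that the grid covers every element.
        const std::size_t grid_size = (overall_size_ + block_size_ - 1) / block_size_;
        return sketch_launch_config{ grid_size, block_size_, shared_ };
    }

private:
    std::size_t overall_size_ = 0;
    std::size_t block_size_ = 0;
    std::size_t shared_ = 0;
};

inline void example_usage() {
    const auto config = sketch_config_builder().overall_size(1000).block_size(256).build();
    assert(config.grid_size == 4); // 4 blocks of 256 threads cover 1000 elements
    assert(config.block_size == 256);
}
```

The benefit of the pattern is that the caller states what is known (overall size, block size) and the derived quantities are computed in one place.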
Compatibility
- #304 : Now compatible with all CUDA versions between 9.0 and 11.6
Bug fixes
- #320 No longer getting an error message about `module::create()` when including only `runtime_api.hpp`.
- #317 No longer "leaking" references to device primary contexts, which kept them from ever being destroyed after some point. Fixing this exposed a few other latent issues involving non-existence of primary contexts: #316.
- #314 No longer failing to enqueue events when there is no current context.
- #305, #306 :
  - Added missing named errors to `cuda::status`
  - Now using driver error codes wherever applicable (they only started to coincide with Runtime API error codes in a recent CUDA version)
  - Renamed mis-named error: `cuda::status::not_ready` -> `cuda::status::async_operations_not_yet_completed`.
- #315 In one of the example programs, we were launching a kernel on the current device rather than the one the user had chosen.
Miscellaneous and internal issues
- #310 NVTX wrapper now uses driver-API-style
- #303 Using `CUDA_VERSION` instead of `CUDART_VERSION` where relevant.
- #320 Added an example program only explicitly including runtime-API-related headers.
- #321 Weakened the requirement on kernel parameter types from TriviallyCopyable to just being trivially copy-constructible.
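The difference between the two requirements in #321 shows up with a type whose copy constructor is trivial but whose copy assignment is user-provided. The sketch below uses only standard type traits; `example_params` and `acceptable_as_kernel_parameter()` are hypothetical names, not part of the library.

```cpp
#include <type_traits>

// Hypothetical kernel parameter type with a user-provided copy assignment.
struct example_params {
    int   length;
    float scale;

    example_params(int length_, float scale_) : length(length_), scale(scale_) {}
    example_params(const example_params&) = default;         // trivial
    example_params& operator=(const example_params& other) { // user-provided, so non-trivial
        length = other.length;
        scale  = other.scale;
        return *this;
    }
};

// Hypothetical check mirroring the weakened requirement.
template <typename T>
constexpr bool acceptable_as_kernel_parameter() {
    return std::is_trivially_copy_constructible<T>::value;
}

static_assert(acceptable_as_kernel_parameter<example_params>(),
    "copy construction is trivial");
static_assert(!std::is_trivially_copyable<example_params>::value,
    "but the type is not TriviallyCopyable");
```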
Caveats
Windows support is partially broken in this version.
v0.5.0: Rewrite, Driver+Runtime API+NVRTC coverage
This is a near-complete under-the-hood rewrite of the API wrappers library, while maintaining its existing API almost entirely: The library now primarily relies on CUDA Driver API calls, with Runtime API calls used only where the driver does not straightforwardly provide the same functionality.
If you are only interested in the Runtime API, you may wish to use the latest 0.4.x release. At the moment, that is 0.4.7.
Fundamental feature set additions
Wrapper classes introduced
- Contexts: `context_t`.
- Dynamically vs. statically compiled kernels: `kernel_t` and `apriori_compiled_kernel_t`
- Device primary contexts: `device::primary_context_t`
- `link_t`: Linking together compiled code to satisfy symbol definition requirements and complete executables. `link_options_t` defining options for linking.
- Virtual memory: `physical_allocation_t`, `address_range_reservation_t` and `mapping_t` between pairs of the former.
- Modules: `module_t`, made up of compiled binary/PTX code - functions, global symbols etc - which may be loaded into contexts
and via NVRTC support:
- Programs: `rtc::program_t`, made up of CUDA or PTX source code.
- Compilation options: `rtc::compilation_options_t`, defining options for compiling programs.
(All of the classes above are under the `cuda::` namespace)
Concepts/facilities introduced
- Treatment of the primary context as a context and its creation or destruction
- The context stack
- The current context
- Waiting on the value of a scalar in global device memory
- Access by specific contexts to specific contexts of peer devices
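The context stack and the current context can be pictured with a small host-only model. The CUDA driver keeps a per-thread stack of contexts (manipulated with `cuCtxPushCurrent()` and `cuCtxPopCurrent()`), and the top of that stack is the current context. The sketch below imitates this with a `std::vector`; every name in it is hypothetical and it is not how the library is implemented.

```cpp
#include <cassert>
#include <vector>

using context_handle = int; // hypothetical stand-in for a driver context handle

// Hypothetical per-thread context stack; its top is the current context.
inline std::vector<context_handle>& context_stack() {
    thread_local std::vector<context_handle> stack;
    return stack;
}

inline bool current_context_exists() { return !context_stack().empty(); }

// Precondition: a current context exists.
inline context_handle current_context() { return context_stack().back(); }

// Makes a context current for the lifetime of the object,
// then restores whatever was current before.
class scoped_context_override {
public:
    explicit scoped_context_override(context_handle context) {
        context_stack().push_back(context);
    }
    ~scoped_context_override() { context_stack().pop_back(); }
    scoped_context_override(const scoped_context_override&) = delete;
    scoped_context_override& operator=(const scoped_context_override&) = delete;
};

inline void example_usage() {
    assert(!current_context_exists());
    {
        scoped_context_override outer(1);
        assert(current_context() == 1);
        {
            scoped_context_override inner(2);
            assert(current_context() == 2);
        }
        assert(current_context() == 1); // the previous context is current again
    }
    assert(!current_context_exists());
}
```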
Caveats
Windows support is partially broken in this version.
Version 0.4.7: Minor changes
This version has very few changes relative to 0.4.6. These are:
Bug fixes
- #301 : Now ensuring launch configurations can be assigned to each other.
Note: Users' help is kindly requested in preparing for the next major release, which will cover both the runtime and the driver API, and NVRTC as well. See this branch and contact me / open relevant issues.