From 2c9e5e7abdc5b0b813344580961d8222a716c83f Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Thu, 11 Feb 2021 10:21:58 -0600
Subject: [PATCH 01/12] More stuff.

---
 docs/mr_proposal.md | 47 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 docs/mr_proposal.md

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
new file mode 100644
index 0000000000..0dc51292b1
--- /dev/null
+++ b/docs/mr_proposal.md
@@ -0,0 +1,47 @@
+# CUDA `<memory_resource>` Extension Proposal
+
+## Motivation
+
+Performance sensitive applications that make frequent dynamic memory allocations often find that allocating memory to be a significant overhead.
+CUDA developers are even more acutely aware of the costs of dynamic allocation due to the relatively higher cost of `cudaMalloc/cudaFree` compared to standard `malloc/free`.
+As a result, developers devise custom, high-performance memory allocators as optimized as the application the allocator serves.
+However, what works well for one application will not always satisfy another, which leads to a proliferation of custom allocator implementations.
+Interoperation among these applications is difficult without an interface to enable sharing a common allocator.
+
+In Standard C++, [`Allocator`s](https://en.cppreference.com/w/cpp/named_req/Allocator) have traditionally provided this common interface.
+C++17 introduced [`<memory_resource>`](https://en.cppreference.com/w/cpp/header/memory_resource) and the [`std::pmr::memory_resource`](https://en.cppreference.com/w/cpp/memory/memory_resource) abstract class that defines a minimal interface for (de)allocating raw bytes and sits below `Allocator`s.
+This polymorphic interface provides the lingua franca for those who trade in custom memory allocators.
+
+
+
+However, the `std::pmr::memory_resource` interface is insufficient to capture the unique features of the CUDA C++ programming model.
+For example, Standard C++ only recognizes a single, universally accessible memory space; whereas CUDA C++ applications trade in at least four different kinds of dynamically allocated memory.
+Furthermore, CUDA's "stream"-based asynchronous execution model was extended in CUDA 11.2 with the addition of [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html)[1](#link-footnote) to include memory allocation as stream-ordered events.
+Therefore, there is a need for a common allocator interface similar to `std::pmr::memory_resource` that accounts for the unique features of CUDA C++.
+
+
+
+
+[1]: Note that `cudaMallocAsync` does not obviate the need for custom, CUDA-aware allocators nor a common allocation interface.
+There will never be one allocator that satisfies all users.
+Furthermore, a common interface allows composing and layering utilities like logging, leak checking, tracking, etc.
+
+## Description
+
+We propose extending `<memory_resource>` to provide a common memory allocation interface that meets the needs of CUDA C++ programmers.
+
+We chose `<memory_resource>` as the basis for a CUDA-specific allocator interface for several reasons:
+
+- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. It will ease working between Standard and CUDA C++ for there to be an allocator interface with a common look and feel.
+
+- The [RAPIDS Memory Management](https://github.com/rapidsai/rmm) library has had three years of [success](https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/) using its `rmm::device_memory_resource` interface based on `std::pmr::memory_resource`.
+
+- Likewise, [Thrust](https://github.com/NVIDIA/thrust) has had similar success with its `thrust::mr::memory_resource` interface.
+
+Given the direction of Standard C++ and the success of two widely used CUDA libraries with a similar interface, `<memory_resource>` is the logical choice.
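The footnote's layering claim above can be demonstrated with nothing but standard C++17 `<memory_resource>`. The following is an illustrative sketch only; `counting_resource` is a hypothetical name, not part of this proposal:

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// Illustrative only: a layered resource that counts (de)allocations while
// delegating the real work to an upstream std::pmr::memory_resource.
class counting_resource : public std::pmr::memory_resource {
public:
  explicit counting_resource(std::pmr::memory_resource* upstream) : upstream_{upstream} {}

  std::size_t allocations() const { return allocations_; }
  std::size_t deallocations() const { return deallocations_; }

private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    ++allocations_;
    return upstream_->allocate(bytes, alignment);
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    ++deallocations_;
    upstream_->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }

  std::pmr::memory_resource* upstream_;
  std::size_t allocations_ = 0;
  std::size_t deallocations_ = 0;
};
```

Because the wrapper is itself a `std::pmr::memory_resource`, such layers compose: a logging layer can wrap a tracking layer that wraps a pool, and callers only ever see the base-class pointer.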
+
+

From d4b41a92d857ba0e3a9774f8c9ecbea3b317833c Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Thu, 11 Feb 2021 12:35:28 -0600
Subject: [PATCH 02/12] Description.

---
 docs/mr_proposal.md | 137 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 136 insertions(+), 1 deletion(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 0dc51292b1..474d7b7202 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -35,7 +35,7 @@ We propose extending `<memory_resource>` to provide a common memory allocation i
 
 We chose `<memory_resource>` as the basis for a CUDA-specific allocator interface for several reasons:
 
-- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. It will ease working between Standard and CUDA C++ for there to be an allocator interface with a common look and feel.
+- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. It will ease working between Standard and CUDA C++ for there to be an allocator interface with a common look and feel. For more information on `<memory_resource>` see [here](https://www.youtube.com/watch?v=l14Zkx5OXr4) and [here](https://www.youtube.com/watch?v=l14Zkx5OXr4).
 
 - The [RAPIDS Memory Management](https://github.com/rapidsai/rmm) library has had three years of [success](https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/) using its `rmm::device_memory_resource` interface based on `std::pmr::memory_resource`.
 
@@ -43,5 +43,140 @@ We chose `<memory_resource>` as the basis for a CUDA-specific allocator interfac
 
 Given the direction of Standard C++ and the success of two widely used CUDA libraries with a similar interface, `<memory_resource>` is the logical choice.
 
+This proposal includes the addition of the following to libcu++:
+
+### `memory_kind`
+
+A scoped enumerator demarcating the different kinds of dynamically allocated CUDA memory.
+This is intended to be similar to the existing `thread_scope` enum.
+
+```c++
+enum class memory_kind {
+  device,   ///< Device memory accessible only from device
+  unified,  ///< Unified memory accessible from both host and device
+  pinned,   ///< Page-locked system memory accessible from both host and device
+  host      ///< System memory only accessible from host code
+};
+```
+
+### `stream_view`
+
+A strongly typed, non-owning, view-type for `cudaStream_t`.
+This type provides a more typesafe C++ wrapper around `cudaStream_t` and will serve as the input argument type for any libcu++ API that takes a CUDA stream.
+
+### `cuda::memory_resource`
+
+The `cuda::memory_resource` class template is the abstract base class interface akin to `std::pmr::memory_resource` with two main differences:
+
+1. The `Kind` template parameter determines the `memory_kind` allocated by the resource.
+
+2. The `Context` template parameter determines the "execution context" in which memory allocated by the resource can be accessed without synchronization.
+By default, the `Context` is the `any_context` tag type that indicates storage may be accessed immediately on any thread or CUDA stream without synchronization.
+
+```c++
+/**
+ * @brief Tag type for the default context of `memory_resource`.
+ *
+ * Default context in which storage may be used immediately on any thread or any
+ * CUDA stream without synchronization.
+ */
+struct any_context{};
+
+template <memory_kind Kind, typename Context = any_context>
+class memory_resource{
+public:
+  void* allocate(size_t n, size_t alignment){ return do_allocate(n, alignment); }
+  void deallocate(void * p, size_t n, size_t alignment){ return do_deallocate(p, n, alignment); }
+  Context get_context(){ return do_get_context(); }
+private:
+  virtual void* do_allocate(size_t n, size_t alignment) = 0;
+  virtual void do_deallocate(void* p, size_t n, size_t alignment) = 0;
+  virtual Context do_get_context() = 0;
+};
+```
+
+The purpose of the `Context` template parameter is to allow for more generic allocation semantics.
+For example, consider a "stream-bound" memory resource where allocated memory may only be accessed without synchronization on a particular stream bound at construction:
+
+```c++
+struct stream_context{
+  cuda::stream_view s;
+};
+
+template <memory_kind Kind>
+class stream_bound_memory_resource : public cuda::memory_resource<Kind, stream_context>{
+public:
+  stream_bound_memory_resource(cuda::stream_view s) : s_{s} {}
+private:
+  void* do_allocate(size_t n, size_t alignment) override { /* always allocate on `s` */ }
+  void do_deallocate(void* p, size_t n, size_t alignment) override { /* always deallocate on `s` */ }
+  stream_context do_get_context() override { return s_; }
+  stream_context s_;
+};
+```
+
+### `cuda::pmr_adaptor`
+
+`cuda::memory_resource` is similar to `std::pmr::memory_resource`, but they do not share a common inheritance hierarchy, therefore an object that derives from `cuda::memory_resource` cannot be used polymorphically as a `std::pmr::memory_resource`, i.e., a `cuda::memory_resource` derived type cannot be passed to a function that expects a `std::pmr::memory_resource` pointer or reference.
+However, there may be situations where one wishes to use a `cuda::memory_resource` derived type as if it were a `std::pmr::memory_resource` derived type.
+The `cuda::pmr_adaptor` class is intended to provide this functionality by inheriting from `std::pmr::memory_resource` and adapting an appropriate `cuda::memory_resource`.
+
+
+
+### `cuda::stream_ordered_memory_resource`
+
+The `cuda::stream_ordered_memory_resource` class template is the abstract base class interface for _stream-ordered_ memory allocation.
+This is similar to `cuda::memory_resource` but `allocate` and `deallocate` both take a stream argument and follow stream-ordered memory allocation semantics as defined by [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html).
+
+```c++
+template <memory_kind _Kind>
+class stream_ordered_memory_resource : public memory_resource<_Kind /* default context */>
+{
+public:
+  static constexpr size_t default_alignment = alignof(max_align_t);
+  // Two overloads exist so that callers can still implicitly use the `default_alignment` when passing a stream
+  void* allocate(size_t n, cuda::stream_view s){ return do_allocate(n, default_alignment, s); }
+  void* allocate(size_t n, size_t alignment, cuda::stream_view s){ return do_allocate(n, alignment, s); }
+  void deallocate(void* p, size_t n, cuda::stream_view s){ return do_deallocate(p, n, default_alignment, s); }
+  void deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate(p, n, alignment, s); }
+ private:
+  virtual void* do_allocate(size_t n, size_t alignment, cuda::stream_view s) = 0;
+  virtual void do_deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
+};
+```
+
+### Concrete Resource Implementations:
+
+Just as `<memory_resource>` provides concrete, derived implementations of `std::pmr::memory_resource`, libcu++ will provide the following:
+
+- `cuda::new_delete_resource : public cuda::memory_resource`
+  - Uses `::operator new()`/`::operator delete()` for allocating host memory
+- `cuda::cuda_resource : public cuda::memory_resource`
+  - Uses `cudaMalloc/cudaFree` for allocating device memory
+- `cuda::unified_resource : public cuda::memory_resource`
+  - Uses `cudaMallocManaged/cudaFree` for unified memory
+- `cuda::pinned_resource : public cuda::memory_resource`
+  - Uses `cudaMallocHost/cudaFreeHost` for page-locked host memory
+- `cuda::cuda_async_resource : public cuda::stream_oredered_memory_resource`
+  - Uses `cudaMallocAsync/cudaFreeAsync` for device memory
+
+Other resource implementations may be added as deemed appropriate.
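To give a feel for how a concrete implementation fills in this interface, here is a host-only sketch. The types below are hand-rolled stand-ins mirroring the interfaces sketched above (assumptions for illustration, not actual libcu++ declarations), with C++17 aligned `operator new` playing the role a CUDA allocation function would play in the real implementations:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hand-rolled stand-ins mirroring the proposed interface (not libcu++ code).
enum class memory_kind { device, unified, pinned, host };
struct any_context {};

template <memory_kind Kind, typename Context = any_context>
class memory_resource {
public:
  virtual ~memory_resource() = default;
  void* allocate(std::size_t n, std::size_t alignment) { return do_allocate(n, alignment); }
  void deallocate(void* p, std::size_t n, std::size_t alignment) { do_deallocate(p, n, alignment); }
  Context get_context() { return do_get_context(); }

private:
  virtual void* do_allocate(std::size_t n, std::size_t alignment) = 0;
  virtual void do_deallocate(void* p, std::size_t n, std::size_t alignment) = 0;
  virtual Context do_get_context() = 0;
};

// A `new_delete_resource`-style implementation for `memory_kind::host`.
class new_delete_resource final : public memory_resource<memory_kind::host> {
private:
  void* do_allocate(std::size_t n, std::size_t alignment) override {
    return ::operator new(n, std::align_val_t{alignment});
  }
  void do_deallocate(void* p, std::size_t n, std::size_t alignment) override {
    ::operator delete(p, n, std::align_val_t{alignment});
  }
  any_context do_get_context() override { return {}; }
};
```

Callers interact only with the non-virtual `allocate`/`deallocate` front ends through a base-class pointer, so swapping one implementation for another requires no changes at the call site.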
+
+### `cuda::polymorphic_allocator`
+
+TBD
+
+### `cuda::stream_ordered_allocator`
+
+TBD
+
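A portable sketch of the `pmr_adaptor` idea described above: any type exposing `allocate`/`deallocate` member functions can be surfaced through the `std::pmr::memory_resource` interface. Both `pmr_adaptor` and `malloc_resource` below are illustrative stand-ins, not the proposed API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory_resource>
#include <vector>

// Sketch: expose any resource-like type through std::pmr::memory_resource.
template <typename Resource>
class pmr_adaptor final : public std::pmr::memory_resource {
public:
  explicit pmr_adaptor(Resource* r) : r_{r} {}

private:
  void* do_allocate(std::size_t n, std::size_t align) override { return r_->allocate(n, align); }
  void do_deallocate(void* p, std::size_t n, std::size_t align) override { r_->deallocate(p, n, align); }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }
  Resource* r_;
};

// Toy resource with the member functions the adaptor expects. malloc ignores
// the requested alignment here, which is fine for fundamental alignments in a sketch.
struct malloc_resource {
  std::size_t outstanding = 0;
  void* allocate(std::size_t n, std::size_t) { ++outstanding; return std::malloc(n); }
  void deallocate(void* p, std::size_t, std::size_t) { --outstanding; std::free(p); }
};
```

With this shape, a `cuda::memory_resource`-derived type whose storage is host-accessible (e.g., pinned or unified) could be handed to any API that traffics in `std::pmr::memory_resource*`.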
From 72d024e4827494b78e6527498da44e2261295da7 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Thu, 11 Feb 2021 13:28:40 -0600
Subject: [PATCH 05/12] Add stub section for containers.

---
 docs/mr_proposal.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 474d7b7202..c943a174eb 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -141,7 +141,7 @@ public:
   void deallocate(void* p, size_t n, cuda::stream_view s){ return do_deallocate(p, n, default_alignment, s); }
   void deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate(p, n, alignment, s); }
  private:
-  virtual void* do_allocate(size_t n, size_t alignment, cuda::stream_view s) = 0;
+  virtual void* do_allocate(size_t nk, size_t alignment, cuda::stream_view s) = 0;
   virtual void do_deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
 };
 ```
@@ -171,6 +171,11 @@ TBD
 
 TBD
 
+### Containers
+
+TBD. libcu++ will provide memory owning container types that work with `cuda::memory_resource/cuda::stream_ordered_memory`.
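The container stub above can be sketched host-side, with `std::pmr` standing in for the proposed CUDA resources. An owning buffer acquires storage from a resource at construction and returns it at destruction; a device-side container would follow the same RAII shape (and additionally carry a stream in the stream-ordered case). `buffer` is a hypothetical name for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// RAII sketch of a memory-owning container over a polymorphic resource.
class buffer {
public:
  buffer(std::size_t size, std::pmr::memory_resource* mr)
      : size_{size}, mr_{mr}, data_{mr->allocate(size, alignof(std::max_align_t))} {}
  ~buffer() { mr_->deallocate(data_, size_, alignof(std::max_align_t)); }

  // Owning type: forbid copies so the storage is released exactly once.
  buffer(const buffer&) = delete;
  buffer& operator=(const buffer&) = delete;

  void* data() { return data_; }
  std::size_t size() const { return size_; }

private:
  std::size_t size_;
  std::pmr::memory_resource* mr_;
  void* data_;
};
```

Keeping the resource pointer in the container (rather than baking the allocation strategy into the type) is what lets one container type serve every `memory_kind`.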
+

From 62b09acafd188af115e35be08f5f918297e304b3 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:11:26 -0600
Subject: [PATCH 06/12] Apply suggestions from code review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Michał Zientkiewicz
Co-authored-by: Mark Harris
---
 docs/mr_proposal.md | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index c943a174eb..48570c0d3e 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -2,22 +2,22 @@
 
 ## Motivation
 
-Performance sensitive applications that make frequent dynamic memory allocations often find that allocating memory to be a significant overhead.
+Performance sensitive applications that make frequent dynamic memory allocations often find allocating memory to be a significant overhead.
 CUDA developers are even more acutely aware of the costs of dynamic allocation due to the relatively higher cost of `cudaMalloc/cudaFree` compared to standard `malloc/free`.
 As a result, developers devise custom, high-performance memory allocators as optimized as the application the allocator serves.
 However, what works well for one application will not always satisfy another, which leads to a proliferation of custom allocator implementations.
 Interoperation among these applications is difficult without an interface to enable sharing a common allocator.
 
-In Standard C++, [`Allocator`s](https://en.cppreference.com/w/cpp/named_req/Allocator) have traditionally provided this common interface.
-C++17 introduced [`<memory_resource>`](https://en.cppreference.com/w/cpp/header/memory_resource) and the [`std::pmr::memory_resource`](https://en.cppreference.com/w/cpp/memory/memory_resource) abstract class that defines a minimal interface for (de)allocating raw bytes and sits below `Allocator`s.
-This polymorphic interface provides the lingua franca for those who trade in custom memory allocators.
+In Standard C++, [`Allocator`](https://en.cppreference.com/w/cpp/named_req/Allocator) has traditionally provided this common interface.
+C++17 introduced [`<memory_resource>`](https://en.cppreference.com/w/cpp/header/memory_resource) and the [`std::pmr::memory_resource`](https://en.cppreference.com/w/cpp/memory/memory_resource) abstract class that defines a minimal interface for (de)allocating raw bytes and sits below `Allocator`.
+This polymorphic interface provides a standard way to define, expose and share custom memory allocation.
 
 
 
 However, the `std::pmr::memory_resource` interface is insufficient to capture the unique features of the CUDA C++ programming model.
-For example, Standard C++ only recognizes a single, universally accessible memory space; whereas CUDA C++ applications trade in at least four different kinds of dynamically allocated memory.
+For example, Standard C++ only recognizes a single, universally accessible memory space; whereas CUDA C++ applications may access at least four different kinds of dynamically allocated memory.
-Furthermore, CUDA's "stream"-based asynchronous execution model was extended in CUDA 11.2 with the addition of [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html)[1](#link-footnote) to include memory allocation as stream-ordered events.
+Furthermore, CUDA's "stream"-based asynchronous execution model was extended in CUDA 11.2 with the addition of [`cudaMallocAsync` and `cudaFreeAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html)[1](#link-footnote), which provide stream-ordered memory allocation and deallocation.
 Therefore, there is a need for a common allocator interface similar to `std::pmr::memory_resource` that accounts for the unique features of CUDA C++.
 
 
@@ -35,7 +35,7 @@ We propose extending `<memory_resource>` to provide a common memory allocation i
 
 We chose `<memory_resource>` as the basis for a CUDA-specific allocator interface for several reasons:
 
-- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. It will ease working between Standard and CUDA C++ for there to be an allocator interface with a common look and feel. For more information on `<memory_resource>` see [here](https://www.youtube.com/watch?v=l14Zkx5OXr4) and [here](https://www.youtube.com/watch?v=l14Zkx5OXr4).
+- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. An allocator interface with a common look and feel will ease working between Standard and CUDA C++. For more information on `<memory_resource>` see [here](https://www.youtube.com/watch?v=l14Zkx5OXr4) and [here](https://www.youtube.com/watch?v=l14Zkx5OXr4).
 
 - The [RAPIDS Memory Management](https://github.com/rapidsai/rmm) library has had three years of [success](https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/) using its `rmm::device_memory_resource` interface based on `std::pmr::memory_resource`.
 
@@ -62,7 +62,7 @@
 ### `stream_view`
 
 A strongly typed, non-owning, view-type for `cudaStream_t`.
-This type provides a more typesafe C++ wrapper around `cudaStream_t` and will serve as the input argument type for any libcu++ API that takes a CUDA stream.
+This type provides a more typesafe C++ wrapper around `cudaStream_t` and serves as the input argument type for any libcu++ API that takes a CUDA stream.
 
 ### `cuda::memory_resource`
 
@@ -121,9 +121,6 @@
 However, there may be situations where one wishes to use a `cuda::memory_resource` derived type as if it were a `std::pmr::memory_resource` derived type.
 The `cuda::pmr_adaptor` class is intended to provide this functionality by inheriting from `std::pmr::memory_resource` and adapting an appropriate `cuda::memory_resource`.
- - - ### `cuda::stream_ordered_memory_resource` The `cuda::stream_ordered_memory_resource` class template is the abstract base class interface for _stream-ordered_ memory allocation. @@ -184,4 +181,3 @@ TBD. libcu++ will provide memory owning container types that work with `cuda::me - From ca21851e96e14005440efce5f0d6854778201a30 Mon Sep 17 00:00:00 2001 From: Jake Hemstad Date: Mon, 1 Mar 2021 13:15:29 -0600 Subject: [PATCH 07/12] Add _async. --- docs/mr_proposal.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md index 48570c0d3e..aa945fb1b1 100644 --- a/docs/mr_proposal.md +++ b/docs/mr_proposal.md @@ -124,7 +124,7 @@ The `cuda::pmr_adaptor` class is intended to provide this functionality by inher ### `cuda::stream_ordered_memory_resource` The `cuda::stream_ordered_memory_resource` class template is the abstract base class interface for _stream-ordered_ memory allocation. -This is similar to `cuda::memory_resource` but `allocate` and `deallocate` both take a stream argument and follow stream-ordered memory allocation semantics as defined by [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html). +This is similar to `cuda::memory_resource` but `allocate_async` and `deallocate_async` both take a stream argument and follow stream-ordered memory allocation semantics as defined by [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html). 
 ```c++
 template <memory_kind _Kind>
 class stream_ordered_memory_resource : public memory_resource<_Kind /* default context */>
 {
 public:
   static constexpr size_t default_alignment = alignof(max_align_t);
   // Two overloads exist so that callers can still implicitly use the `default_alignment` when passing a stream
-  void* allocate(size_t n, cuda::stream_view s){ return do_allocate(n, default_alignment, s); }
-  void* allocate(size_t n, size_t alignment, cuda::stream_view s){ return do_allocate(n, alignment, s); }
-  void deallocate(void* p, size_t n, cuda::stream_view s){ return do_deallocate(p, n, default_alignment, s); }
-  void deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate(p, n, alignment, s); }
+  void* allocate_async(size_t n, cuda::stream_view s){ return do_allocate_async(n, default_alignment, s); }
+  void* allocate_async(size_t n, size_t alignment, cuda::stream_view s){ return do_allocate_async(n, alignment, s); }
+  void deallocate_async(void* p, size_t n, cuda::stream_view s){ return do_deallocate_async(p, n, default_alignment, s); }
+  void deallocate_async(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate_async(p, n, alignment, s); }
  private:
-  virtual void* do_allocate(size_t nk, size_t alignment, cuda::stream_view s) = 0;
-  virtual void do_deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
+  virtual void* do_allocate_async(size_t nk, size_t alignment, cuda::stream_view s) = 0;
+  virtual void do_deallocate_async(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
 };
 ```

From 125975d97fdbee4ddf8c5065d81af427273ff6aa Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:15:38 -0600
Subject: [PATCH 08/12] Fix typo.
---
 docs/mr_proposal.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index aa945fb1b1..81d817a311 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -155,7 +155,7 @@ Just as `<memory_resource>` provides concrete, derived implementations of `std::
   - Uses `cudaMallocManaged/cudaFree` for unified memory
 - `cuda::pinned_resource : public cuda::memory_resource`
   - Uses `cudaMallocHost/cudaFreeHost` for page-locked host memory
-- `cuda::cuda_async_resource : public cuda::stream_oredered_memory_resource`
+- `cuda::cuda_async_resource : public cuda::stream_ordered_memory_resource`
   - Uses `cudaMallocAsync/cudaFreeAsync` for device memory
 
 Other resource implementations may be added as deemed appropriate.

From ffd93854a8a8cc2ce05ccc0f35494f0fd45e74f9 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:33:42 -0600
Subject: [PATCH 09/12] Update future work section.

---
 docs/mr_proposal.md | 22 +++-------------------
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 81d817a311..6e971b92be 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -160,24 +160,8 @@ Just as `<memory_resource>` provides concrete, derived implementations of `std::
 
 Other resource implementations may be added as deemed appropriate.
 
-### `cuda::polymorphic_allocator`
-
-TBD
-
-### `cuda::stream_ordered_allocator`
-
-TBD
-
-### Containers
-
-TBD. libcu++ will provide memory owning container types that work with `cuda::memory_resource/cuda::stream_ordered_memory`.
-
-
-
-
-
-
-
-
+## Future Work
+Future work will include the design of allocators similar to `std::pmr::polymorphic_allocator` to work with `cuda::memory_resource` and `cuda::stream_ordered_memory_resource`.
+Likewise, containers that work with `cuda::memory_resource` and `cuda::stream_ordered_memory_resource` will be future work.
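To make the `allocate_async`/`deallocate_async` interface of patch 07 concrete, here is a host-only sketch of how a derived resource implements the private virtual hooks. Every name here (`stream_view`, the base class, `counting_resource`) is a simplified, illustrative stand-in rather than the actual libcu++ definition: the mock ignores the stream and uses `malloc`/`free` where a real implementation would forward to `cudaMallocAsync`/`cudaFreeAsync` on the given stream.

```cpp
#include <cstddef>
#include <cstdlib>

// Illustrative stand-in for cuda::stream_view (not the real type).
struct stream_view { int id = 0; };

// Host-only mock of the proposed abstract interface: public, non-virtual
// allocate_async/deallocate_async forward to private virtual hooks, and the
// two-overload pattern lets callers omit the alignment argument.
class stream_ordered_memory_resource {
public:
  static constexpr std::size_t default_alignment = alignof(std::max_align_t);

  void* allocate_async(std::size_t n, stream_view s) {
    return do_allocate_async(n, default_alignment, s);
  }
  void* allocate_async(std::size_t n, std::size_t alignment, stream_view s) {
    return do_allocate_async(n, alignment, s);
  }
  void deallocate_async(void* p, std::size_t n, stream_view s) {
    do_deallocate_async(p, n, default_alignment, s);
  }
  void deallocate_async(void* p, std::size_t n, std::size_t alignment, stream_view s) {
    do_deallocate_async(p, n, alignment, s);
  }
  virtual ~stream_ordered_memory_resource() = default;

private:
  virtual void* do_allocate_async(std::size_t n, std::size_t alignment, stream_view s) = 0;
  virtual void do_deallocate_async(void* p, std::size_t n, std::size_t alignment, stream_view s) = 0;
};

// Toy derived resource: counts calls and uses malloc/free, standing in for an
// implementation backed by cudaMallocAsync/cudaFreeAsync.
class counting_resource final : public stream_ordered_memory_resource {
public:
  int allocations = 0;
  std::size_t last_alignment = 0;

private:
  void* do_allocate_async(std::size_t n, std::size_t alignment, stream_view) override {
    ++allocations;
    last_alignment = alignment;
    return std::malloc(n);
  }
  void do_deallocate_async(void* p, std::size_t, std::size_t, stream_view) override {
    std::free(p);
  }
};
```

Note how the two-argument overload quietly supplies `default_alignment`, so most callers only ever mention a size and a stream.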
From 45c5ab971295122b645cfb9780ea26dcbca46e95 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:35:09 -0600
Subject: [PATCH 10/12] Rename derived implementations.

---
 docs/mr_proposal.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 6e971b92be..56f209d1dd 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -149,13 +149,13 @@ Just as `<memory_resource>` provides concrete, derived implementations of `std::
 
 - `cuda::new_delete_resource : public cuda::memory_resource`
   - Uses `::operator new()`/`::operator delete()` for allocating host memory
-- `cuda::cuda_resource : public cuda::memory_resource`
+- `cuda::device_resource : public cuda::memory_resource`
   - Uses `cudaMalloc/cudaFree` for allocating device memory
 - `cuda::unified_resource : public cuda::memory_resource`
   - Uses `cudaMallocManaged/cudaFree` for unified memory
 - `cuda::pinned_resource : public cuda::memory_resource`
   - Uses `cudaMallocHost/cudaFreeHost` for page-locked host memory
-- `cuda::cuda_async_resource : public cuda::stream_ordered_memory_resource`
+- `cuda::async_device_resource : public cuda::stream_ordered_memory_resource`
   - Uses `cudaMallocAsync/cudaFreeAsync` for device memory
 
 Other resource implementations may be added as deemed appropriate.

From 544754f95f17baabe3f4e45b189f864dec1c3c15 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:44:28 -0600
Subject: [PATCH 11/12] Update footnote.

---
 docs/mr_proposal.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 56f209d1dd..099f750cf1 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -25,9 +25,9 @@ Therefore, there is a need for a common allocator interface similar to `std::pmr
 
-[1]: Note that `cudaMallocAsync` does not obviate the need for custom, CUDA-aware allocators nor a common allocation interface.
+[1]: Note that `cudaMallocAsync` obviates neither the need for custom, CUDA-aware allocators, nor the need for a common allocation interface.
 There will never be one allocator that satisfies all users.
-Furthermore, a common interface allows composing and layering utilities like logging, leak checking, tracking, etc.
+Furthermore, a common interface allows composing and layering utilities like logging, limiting, leak checking, and tracking.
 
 ## Description
 
 We propose extending `<memory_resource>` to provide a common memory allocation interface that meets the needs of CUDA C++ programmers.

From acf4247c1d849b47a28bb0a284c9bd7465c1df8e Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:52:31 -0600
Subject: [PATCH 12/12] Update namespace policy.

---
 docs/mr_proposal.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 099f750cf1..9ed618672f 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -160,6 +160,14 @@ Just as `<memory_resource>` provides concrete, derived implementations of `std::
 
 Other resource implementations may be added as deemed appropriate.
 
+## `cuda::` Namespace Policy
+
+The current policy of libcu++ is that everything in the `cuda::` namespace must be heterogeneous, i.e., `__host__ __device__`.
+The facilities described above in `<memory_resource>` are intended to be host-only at this time.
+Therefore, we propose to modify the policy to allow host-only constructs in `cuda::`.
+Device-only constructs will still be disallowed in `cuda::`.
+Any device-only construct would go into `cuda::device::`.
+
 ## Future Work
 Future work will include the design of allocators similar to `std::pmr::polymorphic_allocator` to work with `cuda::memory_resource` and `cuda::stream_ordered_memory_resource`.
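The `cuda::pmr_adaptor` mentioned in the proposal can be pictured with a host-only sketch. The names `cuda_like_memory_resource`, `new_delete_like_resource`, and `pmr_adaptor` below are illustrative mocks under simplifying assumptions (the proposed interface is actually templated on a memory kind); the sketch only shows the mechanism: inheriting from `std::pmr::memory_resource` and forwarding its virtual hooks lets a CUDA-style resource feed standard `std::pmr` containers.

```cpp
#include <cstddef>
#include <memory_resource>
#include <new>

// Illustrative, host-only stand-in for the proposed cuda::memory_resource
// interface (the real class is templated on a memory kind).
class cuda_like_memory_resource {
public:
  virtual ~cuda_like_memory_resource() = default;
  void* allocate(std::size_t n, std::size_t alignment) { return do_allocate(n, alignment); }
  void deallocate(void* p, std::size_t n, std::size_t alignment) { do_deallocate(p, n, alignment); }
private:
  virtual void* do_allocate(std::size_t n, std::size_t alignment) = 0;
  virtual void do_deallocate(void* p, std::size_t n, std::size_t alignment) = 0;
};

// Concrete resource using aligned global operator new/delete, loosely
// mirroring the proposed cuda::new_delete_resource.
class new_delete_like_resource final : public cuda_like_memory_resource {
private:
  void* do_allocate(std::size_t n, std::size_t alignment) override {
    return ::operator new(n, std::align_val_t{alignment});
  }
  void do_deallocate(void* p, std::size_t, std::size_t alignment) override {
    ::operator delete(p, std::align_val_t{alignment});
  }
};

// Sketch of a pmr_adaptor: presents a CUDA-style resource through the
// std::pmr::memory_resource interface so std::pmr containers can use it.
class pmr_adaptor final : public std::pmr::memory_resource {
public:
  explicit pmr_adaptor(cuda_like_memory_resource& mr) : mr_{&mr} {}
private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    return mr_->allocate(bytes, alignment);
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    mr_->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }
  cuda_like_memory_resource* mr_;
};
```

The same wrapping pattern is what makes the layered utilities from the footnote (logging, limiting, leak checking, tracking) composable: each is just another resource holding a pointer to the resource beneath it.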