From 2c9e5e7abdc5b0b813344580961d8222a716c83f Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Thu, 11 Feb 2021 10:21:58 -0600
Subject: [PATCH 01/12] More stuff.

---
 docs/mr_proposal.md | 47 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)
 create mode 100644 docs/mr_proposal.md

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
new file mode 100644
index 0000000000..0dc51292b1
--- /dev/null
+++ b/docs/mr_proposal.md
@@ -0,0 +1,47 @@
+# CUDA `<memory_resource>` Extension Proposal
+
+## Motivation
+
+Performance sensitive applications that make frequent dynamic memory allocations often find that allocating memory to be a significant overhead.
+CUDA developers are even more acutely aware of the costs of dynamic allocation due to the relatively higher cost of `cudaMalloc/cudaFree` compared to standard `malloc/free`.
+As a result, developers devise custom, high-performance memory allocators as optimized as the application the allocator serves.
+However, what works well for one application will not always satisfy another, which leads to a proliferation of custom allocator implementations.
+Interoperation among these applications is difficult without an interface to enable sharing a common allocator.
+
+In Standard C++, [`Allocator`s](https://en.cppreference.com/w/cpp/named_req/Allocator) have traditionally provided this common interface.
+C++17 introduced [`<memory_resource>`](https://en.cppreference.com/w/cpp/header/memory_resource) and the [`std::pmr::memory_resource`](https://en.cppreference.com/w/cpp/memory/memory_resource) abstract class that defines a minimal interface for (de)allocating raw bytes and sits below `Allocator`s.
+This polymorphic interface provides the lingua franca for those who trade in custom memory allocators.
+
+
+
+However, the `std::pmr::memory_resource` interface is insufficient to capture the unique features of the CUDA C++ programming model.
+For example, Standard C++ only recognizes a single, universally accessible memory space; whereas CUDA C++ applications trade in at least four different kinds of dynamically allocated memory.
+Furthermore, CUDA's "stream"-based asynchronous execution model was extended in CUDA 11.2 with the addition of [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html)[1](#link-footnote) to include memory allocation as stream-ordered events.
+Therefore, there is a need for a common allocator interface similar to `std::pmr::memory_resource` that accounts for the unique features of CUDA C++.
+
+
+
+
+[1]: Note that `cudaMallocAsync` does not obviate the need for custom, CUDA-aware allocators nor a common allocation interface.
+There will never be one allocator that satisfies all users.
+Furthermore, a common interface allows composing and layering utilities like logging, leak checking, tracking, etc.
+
+## Description
+
+We propose extending `<memory_resource>` to provide a common memory allocation interface that meets the needs of CUDA C++ programmers.
+
+We chose `<memory_resource>` as the basis for a CUDA-specific allocator interface for several reasons:
+
+- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. It will ease working between Standard and CUDA C++ for there to be an allocator interface with a common look and feel.
+
+- The [RAPIDS Memory Management](https://github.com/rapidsai/rmm) library has had three years of [success](https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/) using its `rmm::device_memory_resource` interface based on `std::pmr::memory_resource`.
+
+- Likewise, [Thrust](https://github.com/NVIDIA/thrust) has had similar success with its `thrust::mr::memory_resource` interface.
+
+Given the direction of Standard C++ and the success of two widely used CUDA libraries with a similar interface, `<memory_resource>` is the logical choice.
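The footnote's layering claim above can be demonstrated with nothing but standard C++17 `<memory_resource>`. The following is an illustrative sketch only; `counting_resource` is a hypothetical name, not part of this proposal:

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// Illustrative only: a layered resource that counts (de)allocations while
// delegating the real work to an upstream std::pmr::memory_resource.
class counting_resource : public std::pmr::memory_resource {
public:
  explicit counting_resource(std::pmr::memory_resource* upstream) : upstream_{upstream} {}

  std::size_t allocations() const { return allocations_; }
  std::size_t deallocations() const { return deallocations_; }

private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    ++allocations_;
    return upstream_->allocate(bytes, alignment);
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    ++deallocations_;
    upstream_->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }

  std::pmr::memory_resource* upstream_;
  std::size_t allocations_ = 0;
  std::size_t deallocations_ = 0;
};
```

Because the wrapper is itself a `std::pmr::memory_resource`, such layers compose: a logging layer can wrap a tracking layer that wraps a pool, and callers only ever see the base-class pointer.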
+
+

From d4b41a92d857ba0e3a9774f8c9ecbea3b317833c Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Thu, 11 Feb 2021 12:35:28 -0600
Subject: [PATCH 02/12] Description.

---
 docs/mr_proposal.md | 137 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 136 insertions(+), 1 deletion(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 0dc51292b1..474d7b7202 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -35,7 +35,7 @@ We propose extending `<memory_resource>` to provide a common memory allocation i
 
 We chose `<memory_resource>` as the basis for a CUDA-specific allocator interface for several reasons:
 
-- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. It will ease working between Standard and CUDA C++ for there to be an allocator interface with a common look and feel.
+- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. It will ease working between Standard and CUDA C++ for there to be an allocator interface with a common look and feel. For more information on `<memory_resource>` see [here](https://www.youtube.com/watch?v=l14Zkx5OXr4) and [here](https://www.youtube.com/watch?v=l14Zkx5OXr4).
 
 - The [RAPIDS Memory Management](https://github.com/rapidsai/rmm) library has had three years of [success](https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/) using its `rmm::device_memory_resource` interface based on `std::pmr::memory_resource`.
 
@@ -43,5 +43,140 @@ We chose `<memory_resource>` as the basis for a CUDA-specific allocator interfac
 
 Given the direction of Standard C++ and the success of two widely used CUDA libraries with a similar interface, `<memory_resource>` is the logical choice.
 
+This proposal includes the addition of the following to libcu++:
+
+### `memory_kind`
+
+A scoped enumerator demarcating the different kinds of dynamically allocated CUDA memory.
+This is intended to be similar to the existing `thread_scope` enum.
+
+```c++
+enum class memory_kind {
+  device,   ///< Device memory accessible only from device
+  unified,  ///< Unified memory accessible from both host and device
+  pinned,   ///< Page-locked system memory accessible from both host and device
+  host      ///< System memory only accessible from host code
+};
+```
+
+### `stream_view`
+
+A strongly typed, non-owning, view-type for `cudaStream_t`.
+This type provides a more typesafe C++ wrapper around `cudaStream_t` and will serve as the input argument type for any libcu++ API that takes a CUDA stream.
+
+### `cuda::memory_resource`
+
+The `cuda::memory_resource` class template is the abstract base class interface akin to `std::pmr::memory_resource` with two main differences:
+
+1. The `Kind` template parameter determines the `memory_kind` allocated by the resource.
+
+2. The `Context` template parameter determines the "execution context" in which memory allocated by the resource can be accessed without synchronization.
+By default, the `Context` is the `any_context` tag type that indicates storage may be accessed immediately on any thread or CUDA stream without synchronization.
+
+```c++
+/**
+ * @brief Tag type for the default context of `memory_resource`.
+ *
+ * Default context in which storage may be used immediately on any thread or any
+ * CUDA stream without synchronization.
+ */
+struct any_context{};
+
+template <memory_kind Kind, typename Context = any_context>
+class memory_resource{
+public:
+  void* allocate(size_t n, size_t alignment){ return do_allocate(n, alignment); }
+  void deallocate(void * p, size_t n, size_t alignment){ return do_deallocate(p, n, alignment); }
+  Context get_context(){ return do_get_context(); }
+private:
+  virtual void* do_allocate(size_t n, size_t alignment) = 0;
+  virtual void do_deallocate(void* p, size_t n, size_t alignment) = 0;
+  virtual Context do_get_context() = 0;
+};
+```
+
+The purpose of the `Context` template parameter is to allow for more generic allocation semantics.
+For example, consider a "stream-bound" memory resource where allocated memory may only be accessed without synchronization on a particular stream bound at construction:
+
+```c++
+struct stream_context{
+  cuda::stream_view s;
+};
+
+template <memory_kind Kind>
+class stream_bound_memory_resource : public cuda::memory_resource<Kind, stream_context>{
+public:
+  stream_bound_memory_resource(cuda::stream_view s) : s_{s} {}
+private:
+  void* do_allocate(size_t n, size_t alignment) override { /* always allocate on `s` */ }
+  void do_deallocate(void* p, size_t n, size_t alignment) override { /* always deallocate on `s` */ }
+  stream_context do_get_context() override { return s_; }
+  stream_context s_;
+};
+```
+
+### `cuda::pmr_adaptor`
+
+`cuda::memory_resource` is similar to `std::pmr::memory_resource`, but they do not share a common inheritance hierarchy, therefore an object that derives from `cuda::memory_resource` cannot be used polymorphically as a `std::pmr::memory_resource`, i.e., a `cuda::memory_resource` derived type cannot be passed to a function that expects a `std::pmr::memory_resource` pointer or reference.
+However, there may be situations where one wishes to use a `cuda::memory_resource` derived type as if it were a `std::pmr::memory_resource` derived type.
+The `cuda::pmr_adaptor` class is intended to provide this functionality by inheriting from `std::pmr::memory_resource` and adapting an appropriate `cuda::memory_resource`.
+
+
+
+### `cuda::stream_ordered_memory_resource`
+
+The `cuda::stream_ordered_memory_resource` class template is the abstract base class interface for _stream-ordered_ memory allocation.
+This is similar to `cuda::memory_resource` but `allocate` and `deallocate` both take a stream argument and follow stream-ordered memory allocation semantics as defined by [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html).
+
+```c++
+template <memory_kind _Kind>
+class stream_ordered_memory_resource : public memory_resource<_Kind /* default context */>
+{
+public:
+  static constexpr size_t default_alignment = alignof(max_align_t);
+  // Two overloads exist so that callers can still implicitly use the `default_alignment` when passing a stream
+  void* allocate(size_t n, cuda::stream_view s){ return do_allocate(n, default_alignment, s); }
+  void* allocate(size_t n, size_t alignment, cuda::stream_view s){ return do_allocate(n, alignment, s); }
+  void deallocate(void* p, size_t n, cuda::stream_view s){ return do_deallocate(p, n, default_alignment, s); }
+  void deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate(p, n, alignment, s); }
+ private:
+  virtual void* do_allocate(size_t n, size_t alignment, cuda::stream_view s) = 0;
+  virtual void do_deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
+};
+```
+
+### Concrete Resource Implementations:
+
+Just as `<memory_resource>` provides concrete, derived implementations of `std::pmr::memory_resource`, libcu++ will provide the following:
+
+- `cuda::new_delete_resource : public cuda::memory_resource`
+  - Uses `::operator new()`/`::operator delete()` for allocating host memory
+- `cuda::cuda_resource : public cuda::memory_resource`
+  - Uses `cudaMalloc/cudaFree` for allocating device memory
+- `cuda::unified_resource : public cuda::memory_resource`
+  - Uses `cudaMallocManaged/cudaFree` for unified memory
+- `cuda::pinned_resource : public cuda::memory_resource`
+  - Uses `cudaMallocHost/cudaFreeHost` for page-locked host memory
+- `cuda::cuda_async_resource : public cuda::stream_oredered_memory_resource`
+  - Uses `cudaMallocAsync/cudaFreeAsync` for device memory
+
+Other resource implementations may be added as deemed appropriate.
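To give a feel for how a concrete implementation fills in this interface, here is a host-only sketch. The types below are hand-rolled stand-ins mirroring the interfaces sketched above (assumptions for illustration, not actual libcu++ declarations), with C++17 aligned `operator new` playing the role a CUDA allocation function would play in the real implementations:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hand-rolled stand-ins mirroring the proposed interface (not libcu++ code).
enum class memory_kind { device, unified, pinned, host };
struct any_context {};

template <memory_kind Kind, typename Context = any_context>
class memory_resource {
public:
  virtual ~memory_resource() = default;
  void* allocate(std::size_t n, std::size_t alignment) { return do_allocate(n, alignment); }
  void deallocate(void* p, std::size_t n, std::size_t alignment) { do_deallocate(p, n, alignment); }
  Context get_context() { return do_get_context(); }

private:
  virtual void* do_allocate(std::size_t n, std::size_t alignment) = 0;
  virtual void do_deallocate(void* p, std::size_t n, std::size_t alignment) = 0;
  virtual Context do_get_context() = 0;
};

// A `new_delete_resource`-style implementation for `memory_kind::host`.
class new_delete_resource final : public memory_resource<memory_kind::host> {
private:
  void* do_allocate(std::size_t n, std::size_t alignment) override {
    return ::operator new(n, std::align_val_t{alignment});
  }
  void do_deallocate(void* p, std::size_t n, std::size_t alignment) override {
    ::operator delete(p, n, std::align_val_t{alignment});
  }
  any_context do_get_context() override { return {}; }
};
```

Callers interact only with the non-virtual `allocate`/`deallocate` front ends through a base-class pointer, so swapping one implementation for another requires no changes at the call site.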
+
+### `cuda::polymorphic_allocator`
+
+TBD
+
+### `cuda::stream_ordered_allocator`
+
+TBD
+
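A portable sketch of the `pmr_adaptor` idea described above: any type exposing `allocate`/`deallocate` member functions can be surfaced through the `std::pmr::memory_resource` interface. Both `pmr_adaptor` and `malloc_resource` below are illustrative stand-ins, not the proposed API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory_resource>
#include <vector>

// Sketch: expose any resource-like type through std::pmr::memory_resource.
template <typename Resource>
class pmr_adaptor final : public std::pmr::memory_resource {
public:
  explicit pmr_adaptor(Resource* r) : r_{r} {}

private:
  void* do_allocate(std::size_t n, std::size_t align) override { return r_->allocate(n, align); }
  void do_deallocate(void* p, std::size_t n, std::size_t align) override { r_->deallocate(p, n, align); }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }
  Resource* r_;
};

// Toy resource with the member functions the adaptor expects. malloc ignores
// the requested alignment here, which is fine for fundamental alignments in a sketch.
struct malloc_resource {
  std::size_t outstanding = 0;
  void* allocate(std::size_t n, std::size_t) { ++outstanding; return std::malloc(n); }
  void deallocate(void* p, std::size_t, std::size_t) { --outstanding; std::free(p); }
};
```

With this shape, a `cuda::memory_resource`-derived type whose storage is host-accessible (e.g., pinned or unified) could be handed to any API that traffics in `std::pmr::memory_resource*`.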
From 72d024e4827494b78e6527498da44e2261295da7 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Thu, 11 Feb 2021 13:28:40 -0600
Subject: [PATCH 05/12] Add stub section for containers.

---
 docs/mr_proposal.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 474d7b7202..c943a174eb 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -141,7 +141,7 @@ public:
   void deallocate(void* p, size_t n, cuda::stream_view s){ return do_deallocate(p, n, default_alignment, s); }
   void deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate(p, n, alignment, s); }
  private:
-  virtual void* do_allocate(size_t n, size_t alignment, cuda::stream_view s) = 0;
+  virtual void* do_allocate(size_t nk, size_t alignment, cuda::stream_view s) = 0;
   virtual void do_deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
 };
 ```
@@ -171,6 +171,11 @@ TBD
 
 TBD
 
+### Containers
+
+TBD. libcu++ will provide memory owning container types that work with `cuda::memory_resource/cuda::stream_ordered_memory`.
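The container stub above can be sketched host-side, with `std::pmr` standing in for the proposed CUDA resources. An owning buffer acquires storage from a resource at construction and returns it at destruction; a device-side container would follow the same RAII shape (and additionally carry a stream in the stream-ordered case). `buffer` is a hypothetical name for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// RAII sketch of a memory-owning container over a polymorphic resource.
class buffer {
public:
  buffer(std::size_t size, std::pmr::memory_resource* mr)
      : size_{size}, mr_{mr}, data_{mr->allocate(size, alignof(std::max_align_t))} {}
  ~buffer() { mr_->deallocate(data_, size_, alignof(std::max_align_t)); }

  // Owning type: forbid copies so the storage is released exactly once.
  buffer(const buffer&) = delete;
  buffer& operator=(const buffer&) = delete;

  void* data() { return data_; }
  std::size_t size() const { return size_; }

private:
  std::size_t size_;
  std::pmr::memory_resource* mr_;
  void* data_;
};
```

Keeping the resource pointer in the container (rather than baking the allocation strategy into the type) is what lets one container type serve every `memory_kind`.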
+

From 62b09acafd188af115e35be08f5f918297e304b3 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:11:26 -0600
Subject: [PATCH 06/12] Apply suggestions from code review
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Michał Zientkiewicz
Co-authored-by: Mark Harris
---
 docs/mr_proposal.md | 20 ++++++++------------
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index c943a174eb..48570c0d3e 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -2,22 +2,22 @@
 
 ## Motivation
 
-Performance sensitive applications that make frequent dynamic memory allocations often find that allocating memory to be a significant overhead.
+Performance sensitive applications that make frequent dynamic memory allocations often find allocating memory to be a significant overhead.
 CUDA developers are even more acutely aware of the costs of dynamic allocation due to the relatively higher cost of `cudaMalloc/cudaFree` compared to standard `malloc/free`.
 As a result, developers devise custom, high-performance memory allocators as optimized as the application the allocator serves.
 However, what works well for one application will not always satisfy another, which leads to a proliferation of custom allocator implementations.
 Interoperation among these applications is difficult without an interface to enable sharing a common allocator.
 
-In Standard C++, [`Allocator`s](https://en.cppreference.com/w/cpp/named_req/Allocator) have traditionally provided this common interface.
-C++17 introduced [`<memory_resource>`](https://en.cppreference.com/w/cpp/header/memory_resource) and the [`std::pmr::memory_resource`](https://en.cppreference.com/w/cpp/memory/memory_resource) abstract class that defines a minimal interface for (de)allocating raw bytes and sits below `Allocator`s.
-This polymorphic interface provides the lingua franca for those who trade in custom memory allocators.
+In Standard C++, [`Allocator`](https://en.cppreference.com/w/cpp/named_req/Allocator) has traditionally provided this common interface.
+C++17 introduced [`<memory_resource>`](https://en.cppreference.com/w/cpp/header/memory_resource) and the [`std::pmr::memory_resource`](https://en.cppreference.com/w/cpp/memory/memory_resource) abstract class that defines a minimal interface for (de)allocating raw bytes and sits below `Allocator`.
+This polymorphic interface provides a standard way to define, expose and share custom memory allocation.
 
 
 
 However, the `std::pmr::memory_resource` interface is insufficient to capture the unique features of the CUDA C++ programming model.
-For example, Standard C++ only recognizes a single, universally accessible memory space; whereas CUDA C++ applications trade in at least four different kinds of dynamically allocated memory.
+For example, Standard C++ only recognizes a single, universally accessible memory space; whereas CUDA C++ applications may access at least four different kinds of dynamically allocated memory.
-Furthermore, CUDA's "stream"-based asynchronous execution model was extended in CUDA 11.2 with the addition of [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html)[1](#link-footnote) to include memory allocation as stream-ordered events.
+Furthermore, CUDA's "stream"-based asynchronous execution model was extended in CUDA 11.2 with the addition of [`cudaMallocAsync` and `cudaFreeAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html)[1](#link-footnote), which provide stream-ordered memory allocation and deallocation.
 Therefore, there is a need for a common allocator interface similar to `std::pmr::memory_resource` that accounts for the unique features of CUDA C++.
 
 
@@ -35,7 +35,7 @@ We propose extending `<memory_resource>` to provide a common memory allocation i
 
 We chose `<memory_resource>` as the basis for a CUDA-specific allocator interface for several reasons:
 
-- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. It will ease working between Standard and CUDA C++ for there to be an allocator interface with a common look and feel. For more information on `<memory_resource>` see [here](https://www.youtube.com/watch?v=l14Zkx5OXr4) and [here](https://www.youtube.com/watch?v=l14Zkx5OXr4).
+- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. An allocator interface with a common look and feel will ease working between Standard and CUDA C++. For more information on `<memory_resource>` see [here](https://www.youtube.com/watch?v=l14Zkx5OXr4) and [here](https://www.youtube.com/watch?v=l14Zkx5OXr4).
 
 - The [RAPIDS Memory Management](https://github.com/rapidsai/rmm) library has had three years of [success](https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/) using its `rmm::device_memory_resource` interface based on `std::pmr::memory_resource`.
 
@@ -62,7 +62,7 @@
 ### `stream_view`
 
 A strongly typed, non-owning, view-type for `cudaStream_t`.
-This type provides a more typesafe C++ wrapper around `cudaStream_t` and will serve as the input argument type for any libcu++ API that takes a CUDA stream.
+This type provides a more typesafe C++ wrapper around `cudaStream_t` and serves as the input argument type for any libcu++ API that takes a CUDA stream.
 
 ### `cuda::memory_resource`
 
@@ -121,9 +121,6 @@
 However, there may be situations where one wishes to use a `cuda::memory_resource` derived type as if it were a `std::pmr::memory_resource` derived type.
 The `cuda::pmr_adaptor` class is intended to provide this functionality by inheriting from `std::pmr::memory_resource` and adapting an appropriate `cuda::memory_resource`.
- - - ### `cuda::stream_ordered_memory_resource` The `cuda::stream_ordered_memory_resource` class template is the abstract base class interface for _stream-ordered_ memory allocation. @@ -184,4 +181,3 @@ TBD. libcu++ will provide memory owning container types that work with `cuda::me - From ca21851e96e14005440efce5f0d6854778201a30 Mon Sep 17 00:00:00 2001 From: Jake Hemstad Date: Mon, 1 Mar 2021 13:15:29 -0600 Subject: [PATCH 07/12] Add _async. --- docs/mr_proposal.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md index 48570c0d3e..aa945fb1b1 100644 --- a/docs/mr_proposal.md +++ b/docs/mr_proposal.md @@ -124,7 +124,7 @@ The `cuda::pmr_adaptor` class is intended to provide this functionality by inher ### `cuda::stream_ordered_memory_resource` The `cuda::stream_ordered_memory_resource` class template is the abstract base class interface for _stream-ordered_ memory allocation. -This is similar to `cuda::memory_resource` but `allocate` and `deallocate` both take a stream argument and follow stream-ordered memory allocation semantics as defined by [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html). +This is similar to `cuda::memory_resource` but `allocate_async` and `deallocate_async` both take a stream argument and follow stream-ordered memory allocation semantics as defined by [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html). 
 ```c++
 template <memory_kind _Kind>
 class stream_ordered_memory_resource : public memory_resource<_Kind /* default context */>
 {
 public:
   static constexpr size_t default_alignment = alignof(max_align_t);
   // Two overloads exist so that callers can still implicitly use the `default_alignment` when passing a stream
-  void* allocate(size_t n, cuda::stream_view s){ return do_allocate(n, default_alignment, s); }
-  void* allocate(size_t n, size_t alignment, cuda::stream_view s){ return do_allocate(n, alignment, s); }
-  void deallocate(void* p, size_t n, cuda::stream_view s){ return do_deallocate(p, n, default_alignment, s); }
-  void deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate(p, n, alignment, s); }
+  void* allocate_async(size_t n, cuda::stream_view s){ return do_allocate_async(n, default_alignment, s); }
+  void* allocate_async(size_t n, size_t alignment, cuda::stream_view s){ return do_allocate_async(n, alignment, s); }
+  void deallocate_async(void* p, size_t n, cuda::stream_view s){ return do_deallocate_async(p, n, default_alignment, s); }
+  void deallocate_async(void* p, size_t n, size_t alignment, cuda::stream_view s){ return do_deallocate_async(p, n, alignment, s); }
  private:
-  virtual void* do_allocate(size_t nk, size_t alignment, cuda::stream_view s) = 0;
-  virtual void do_deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
+  virtual void* do_allocate_async(size_t nk, size_t alignment, cuda::stream_view s) = 0;
+  virtual void do_deallocate_async(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
 };
 ```

From 125975d97fdbee4ddf8c5065d81af427273ff6aa Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:15:38 -0600
Subject: [PATCH 08/12] Fix typo.
---
 docs/mr_proposal.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index aa945fb1b1..81d817a311 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -155,7 +155,7 @@ Just as `<memory_resource>` provides concrete, derived implementations of `std::
   - Uses `cudaMallocManaged/cudaFree` for unified memory
 - `cuda::pinned_resource : public cuda::memory_resource`
   - Uses `cudaMallocHost/cudaFreeHost` for page-locked host memory
-- `cuda::cuda_async_resource : public cuda::stream_oredered_memory_resource`
+- `cuda::cuda_async_resource : public cuda::stream_ordered_memory_resource`
   - Uses `cudaMallocAsync/cudaFreeAsync` for device memory
 
 Other resource implementations may be added as deemed appropriate.

From ffd93854a8a8cc2ce05ccc0f35494f0fd45e74f9 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:33:42 -0600
Subject: [PATCH 09/12] Update future work section.

---
 docs/mr_proposal.md | 22 +++-------------------
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 81d817a311..6e971b92be 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -160,24 +160,8 @@ Just as `<memory_resource>` provides concrete, derived implementations of `std::
 
 Other resource implementations may be added as deemed appropriate.
 
-### `cuda::polymorphic_allocator`
-
-TBD
-
-### `cuda::stream_ordered_allocator`
-
-TBD
-
-### Containers
-
-TBD. libcu++ will provide memory owning container types that work with `cuda::memory_resource/cuda::stream_ordered_memory`.
-
-
-
-
-
-
-
-
+## Future Work
+Future work will include the design of allocators similar to `std::pmr::polymorphic_allocator` to work with `cuda::memory_resource` and `cuda::stream_ordered_memory_resource`.
+Likewise, containers that work with `cuda::memory_resource` and `cuda::stream_ordered_memory_resource` will be future work.
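To make the `allocate_async`/`deallocate_async` interface of patch 07 concrete, here is a host-only sketch of how a derived resource implements the private virtual hooks. Every name here (`stream_view`, the base class, `counting_resource`) is a simplified, illustrative stand-in rather than the actual libcu++ definition: the mock ignores the stream and uses `malloc`/`free` where a real implementation would forward to `cudaMallocAsync`/`cudaFreeAsync` on the given stream.

```cpp
#include <cstddef>
#include <cstdlib>

// Illustrative stand-in for cuda::stream_view (not the real type).
struct stream_view { int id = 0; };

// Host-only mock of the proposed abstract interface: public, non-virtual
// allocate_async/deallocate_async forward to private virtual hooks, and the
// two-overload pattern lets callers omit the alignment argument.
class stream_ordered_memory_resource {
public:
  static constexpr std::size_t default_alignment = alignof(std::max_align_t);

  void* allocate_async(std::size_t n, stream_view s) {
    return do_allocate_async(n, default_alignment, s);
  }
  void* allocate_async(std::size_t n, std::size_t alignment, stream_view s) {
    return do_allocate_async(n, alignment, s);
  }
  void deallocate_async(void* p, std::size_t n, stream_view s) {
    do_deallocate_async(p, n, default_alignment, s);
  }
  void deallocate_async(void* p, std::size_t n, std::size_t alignment, stream_view s) {
    do_deallocate_async(p, n, alignment, s);
  }
  virtual ~stream_ordered_memory_resource() = default;

private:
  virtual void* do_allocate_async(std::size_t n, std::size_t alignment, stream_view s) = 0;
  virtual void do_deallocate_async(void* p, std::size_t n, std::size_t alignment, stream_view s) = 0;
};

// Toy derived resource: counts calls and uses malloc/free, standing in for an
// implementation backed by cudaMallocAsync/cudaFreeAsync.
class counting_resource final : public stream_ordered_memory_resource {
public:
  int allocations = 0;
  std::size_t last_alignment = 0;

private:
  void* do_allocate_async(std::size_t n, std::size_t alignment, stream_view) override {
    ++allocations;
    last_alignment = alignment;
    return std::malloc(n);
  }
  void do_deallocate_async(void* p, std::size_t, std::size_t, stream_view) override {
    std::free(p);
  }
};
```

Note how the two-argument overload quietly supplies `default_alignment`, so most callers only ever mention a size and a stream.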
From 45c5ab971295122b645cfb9780ea26dcbca46e95 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:35:09 -0600
Subject: [PATCH 10/12] Rename derived implementations.

---
 docs/mr_proposal.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 6e971b92be..56f209d1dd 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -149,13 +149,13 @@ Just as `<memory_resource>` provides concrete, derived implementations of `std::
 
 - `cuda::new_delete_resource : public cuda::memory_resource`
   - Uses `::operator new()`/`::operator delete()` for allocating host memory
-- `cuda::cuda_resource : public cuda::memory_resource`
+- `cuda::device_resource : public cuda::memory_resource`
   - Uses `cudaMalloc/cudaFree` for allocating device memory
 - `cuda::unified_resource : public cuda::memory_resource`
   - Uses `cudaMallocManaged/cudaFree` for unified memory
 - `cuda::pinned_resource : public cuda::memory_resource`
   - Uses `cudaMallocHost/cudaFreeHost` for page-locked host memory
-- `cuda::cuda_async_resource : public cuda::stream_ordered_memory_resource`
+- `cuda::async_device_resource : public cuda::stream_ordered_memory_resource`
   - Uses `cudaMallocAsync/cudaFreeAsync` for device memory
 
 Other resource implementations may be added as deemed appropriate.

From 544754f95f17baabe3f4e45b189f864dec1c3c15 Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:44:28 -0600
Subject: [PATCH 11/12] Update footnote.

---
 docs/mr_proposal.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 56f209d1dd..099f750cf1 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -25,9 +25,9 @@ Therefore, there is a need for a common allocator interface similar to `std::pmr
 
-[1]: Note that `cudaMallocAsync` does not obviate the need for custom, CUDA-aware allocators nor a common allocation interface.
+[1]: Note that `cudaMallocAsync` obviates neither the need for custom, CUDA-aware allocators, nor the need for a common allocation interface.
 There will never be one allocator that satisfies all users.
-Furthermore, a common interface allows composing and layering utilities like logging, leak checking, tracking, etc.
+Furthermore, a common interface allows composing and layering utilities like logging, limiting, leak checking, and tracking.
 
 ## Description
 
 We propose extending `<memory_resource>` to provide a common memory allocation interface that meets the needs of CUDA C++ programmers.

From acf4247c1d849b47a28bb0a284c9bd7465c1df8e Mon Sep 17 00:00:00 2001
From: Jake Hemstad
Date: Mon, 1 Mar 2021 13:52:31 -0600
Subject: [PATCH 12/12] Update namespace policy.

---
 docs/mr_proposal.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/docs/mr_proposal.md b/docs/mr_proposal.md
index 099f750cf1..9ed618672f 100644
--- a/docs/mr_proposal.md
+++ b/docs/mr_proposal.md
@@ -160,6 +160,14 @@ Just as `<memory_resource>` provides concrete, derived implementations of `std::
 
 Other resource implementations may be added as deemed appropriate.
 
+## `cuda::` Namespace Policy
+
+The current policy of libcu++ is that everything in the `cuda::` namespace must be heterogeneous, i.e., `__host__ __device__`.
+The facilities described above in `<memory_resource>` are intended to be host-only at this time.
+Therefore, we propose to modify the policy to allow host-only constructs in `cuda::`.
+Device-only constructs will still be disallowed in `cuda::`.
+Any device-only construct would go into `cuda::device::`.
+
 ## Future Work
 Future work will include the design of allocators similar to `std::pmr::polymorphic_allocator` to work with `cuda::memory_resource` and `cuda::stream_ordered_memory_resource`.
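The `cuda::pmr_adaptor` mentioned in the proposal can be pictured with a host-only sketch. The names `cuda_like_memory_resource`, `new_delete_like_resource`, and `pmr_adaptor` below are illustrative mocks under simplifying assumptions (the proposed interface is actually templated on a memory kind); the sketch only shows the mechanism: inheriting from `std::pmr::memory_resource` and forwarding its virtual hooks lets a CUDA-style resource feed standard `std::pmr` containers.

```cpp
#include <cstddef>
#include <memory_resource>
#include <new>

// Illustrative, host-only stand-in for the proposed cuda::memory_resource
// interface (the real class is templated on a memory kind).
class cuda_like_memory_resource {
public:
  virtual ~cuda_like_memory_resource() = default;
  void* allocate(std::size_t n, std::size_t alignment) { return do_allocate(n, alignment); }
  void deallocate(void* p, std::size_t n, std::size_t alignment) { do_deallocate(p, n, alignment); }
private:
  virtual void* do_allocate(std::size_t n, std::size_t alignment) = 0;
  virtual void do_deallocate(void* p, std::size_t n, std::size_t alignment) = 0;
};

// Concrete resource using aligned global operator new/delete, loosely
// mirroring the proposed cuda::new_delete_resource.
class new_delete_like_resource final : public cuda_like_memory_resource {
private:
  void* do_allocate(std::size_t n, std::size_t alignment) override {
    return ::operator new(n, std::align_val_t{alignment});
  }
  void do_deallocate(void* p, std::size_t, std::size_t alignment) override {
    ::operator delete(p, std::align_val_t{alignment});
  }
};

// Sketch of a pmr_adaptor: presents a CUDA-style resource through the
// std::pmr::memory_resource interface so std::pmr containers can use it.
class pmr_adaptor final : public std::pmr::memory_resource {
public:
  explicit pmr_adaptor(cuda_like_memory_resource& mr) : mr_{&mr} {}
private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    return mr_->allocate(bytes, alignment);
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    mr_->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }
  cuda_like_memory_resource* mr_;
};
```

The same wrapping pattern is what makes the layered utilities from the footnote (logging, limiting, leak checking, tracking) composable: each is just another resource holding a pointer to the resource beneath it.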