# CUDA `<memory_resource>` Extension Proposal

## Motivation

Performance-sensitive applications that make frequent dynamic memory allocations often find that allocating memory is a significant overhead.
CUDA developers are even more acutely aware of the costs of dynamic allocation due to the relatively higher cost of `cudaMalloc/cudaFree` compared to standard `malloc/free`.
As a result, developers devise custom, high-performance memory allocators that are as carefully optimized as the applications they serve.
However, what works well for one application will not always satisfy another, which leads to a proliferation of custom allocator implementations.
Interoperation among these applications is difficult without an interface to enable sharing a common allocator.

In Standard C++, [`Allocator`s](https://en.cppreference.com/w/cpp/named_req/Allocator) have traditionally provided this common interface.
C++17 introduced [`<memory_resource>`](https://en.cppreference.com/w/cpp/header/memory_resource) and the [`std::pmr::memory_resource`](https://en.cppreference.com/w/cpp/memory/memory_resource) abstract class that defines a minimal interface for (de)allocating raw bytes and sits below `Allocator`s.
This polymorphic interface provides the lingua franca for those who trade in custom memory allocators.


<!--- In addition, `<memory_resource>` provides a handful of standard `memory_resource` implementations akin to custom allocators that seek to perform better than standard allocation, e.g., [`unsynchronized_pool_resource`](https://en.cppreference.com/w/cpp/memory/unsynchronized_pool_resource). --->

However, the `std::pmr::memory_resource` interface is insufficient to capture the unique features of the CUDA C++ programming model.
For example, Standard C++ only recognizes a single, universally accessible memory space, whereas CUDA C++ applications trade in at least four different kinds of dynamically allocated memory.
Furthermore, CUDA's "stream"-based asynchronous execution model was extended in CUDA 11.2 with [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html)<sup>[1](#link-footnote)</sup>, which makes memory allocation and deallocation stream-ordered operations.
Therefore, there is a need for a common allocator interface similar to `std::pmr::memory_resource` that accounts for the unique features of CUDA C++.

<!--- standard device memory, standard pageable host memory, unified memory pageable between host and device, and pinned host memory directly accessible from device --->

<!--- in response to the proliferation of custom CUDA device memory allocators to provide users with an allocator faster than `cudaMalloc/cudaFree` and reduce the burden of many developers to maintain their own allocator. --->


<a name="link-footnote">[1]</a>: Note that `cudaMallocAsync` does not obviate the need for custom, CUDA-aware allocators nor a common allocation interface.
There will never be one allocator that satisfies all users.
Furthermore, a common interface allows composing and layering utilities like logging, leak checking, tracking, etc.

## Description

We propose extending `<memory_resource>` to provide a common memory allocation interface that meets the needs of CUDA C++ programmers.

We chose `<memory_resource>` as the basis for a CUDA-specific allocator interface for several reasons:

- `<memory_resource>` is the direction taken by Standard C++ for custom, stateful allocators. An allocator interface with a common look and feel will ease moving between Standard C++ and CUDA C++. For more information on `<memory_resource>`, see [this talk](https://www.youtube.com/watch?v=l14Zkx5OXr4).

- The [RAPIDS Memory Management](https://github.com/rapidsai/rmm) library has had three years of [success](https://developer.nvidia.com/blog/fast-flexible-allocation-for-cuda-with-rapids-memory-manager/) using its `rmm::device_memory_resource` interface based on `std::pmr::memory_resource`.

- Likewise, [Thrust](https://github.com/NVIDIA/thrust) has had similar success with its `thrust::mr::memory_resource` interface.

Given the direction of Standard C++ and the success of two widely used CUDA libraries with a similar interface, `<memory_resource>` is the logical choice.

This proposal includes the addition of the following to libcu++:

### `memory_kind`

A scoped enumeration demarcating the different kinds of dynamically allocated CUDA memory.
This is intended to be similar to the existing `thread_scope` enum.

```c++
enum class memory_kind {
device, ///< Device memory accessible only from device
unified, ///< Unified memory accessible from both host and device
pinned, ///< Page-locked system memory accessible from both host and device
host ///< System memory only accessible from host code
};
```

### `stream_view`

A strongly typed, non-owning view type for `cudaStream_t`.
This type provides a more type-safe C++ wrapper around `cudaStream_t` and will serve as the input argument type for any libcu++ API that takes a CUDA stream.
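The exact definition of `stream_view` is not fixed by this proposal; a minimal sketch of one possible shape (member names here are illustrative assumptions, not final API) is:

```c++
#include <cuda_runtime_api.h>

// Minimal sketch: a trivially copyable, non-owning wrapper around a stream
// handle. It neither creates nor destroys the underlying stream.
class stream_view {
public:
  constexpr stream_view() noexcept = default;
  constexpr stream_view(cudaStream_t s) noexcept : stream_{s} {}
  constexpr cudaStream_t get() const noexcept { return stream_; }
private:
  cudaStream_t stream_{};  // viewed handle only
};
```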

### `cuda::memory_resource`

The `cuda::memory_resource` class template is the abstract base class interface akin to `std::pmr::memory_resource` with two main differences:

1. The `Kind` template parameter determines the `memory_kind` allocated by the resource.

2. The `Context` template parameter determines the "execution context" in which memory allocated by the resource can be accessed without synchronization.
By default, the `Context` is the `any_context` tag type that indicates storage may be accessed immediately on any thread or CUDA stream without synchronization.

```c++
/**
* @brief Tag type for the default context of `memory_resource`.
*
* Default context in which storage may be used immediately on any thread or any
* CUDA stream without synchronization.
*/
struct any_context{};

template <memory_kind Kind, typename Context = any_context>
class memory_resource {
public:
  void* allocate(size_t n, size_t alignment) { return do_allocate(n, alignment); }
  void deallocate(void* p, size_t n, size_t alignment) { do_deallocate(p, n, alignment); }
  Context get_context() { return do_get_context(); }
private:
  virtual void* do_allocate(size_t n, size_t alignment) = 0;
  virtual void do_deallocate(void* p, size_t n, size_t alignment) = 0;
  virtual Context do_get_context() = 0;
};
```

The purpose of the `Context` template parameter is to allow for more generic allocation semantics.
For example, consider a "stream-bound" memory resource where allocated memory may only be accessed without synchronization on a particular stream bound at construction:

```c++
struct stream_context {
  cuda::stream_view s;
};

template <memory_kind Kind>
class stream_bound_memory_resource : public cuda::memory_resource<Kind, stream_context> {
public:
  stream_bound_memory_resource(cuda::stream_view s) : s_{s} {}
private:
  void* do_allocate(size_t n, size_t alignment) override {
    /* always allocate on `s_` */
  }
  void do_deallocate(void* p, size_t n, size_t alignment) override {
    /* always deallocate on `s_` */
  }
  stream_context do_get_context() override { return s_; }
  stream_context s_;
};
```

### `cuda::pmr_adaptor`

`cuda::memory_resource` is similar to `std::pmr::memory_resource`, but they do not share a common inheritance hierarchy. Therefore, an object that derives from `cuda::memory_resource` cannot be used polymorphically as a `std::pmr::memory_resource`, i.e., a `cuda::memory_resource`-derived type cannot be passed to a function that expects a `std::pmr::memory_resource` pointer or reference.
However, there may be situations where one wishes to use a `cuda::memory_resource` derived type as if it were a `std::pmr::memory_resource` derived type.
The `cuda::pmr_adaptor` class is intended to provide this functionality by inheriting from `std::pmr::memory_resource` and adapting an appropriate `cuda::memory_resource`.
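As a rough sketch (the exact template parameters and ownership semantics here are assumptions for illustration, not the final design), the adaptor could forward the `std::pmr` virtual interface to a wrapped resource:

```c++
#include <memory_resource>

// Illustrative sketch only: adapts a (non-owned) cuda::memory_resource so it
// can be used wherever a std::pmr::memory_resource is expected.
template <cuda::memory_kind Kind, typename Context = cuda::any_context>
class pmr_adaptor final : public std::pmr::memory_resource {
public:
  explicit pmr_adaptor(cuda::memory_resource<Kind, Context>* mr) : mr_{mr} {}
private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    return mr_->allocate(bytes, alignment);
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    mr_->deallocate(p, bytes, alignment);
  }
  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;  // simplest possible equality; a real adaptor may compare resources
  }
  cuda::memory_resource<Kind, Context>* mr_;  // non-owning
};
```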




### `cuda::stream_ordered_memory_resource`

The `cuda::stream_ordered_memory_resource` class template is the abstract base class interface for _stream-ordered_ memory allocation.
This is similar to `cuda::memory_resource` but `allocate` and `deallocate` both take a stream argument and follow stream-ordered memory allocation semantics as defined by [`cudaMallocAsync`](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html).

```c++
template <memory_kind Kind>
class stream_ordered_memory_resource : public memory_resource<Kind /* default context */> {
public:
  static constexpr size_t default_alignment = alignof(max_align_t);
  // Two overloads exist so that callers can still implicitly use the
  // `default_alignment` when passing a stream
  void* allocate(size_t n, cuda::stream_view s) { return do_allocate(n, default_alignment, s); }
  void* allocate(size_t n, size_t alignment, cuda::stream_view s) { return do_allocate(n, alignment, s); }
  void deallocate(void* p, size_t n, cuda::stream_view s) { do_deallocate(p, n, default_alignment, s); }
  void deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s) { do_deallocate(p, n, alignment, s); }
private:
  virtual void* do_allocate(size_t n, size_t alignment, cuda::stream_view s) = 0;
  virtual void do_deallocate(void* p, size_t n, size_t alignment, cuda::stream_view s) = 0;
};
```
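For illustration, a caller would use the stream-accepting overloads as follows (assuming `mr` refers to some concrete stream-ordered resource and `s` wraps a valid stream):

```c++
// Hypothetical usage: allocation and deallocation are ordered with respect
// to other work submitted to the same stream, per cudaMallocAsync semantics.
void example(cuda::stream_ordered_memory_resource<cuda::memory_kind::device>& mr,
             cuda::stream_view s) {
  void* p = mr.allocate(256, s);  // implicitly uses default_alignment
  // ... launch kernels on `s` that read/write `p` ...
  mr.deallocate(p, 256, s);       // safe: ordered after the kernels on `s`
}
```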

### Concrete Resource Implementations

Just as `<memory_resource>` provides concrete, derived implementations of `std::pmr::memory_resource`, libcu++ will provide the following:

- `cuda::new_delete_resource : public cuda::memory_resource<memory_kind::host>`
- Uses `::operator new()`/`::operator delete()` for allocating host memory
- `cuda::cuda_resource : public cuda::memory_resource<memory_kind::device>`
- Uses `cudaMalloc/cudaFree` for allocating device memory
- `cuda::unified_resource : public cuda::memory_resource<memory_kind::unified>`
- Uses `cudaMallocManaged/cudaFree` for unified memory
- `cuda::pinned_resource : public cuda::memory_resource<memory_kind::pinned>`
- Uses `cudaMallocHost/cudaFreeHost` for page-locked host memory
- `cuda::cuda_async_resource : public cuda::stream_ordered_memory_resource<memory_kind::device>`
- Uses `cudaMallocAsync/cudaFreeAsync` for device memory

Other resource implementations may be added as deemed appropriate.
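As an illustration of how thin these implementations can be, here is a minimal sketch of `cuda_resource` (error handling and alignment validation are elided for brevity; the shipped implementation may differ):

```c++
#include <cuda_runtime_api.h>

class cuda_resource final : public cuda::memory_resource<cuda::memory_kind::device> {
private:
  void* do_allocate(size_t n, size_t /*alignment*/) override {
    // cudaMalloc returns memory aligned to at least 256 bytes, which covers
    // typical requests; a full implementation would validate `alignment`
    // and translate cudaError_t failures into exceptions.
    void* p = nullptr;
    cudaMalloc(&p, n);
    return p;
  }
  void do_deallocate(void* p, size_t /*n*/, size_t /*alignment*/) override {
    cudaFree(p);
  }
  cuda::any_context do_get_context() override { return {}; }
};
```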

### `cuda::polymorphic_allocator`

TBD

### `cuda::stream_ordered_allocator`

TBD

### Containers

TBD. libcu++ will provide memory-owning container types that work with `cuda::memory_resource`/`cuda::stream_ordered_memory_resource`.