NVIDIA · wmaxey · Nov 29, 2022 · Mar 15, 2022 · Mar 15, 2022 · Mar 16, 2022
diff --git a/docs/extended_api/synchronization_primitives.md b/docs/extended_api/synchronization_primitives.md
@@ -3,6 +3,7 @@
 ### Atomics
 
 | [`cuda::atomic`]             | System-wide [`cuda::std::atomic`] objects and operations. `(class template)`                                   <br/><br/> 1.0.0 / CUDA 10.2 |
+| [`cuda::atomic_ref`]         | System-wide [`cuda::std::atomic_ref`] objects and operations. `(class template)`                               <br/><br/> 1.7.0 / CUDA 11.6 |
 
 ### Latches
 
@@ -31,6 +32,7 @@ The pipeline library is included in the CUDA Toolkit, but is not part of the
 
 
 [`cuda::std::atomic`]: https://en.cppreference.com/w/cpp/atomic/atomic
+[`cuda::std::atomic_ref`]: https://en.cppreference.com/w/cpp/atomic/atomic_ref
 [`cuda::std::barrier`]: https://en.cppreference.com/w/cpp/thread/barrier
 [`cuda::std::latch`]: https://en.cppreference.com/w/cpp/thread/latch
 [`cuda::std::counting_semaphore`]: https://en.cppreference.com/w/cpp/thread/counting_semaphore
@@ -40,6 +42,7 @@ The pipeline library is included in the CUDA Toolkit, but is not part of the
 [`cuda::memcpy_async`]: {{ "extended_api/asynchronous_operations/memcpy_async.html" | relative_url }}
 
 [`cuda::atomic`]: {{ "extended_api/synchronization_primitives/atomic.html" | relative_url }}
+[`cuda::atomic_ref`]: {{ "extended_api/synchronization_primitives/atomic_ref.html" | relative_url }}
 [`cuda::barrier`]: {{ "extended_api/synchronization_primitives/barrier.html" | relative_url }}
 [`cuda::latch`]: {{ "extended_api/synchronization_primitives/latch.html" | relative_url }}
 [`cuda::counting_semaphore`]: {{ "extended_api/synchronization_primitives/counting_semaphore.html" | relative_url }}
@@ -51,4 +54,3 @@ The pipeline library is included in the CUDA Toolkit, but is not part of the
 [`cuda::make_pipeline`]: {{ "extended_api/synchronization_primitives/make_pipeline.html" | relative_url }}
 [`cuda::pipeline_consumer_wait_prior`]: {{ "extended_api/synchronization_primitives/pipeline_consumer_wait_prior.html" | relative_url }}
 [`cuda::pipeline_producer_commit`]: {{ "extended_api/synchronization_primitives/pipeline_producer_commit.html" | relative_url }}
-
diff --git a/docs/extended_api/synchronization_primitives/atomic_ref.md b/docs/extended_api/synchronization_primitives/atomic_ref.md
@@ -0,0 +1,103 @@
+---
+grand_parent: Extended API
+parent: Synchronization Primitives
+nav_order: 1
+---
+
+# `cuda::atomic_ref`
+
+Defined in header `<cuda/atomic>`:
+
+```cuda
+template <typename T, cuda::thread_scope Scope = cuda::thread_scope_system>
+class cuda::atomic_ref;
+```
+
+The class template `cuda::atomic_ref` is an extended form of [`cuda::std::atomic_ref`]
+  that takes an additional [`cuda::thread_scope`] argument, defaulted to
+  `cuda::std::thread_scope_system`.
+It has the same interface and semantics as [`cuda::std::atomic_ref`], with the
+  following additional operations.
+This class additionally deviates from the standard by being backported to C++11.
+
+## Limitations
+
+`cuda::atomic_ref<T>` and `cuda::std::atomic_ref<T>` may only be instantiated with a T that are either 4 or 8 bytes.
+
+No object or subobject of an object referenced by an `atomic_ref` shall be concurrently referenced by any other `atomic_ref` that has a different `Scope`. 
+
+For `cuda::atomic_ref<T>` and `cuda::std::atomic_ref<T>` the type `T` must satisfy the following:
+- `4 <= sizeof(T) <= 8`
+- `T` must not have "padding bits", i.e., 'T`'s [object representation](https://en.cppreference.com/w/cpp/language/object#Object_representation_and_value_representation) must not have bits that do not participate in it's value representation  
+
+## Atomic Extrema Operations
+
+| [`cuda::atomic_ref::fetch_min`] | Atomically find the minimum of the stored value and a provided value. `(member function)` |
+| [`cuda::atomic_ref::fetch_max`] | Atomically find the maximum of the stored value and a provided value. `(member function)` |
+
+## Concurrency Restrictions
+
+An object of type `cuda::atomic_ref` or [`cuda::std::atomic_ref`] shall not be accessed
+  concurrently by CPU and GPU threads unless:
+- it is in unified memory and the [`concurrentManagedAccess` property] is 1, or
+- it is in CPU memory and the [`hostNativeAtomicSupported` property] is 1.
+
+Note, for objects of scopes other than `cuda::thread_scope_system` this is a
+  data-race, and thefore also prohibited regardless of memory characteristics.
+
+Under CUDA Compute Capability 6 (Pascal), an object of type `atomic_ref` may not be
+  used:
+- with automatic storage duration, or
+- if `is_always_lock_free()` is `false`.
+
+Under CUDA Compute Capability prior to 6 (Pascal), objects of type
+  `cuda::atomic_ref` or [`cuda::std::atomic_ref`] may not be used.
+
+## Implementation-Defined Behavior
+
+For each type `T` and [`cuda::thread_scope`] `S`, the value of
+  `cuda::atomic_ref<T, S>::is_always_lock_free()` is as follows:
+
+| Type `T` | [`cuda::thread_scope`] `S` | `cuda::atomic_ref<T, S>::is_always_lock_free()` |
+|----------|----------------------------|---------------------------------------------|
+| Any      | Any                        | `sizeof(T) <= 8`                            |
+
+## Example
+
+```cuda
+#include <cuda/atomic>
+
+__global__ void example_kernel(int *gmem, int *pinned_mem) {
+  // This atomic is suitable for all threads in the system.
+  cuda::atomic_ref<int, cuda::thread_scope_system> a(pinned_mem);
+
+  // This atomic has the same type as the previous one (`a`).
+  cuda::atomic_ref<int> b(pinned_mem);
+
+  // This atomic is suitable for all threads on the current processor (e.g. GPU).
+  cuda::atomic_ref<int, cuda::thread_scope_device> c(gmem);
+
+  __shared__ int shared_v;
+  // This atomic is suitable for threads in the same thread block.
+  cuda::atomic_ref<int, cuda::thread_scope_block> d(&shared);
+}
+```
+
+[See it on Godbolt](https://godbolt.org/z/fr4K7ErEh){: .btn }
+
+
+[`cuda::thread_scope`]: ../thread_scopes.md
+
+[`cuda::atomic_thread_fence`]: ./atomic/atomic_thread_fence.md
+
+[`cuda::atomic_ref::fetch_min`]: ./atomic/fetch_min.md
+[`cuda::atomic_ref::fetch_max`]: ./atomic/fetch_max.md
+
+[`cuda::std::atomic_ref`]: https://en.cppreference.com/w/cpp/atomic/atomic_ref
+
+[atomics.types.int]: https://eel.is/c++draft/atomics.types.int
+[atomics.types.pointer]: https://eel.is/c++draft/atomics.types.pointer
+
+[`concurrentManagedAccess` property]: https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_116f9619ccc85e93bc456b8c69c80e78b
+[`hostNativeAtomicSupported` property]: https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html#structcudaDeviceProp_1ef82fd7d1d0413c7d6f33287e5b6306f
+
diff --git a/docs/extended_api/synchronization_primitives/barrier.md b/docs/extended_api/synchronization_primitives/barrier.md
@@ -1,7 +1,7 @@
 ---
 grand_parent: Extended API
 parent: Synchronization Primitives
-nav_order: 2
+nav_order: 3
 ---
 
 # `cuda::barrier`

diff --git a/docs/extended_api/synchronization_primitives/binary_semaphore.md b/docs/extended_api/synchronization_primitives/binary_semaphore.md
@@ -1,7 +1,7 @@
 ---
 grand_parent: Extended API
 parent: Synchronization Primitives
-nav_order: 4
+nav_order: 5
 ---
 
 # `cuda::binary_semaphore`

diff --git a/docs/extended_api/synchronization_primitives/counting_semaphore.md b/docs/extended_api/synchronization_primitives/counting_semaphore.md
@@ -1,7 +1,7 @@
 ---
 grand_parent: Extended API
 parent: Synchronization Primitives
-nav_order: 3
+nav_order: 4
 ---
 
 # `cuda::counting_semaphore`

diff --git a/docs/extended_api/synchronization_primitives/latch.md b/docs/extended_api/synchronization_primitives/latch.md
@@ -1,7 +1,7 @@
 ---
 grand_parent: Extended API
 parent: Synchronization Primitives
-nav_order: 1
+nav_order: 2
 ---
 
 # `cuda::latch`