[libcu++] Fix undefined behavior in atomics to automatic storage #478
base: main
Conversation
The current implementation of atomic operations is unsound. It issues generic PTX atomic instructions even if the address falls in the local memory address space, causing well-formed CUDA C++ programs to exhibit PTX undefined behavior. Since this only affects objects with automatic storage, the impact is not very widespread, but it does hit beginners trying to learn libcu++ atomic operations, and it also affects most of the examples in our documentation, which use automatic storage for simplicity.

This change tests whether the address of an atomic operation is in local memory using `__isLocal`, and when that is the case, it uses weak memory operations instead. This is sound because CUDA C++ does not allow sharing the address of automatic variables across threads; if that ever changes, this would need to be updated. Unfortunately, NVIDIA compilers from toolkits older than 12.3 have a bug that miscompiles programs that use `__isLocal`, like our workaround here. For those toolkits, we instead use the PTX `isspacep` instruction to perform the detection.
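As a rough illustration of the approach (the helper name and exact shape below are assumptions, not code from the patch), each atomic entry point first asks whether the generic pointer maps to local memory and, if so, performs a plain weak access instead of issuing an atomic PTX instruction:

// Sketch only: illustrates the detect-local-then-go-weak idea; not the actual
// libcu++ helper. __isLocal is the CUDA intrinsic mentioned above; on toolkits
// older than 12.3 the patch performs the detection with PTX isspacep.local instead.
template <class T>
__device__ bool fetch_add_weak_if_local_sketch(volatile T *ptr, T val, T *ret) {
  if (!__isLocal(const_cast<const T *>(ptr))) {
    return false;              // not local: the caller issues the real atomic RMW
  }
  *ret = *ptr;                 // weak read-modify-write is sound here, because no
  *ptr = *ret + val;           // other thread can hold a pointer to automatic storage
  return true;
}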
Co-authored-by: Michael Schellenberger Costa <[email protected]>
I would have added the macro within this PR.
/ok to test
Co-authored-by: Georgy Evtushenko <[email protected]>
/ok to test
@@ -40,6 +44,8 @@ void _LIBCUDACXX_DEVICE __atomic_exchange_cuda(_Type volatile *__ptr, _Type *__v

template<class _Type, class _Delta, class _Scope, typename _CUDA_VSTD::enable_if<sizeof(_Type)<=2, int>::type = 0>
_Type _LIBCUDACXX_DEVICE __atomic_fetch_add_cuda(_Type volatile *__ptr, _Delta __val, int __memorder, _Scope __s) {
    _Type __ret;
    if (__cuda_fetch_add_weak_if_local(__ptr, __val, &__ret)) return __ret;
important: the compiler is unable to see through the memory and identify that it's not local. This affects codegen and overall performance. Here's a simple kernel:
using device_atomic_t = cuda::atomic<int, cuda::thread_scope_device>;

__global__ void use(device_atomic_t *d_atomics) {
  d_atomics->fetch_add(threadIdx.x, cuda::memory_order_relaxed);
}
On an RTX 6000 Ada, the change leads to a slowdown of up to ~3x.
In the case of block-scope atomics, the performance difference is even more pronounced:
// block_atomic_t is not defined in the original snippet; this alias is the assumed definition
using block_atomic_t = cuda::atomic<int, cuda::thread_scope_block>;

template <int BlockSize>
__launch_bounds__(BlockSize) __global__ void use(device_atomic_t *d_atomics, int mv) {
  __shared__ block_atomic_t b_atomics;
  if (threadIdx.x == 0) {
    new (&b_atomics) block_atomic_t{};
  }
  __syncthreads();
  b_atomics.fetch_add(threadIdx.x, cuda::memory_order_relaxed);
  __syncthreads();
  if (threadIdx.x == 0) {
    if (b_atomics.load(cuda::memory_order_relaxed) > mv) {
      d_atomics->fetch_add(1, cuda::memory_order_relaxed);
    }
  }
}
Results for the RTX 6000 Ada show up to a ~4x slowdown.
I think I agree with:

> Since this only impacts objects with automatic storage, the impact is not very widespread

Given this, I think we should explore options that do not penalize the widespread use cases. If the compiler is able to see through the local-space check, that would be a solution. Otherwise, we could consider refining the

> it affects an object in GPU memory and only GPU threads access it.

requirement to talk about global, cluster, or block memory, and add a check for automatic storage in debug builds.
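A minimal sketch of what the debug-build check of automatic storage could look like (the wrapper and its use of a device-side `assert` are assumptions, not existing libcu++ behavior):

#include <cassert>
#include <cuda/atomic>

// Sketch only: keep the fast generic atomic in release builds, but assert in
// debug builds that the object is not in automatic (local) storage.
template <class T, cuda::thread_scope Sco>
__device__ T debug_checked_fetch_add(cuda::atomic<T, Sco> &a, T v) {
#ifndef NDEBUG
  assert(!__isLocal(static_cast<const void *>(&a)) &&
         "cuda::atomic objects must not have automatic (local) storage");
#endif
  return a.fetch_add(v, cuda::memory_order_relaxed);
}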
This is known, but the analysis is incomplete since:
- this lands in CUDA CTK 12.4,
- the impact is zero on CUDA CTK 12.3 and newer, and
- the impact is zero on CUDA CTK 12.2 and older if CUDA atomics are used through the cuda::atomic bundled in the CTK, since those are not impacted by this.
The performance regression is therefore scoped to:
- users of CUDA 12.2 and older,
- who are not using the CUDA C++ standard library bundled with their CTK, but instead picking a different version from GitHub.
For those users, we could, in a subsequent PR, provide a way to opt back into the old (broken) behavior via some feature macro, e.g. `LIBCUDACXX_UNSAFE_ATOMIC_AUTOMATIC_STORAGE`, which users would define consistently before including the headers to avoid ODR issues:
#define LIBCUDACXX_UNSAFE_ATOMIC_AUTOMATIC_STORAGE
#include <cuda/atomic>
From the Slack discussion, an alternative is to enable the check on CTK 12.2 and older only in debug mode, to avoid the perf hit.
Is this something where we could work with attributes, e.g. `[[likely]]` / `[[unlikely]]`?
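For illustration, if the detection helper keeps its current shape, the attribute would go on the local-memory branch (the helper name below is hypothetical, and `[[unlikely]]` needs a C++20-capable compiler):

// Sketch only: hinting the compiler that the local-memory path is cold.
template <class T>
__device__ bool load_weak_if_local_sketch(const volatile T *ptr, T *ret) {
  if (__isLocal(const_cast<const T *>(ptr))) [[unlikely]] {
    *ret = *ptr;   // weak load; automatic storage cannot be shared across threads
    return true;
  }
  return false;
}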
@@ -142,6 +144,7 @@ int main() {
    for(auto& cv: cv_qualifier) {
        out << "template<class _Type, _CUDA_VSTD::__enable_if_t<sizeof(_Type)==" << sz/8 << ", int> = 0>\n";
        out << "_LIBCUDACXX_DEVICE void __atomic_load_cuda(const " << cv << "_Type *__ptr, _Type *__ret, int __memorder, " << scopenametag(s.first) << ") {\n";
        out << "  if (__cuda_load_weak_if_local(__ptr, __ret)) return;\n";
This should be `weak_if_local_or_const_or_grid_param`, since:
__constant__ cuda::atomic<int> x;
x.load(); // UB, should use weak load
and
__global__ void kernel(__grid_constant__ const cuda::atomic<int> x) {
  x.load();
}
have the same issue.
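A rough sketch of the broadened detection this suggests (names are illustrative, not existing libcu++ helpers; detecting `__grid_constant__` parameters would need an additional check that is not shown here):

// Sketch only: addresses in the local or constant state space cannot be raced
// on by other threads, so a plain weak load is enough for both.
template <class T>
__device__ bool load_weak_if_local_or_const_sketch(const volatile T *ptr, T *ret) {
  const void *p = const_cast<const T *>(ptr);
  if (__isLocal(p) || __isConstant(p)) {
    *ret = *ptr;   // weak load instead of an atomic PTX instruction
    return true;
  }
  return false;
}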