[libcu++] Fix undefined behavior in atomics to automatic storage #478
base: main
Conversation
The current implementation of atomic operations is unsound. It issues generic PTX atomic instructions even if the address falls in the local memory address space, causing well-formed CUDA C++ programs to exhibit PTX undefined behavior. Since this only affects objects with automatic storage, the impact is not very widespread, but it does hit beginners trying to learn libcu++ atomic operations, and it also affects most of the examples in our documentation, which use automatic storage for simplicity.

This change tests whether the address of an atomic operation is in local memory using `__isLocal`, and when that is the case, it uses weak memory operations instead. This is sound because CUDA C++ does not allow sharing the address of automatic variables across threads; if that ever changes, this would need to be updated. Unfortunately, NVIDIA compilers from toolkits older than 12.3 have a bug that miscompiles programs that use `__isLocal`, like our workaround here. For those toolkits, we instead use the PTX `isspacep` instruction to perform the detection.
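As a rough illustration of the approach (the helper name and exact shape below are assumptions, not code from the patch), each atomic entry point first asks whether the generic pointer maps to local memory and, if so, performs a plain weak access instead of issuing an atomic PTX instruction:

// Sketch only: illustrates the detect-local-then-go-weak idea; not the actual
// libcu++ helper. __isLocal is the CUDA intrinsic mentioned above; on toolkits
// older than 12.3 the patch performs the detection with PTX isspacep.local instead.
template <class T>
__device__ bool fetch_add_weak_if_local_sketch(volatile T *ptr, T val, T *ret) {
  if (!__isLocal(const_cast<const T *>(ptr))) {
    return false;              // not local: the caller issues the real atomic RMW
  }
  *ret = *ptr;                 // weak read-modify-write is sound here, because no
  *ptr = *ret + val;           // other thread can hold a pointer to automatic storage
  return true;
}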
Co-authored-by: Michael Schellenberger Costa <[email protected]>
I would have added the macro within this PR.
/ok to test
Co-authored-by: Georgy Evtushenko <[email protected]>
/ok to test
@@ -40,6 +44,8 @@ void _LIBCUDACXX_DEVICE __atomic_exchange_cuda(_Type volatile *__ptr, _Type *__v

template<class _Type, class _Delta, class _Scope, typename _CUDA_VSTD::enable_if<sizeof(_Type)<=2, int>::type = 0>
_Type _LIBCUDACXX_DEVICE __atomic_fetch_add_cuda(_Type volatile *__ptr, _Delta __val, int __memorder, _Scope __s) {
    _Type __ret;
    if (__cuda_fetch_add_weak_if_local(__ptr, __val, &__ret)) return __ret;
important: the compiler is unable to see through the memory and identify that it's not local. This affects codegen and overall performance. Here's a simple kernel:
using device_atomic_t = cuda::atomic<int, cuda::thread_scope_device>;

__global__ void use(device_atomic_t *d_atomics) {
  d_atomics->fetch_add(threadIdx.x, cuda::memory_order_relaxed);
}
On an RTX 6000 Ada, the change leads to a slowdown of up to ~3x.
In the case of block-scope atomics, the performance difference is even more pronounced:
// block_atomic_t is not defined in the original snippet; this alias is the assumed definition
using block_atomic_t = cuda::atomic<int, cuda::thread_scope_block>;

template <int BlockSize>
__launch_bounds__(BlockSize) __global__ void use(device_atomic_t *d_atomics, int mv) {
  __shared__ block_atomic_t b_atomics;
  if (threadIdx.x == 0) {
    new (&b_atomics) block_atomic_t{};
  }
  __syncthreads();
  b_atomics.fetch_add(threadIdx.x, cuda::memory_order_relaxed);
  __syncthreads();
  if (threadIdx.x == 0) {
    if (b_atomics.load(cuda::memory_order_relaxed) > mv) {
      d_atomics->fetch_add(1, cuda::memory_order_relaxed);
    }
  }
}
Results for the RTX 6000 Ada show up to a ~4x slowdown.
I think I agree with:

> Since this only impacts objects with automatic storage, the impact is not very widespread

Given this, I think we should explore options that do not penalize the widespread use cases. If the compiler is able to see through the local-space check, that would be a solution. Otherwise, we could consider refining the

> it affects an object in GPU memory and only GPU threads access it.

requirement to talk about global, cluster, or block memory, and add a check for automatic storage in debug builds.
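A minimal sketch of what the debug-build check of automatic storage could look like (the wrapper and its use of a device-side `assert` are assumptions, not existing libcu++ behavior):

#include <cassert>
#include <cuda/atomic>

// Sketch only: keep the fast generic atomic in release builds, but assert in
// debug builds that the object is not in automatic (local) storage.
template <class T, cuda::thread_scope Sco>
__device__ T debug_checked_fetch_add(cuda::atomic<T, Sco> &a, T v) {
#ifndef NDEBUG
  assert(!__isLocal(static_cast<const void *>(&a)) &&
         "cuda::atomic objects must not have automatic (local) storage");
#endif
  return a.fetch_add(v, cuda::memory_order_relaxed);
}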
This is known, but the analysis is incomplete since:
- this lands in CUDA CTK 12.4,
- the impact is zero on CUDA CTK 12.3 and newer, and
- the impact is zero on CUDA CTK 12.2 and older if CUDA atomics are used through the cuda::atomic bundled in the CTK, since those are not impacted by this.
The performance regression is therefore scoped to:
- users of CUDA 12.2 and older,
- who are not using the CUDA C++ standard library bundled with their CTK, but instead picking a different version from GitHub.
For those users, we could, in a subsequent PR, provide a way to opt back into the old (broken) behavior via some feature macro, e.g. `LIBCUDACXX_UNSAFE_ATOMIC_AUTOMATIC_STORAGE`, which users would define consistently before including the headers to avoid ODR issues:
#define LIBCUDACXX_UNSAFE_ATOMIC_AUTOMATIC_STORAGE
#include <cuda/atomic>
From the Slack discussion, an alternative is to enable the check on CTK 12.2 and older only in debug mode, to avoid the perf hit.
Is this something where we could work with attributes, e.g. `[[likely]]` / `[[unlikely]]`?
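For illustration, if the detection helper keeps its current shape, the attribute would go on the local-memory branch (the helper name below is hypothetical, and `[[unlikely]]` needs a C++20-capable compiler):

// Sketch only: hinting the compiler that the local-memory path is cold.
template <class T>
__device__ bool load_weak_if_local_sketch(const volatile T *ptr, T *ret) {
  if (__isLocal(const_cast<const T *>(ptr))) [[unlikely]] {
    *ret = *ptr;   // weak load; automatic storage cannot be shared across threads
    return true;
  }
  return false;
}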
@@ -142,6 +144,7 @@ int main() {
    for(auto& cv: cv_qualifier) {
        out << "template<class _Type, _CUDA_VSTD::__enable_if_t<sizeof(_Type)==" << sz/8 << ", int> = 0>\n";
        out << "_LIBCUDACXX_DEVICE void __atomic_load_cuda(const " << cv << "_Type *__ptr, _Type *__ret, int __memorder, " << scopenametag(s.first) << ") {\n";
        out << "  if (__cuda_load_weak_if_local(__ptr, __ret)) return;\n";
This should be `weak_if_local_or_const_or_grid_param`, since:
__constant__ cuda::atomic<int> x;
x.load(); // UB, should use weak load
and
__global__ void kernel(__grid_constant__ const cuda::atomic<int> x) {
  x.load();
}
have the same issue.
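A rough sketch of the broadened detection this suggests (names are illustrative, not existing libcu++ helpers; detecting `__grid_constant__` parameters would need an additional check that is not shown here):

// Sketch only: addresses in the local or constant state space cannot be raced
// on by other threads, so a plain weak load is enough for both.
template <class T>
__device__ bool load_weak_if_local_or_const_sketch(const volatile T *ptr, T *ret) {
  const void *p = const_cast<const T *>(ptr);
  if (__isLocal(p) || __isConstant(p)) {
    *ret = *ptr;   // weak load instead of an atomic PTX instruction
    return true;
  }
  return false;
}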