Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Decouple Atomic from host on MSVC #43

Merged
merged 8 commits into from
Nov 14, 2020

Conversation

wmaxey
Copy link
Member

@wmaxey wmaxey commented Oct 7, 2020

No description provided.

@wmaxey wmaxey requested review from griwes and brycelelbach October 7, 2020 19:28
include/cuda/std/detail/__atomic Outdated Show resolved Hide resolved
include/cuda/std/detail/__atomic_derived Show resolved Hide resolved
@griwes
Copy link
Collaborator

griwes commented Oct 7, 2020

Looks good overall pending the first review comment; please ping me when you un-WIP this.

include/cuda/std/detail/__atomic_derived Show resolved Hide resolved
#endif
}
template<class _Tp, int _Sco>
__host__ __device__ inline _Tp __cxx_atomic_load(__cxx_atomic_base_impl_default<_Tp, _Sco> const volatile* __a, int __order) {
#ifdef __CUDA_ARCH__
return __cxx_atomic_alignment_unwrap(detail::__atomic_load_n_cuda(&__a->__a_value, __order, detail::__scope_tag<_Sco>()));
#else
return __cxx_atomic_alignment_unwrap(::std::atomic_load_explicit(&__a->__a_value, (::std::memory_order)__order));
alignas(_Tp) unsigned char __buf[sizeof(_Tp)];
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@griwes it looks like I tore off another alignment unwrap here.

Are these char* output buffers a code smell? This was done to fix initialization warnings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use __atomic_load_n here instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__atomic_load_n does not work for non-integral non-pointer types: https://gcc.godbolt.org/z/osrcs1

Maybe I'm mistaken in some way about its usage?

@wmaxey wmaxey force-pushed the feature/atomic_decouple branch 2 times, most recently from 149fcee to 264a9f2 Compare October 14, 2020 21:53
@wmaxey wmaxey requested a review from griwes October 14, 2020 21:53
@wmaxey
Copy link
Member Author

wmaxey commented Oct 14, 2020

Testing on a GV100 looks good. No failures with 1 unsupported test. I'll kick off a CI run now.

I've replaced the __cxx_atomic_alignment_unwrap/wrap and in the case of some functions like __cxx_atomic_compare_exchange_strong I've removed them because the underlying atomic handles it.

I'm unsure if that is the correct thing to do however.

@wmaxey wmaxey changed the title WIP: Atomic decouple Decouple Atomic from host STL on MSVC Oct 14, 2020
@wmaxey wmaxey changed the title Decouple Atomic from host STL on MSVC Decouple Atomic from host on MSVC Oct 14, 2020
#endif
}
template<class _Tp, int _Sco>
__host__ __device__ inline _Tp __cxx_atomic_load(__cxx_atomic_base_impl_default<_Tp, _Sco> const volatile* __a, int __order) {
#ifdef __CUDA_ARCH__
return __cxx_atomic_alignment_unwrap(detail::__atomic_load_n_cuda(&__a->__a_value, __order, detail::__scope_tag<_Sco>()));
#else
return __cxx_atomic_alignment_unwrap(::std::atomic_load_explicit(&__a->__a_value, (::std::memory_order)__order));
alignas(_Tp) unsigned char __buf[sizeof(_Tp)];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use __atomic_load_n here instead?

#endif
}
template<class _Tp, int _Sco>
__host__ __device__ inline _Tp* __cxx_atomic_fetch_add(__cxx_atomic_base_impl_default<_Tp*, _Sco> volatile* __a, ptrdiff_t __delta, int __order) {
#ifdef __CUDA_ARCH__
return detail::__atomic_fetch_add_cuda(&__a->__a_value, __delta, __order, detail::__scope_tag<_Sco>());
#else
return ::std::atomic_fetch_add_explicit(&__a->__a_value, __delta, (::std::memory_order)__order);
return __atomic_fetch_add(&__a->__a_value, __delta * __skip_amt<_Tp*>::value, __order);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem right to me that we should need the skip amount in this layer. The layer below should be doing that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, every deviation between the CUDA_ARCH side and this side looks like a bug to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem right to me that we should need the skip amount in this layer. The layer below should be doing that.

I'd agree, but there is no layer at the compiler intrinsic level for GCC. At that point incrementing by the sizeof(_Tp) is necessary. https://github.com/NVIDIA/libcudacxx/blob/main/libcxx/include/atomic#L846

Basically, every deviation between the CUDA_ARCH side and this side looks like a bug to me.

@griwes, @jrhemstad, and I had a meeting today about how we could resolve some of this with a better platform layering framework. There's some neat ideas on the table for making this nesting doll thing be a bit cleaner.

It would be relevant to know what things are being done wrong ahead of time.

@brycelelbach brycelelbach modified the milestones: 1.4.0, 2.0.0 Oct 29, 2020
@wmaxey wmaxey requested a review from ogiroux October 29, 2020 19:57
Copy link
Contributor

@ogiroux ogiroux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good.

@wmaxey wmaxey force-pushed the feature/atomic_decouple branch from 264a9f2 to 7d802a6 Compare November 13, 2020 04:16
@wmaxey
Copy link
Member Author

wmaxey commented Nov 13, 2020

Atomic decouple builds clean on CI.

SC: 29322243.2

@wmaxey wmaxey merged commit 7b37c3d into NVIDIA:main Nov 14, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants