amrex::GpuComplex not ideal for GPU #3677
Comments
Per https://en.cppreference.com/w/cpp/numeric/complex, casting a `std::complex<T>` to `T[2]`, or a pointer into an array of `std::complex<T>` to `T*`, is allowed.
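For reference, the array-oriented access that cppreference describes can be sketched as follows. This is only an illustration for `std::complex<double>`; the helper names (`real_via_array`, `imag_via_array`, `sum_real_parts`) are made up for this example and are not part of AMReX or the standard library.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Real part of z, read through the T[2] view that std::complex<T> guarantees.
inline double real_via_array(std::complex<double>& z) {
    return reinterpret_cast<double(&)[2]>(z)[0];
}

// Imaginary part of z, read through the same T[2] view.
inline double imag_via_array(std::complex<double>& z) {
    return reinterpret_cast<double(&)[2]>(z)[1];
}

// Sum the real parts of n complex values by walking the storage as doubles:
// element i has its real part at index 2*i and its imaginary part at 2*i + 1.
inline double sum_real_parts(std::complex<double>* a, std::size_t n) {
    assert(a != nullptr || n == 0);
    double* p = reinterpret_cast<double*>(a);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += p[2 * i];
    }
    return sum;
}
```

Note that this guarantee only covers viewing `std::complex<T>` as `T`; it says nothing about casting to a different class type with a stricter alignment.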
NVIDIA/libcudacxx#151: it looks like `cuda::std::complex` aligns to `2*sizeof(T)` now.
@AlexanderSinn How did you create the array of GpuComplex? I am asking because the alignment of amrex::Arena allocations is 16.
That probably means cudaMalloc and cudaMallocHost allocate memory with a smaller alignment, I guess, because in the Arena we make sure the blocks we give out are multiples of 16 bytes.
So we might be able to fix the alignment in Arena without changing GpuComplex.
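To make the "multiples of 16 bytes" argument concrete, here is a minimal sketch of the arithmetic. It is not the actual Arena code; `align_up`, `is_aligned`, and `block_offset` are hypothetical helpers. The point it illustrates is that rounding block sizes up to 16 keeps every block 16-byte aligned only if the base pointer is itself 16-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round size up to the next multiple of alignment (alignment must be a power of two).
constexpr std::size_t align_up(std::size_t size, std::size_t alignment) {
    return (size + alignment - 1) & ~(alignment - 1);
}

// True if the address p is a multiple of alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Byte offset of block k when blocks are packed back to back and each
// requested size is first rounded up to a multiple of 16 bytes.
inline std::size_t block_offset(const std::size_t* sizes, std::size_t k) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < k; ++i) {
        offset += align_up(sizes[i], 16);
    }
    return offset;
}
```

With requested sizes 24, 16, and 40 bytes, the blocks start at offsets 0, 32, and 48 from the base.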
The alignment needs to be known at compile time so nvcc can do the 16-byte read/write. cudaMalloc is actually 256-byte aligned.
It generates different types of load and store instructions for aligned and non-aligned types.
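The distinction between what the allocator delivers and what the compiler may assume can be sketched in plain host C++. In this illustration `PairDefault` is a hypothetical stand-in for a complex type without an explicit `alignas`, and `actual_alignment` is a made-up helper. The compiler only sees `alignof` of the type, no matter how well aligned a particular pointer is at run time.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a complex type with default alignment.
template <typename T>
struct PairDefault {
    T re;
    T im;
};

// All the compiler may assume about any PairDefault<double>* is this value.
constexpr std::size_t assumed_alignment = alignof(PairDefault<double>);

// Largest power of two (capped at cap) that divides the run-time address p.
inline std::size_t actual_alignment(const void* p, std::size_t cap = 256) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    std::size_t a = 1;
    while (a < cap && addr % (2 * a) == 0) {
        a *= 2;
    }
    return a;
}

// The type is 16 bytes wide but promises less than 16-byte alignment, so a
// single 16-byte access cannot be assumed safe from the type alone.
static_assert(sizeof(PairDefault<double>) == 16, "two doubles, no padding");
static_assert(assumed_alignment < 16, "default alignment is that of double");
```

Even for a buffer that happens to sit on a 256-byte boundary, `assumed_alignment` stays below 16, which is why the alignment has to be part of the type.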
Yes, you are right. Since the basic types like unsigned long long need 16-byte alignment, malloc is at least 16-byte aligned.
I think we should add the required alignment to amrex::GpuComplex. @AlexanderSinn Could you submit a PR?
Yes
## Summary

As discussed in #3677, this PR makes the alignment of `amrex::GpuComplex` stricter to allow for coalesced memory accesses of arrays of GpuComplex by nvidia GPUs such as A100. Note that this may break `reinterpret_cast` from an array allocated as `std::complex` to `amrex::GpuComplex`, but not the other way around.

## Additional background

Typical allocators (malloc, amrex CArena) give memory aligned to 16 bytes and CUDA allocators aligned to 256 bytes, which is sufficient for `amrex::GpuComplex<double>`.

## Checklist

The proposed changes:
- [x] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate
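The one-directional nature of the cast can be illustrated with a small alignment check. This is a sketch only: `StrictComplex` is a hypothetical stand-in for a complex type with `2*alignof(double)` alignment, not the actual `amrex::GpuComplex`, and `aligned_for_strict` is a made-up helper. It looks at alignment alone and does not address any other aliasing concerns.

```cpp
#include <complex>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a complex type with the stricter alignment.
struct alignas(2 * alignof(double)) StrictComplex {
    double re;
    double im;
};

// Strict -> std::complex: the stricter alignment always satisfies the looser one.
static_assert(alignof(StrictComplex) >= alignof(std::complex<double>),
              "a StrictComplex address is always aligned enough for std::complex");

// std::complex -> strict: only valid, alignment-wise, if the address happens
// to be a multiple of alignof(StrictComplex). This has to be checked at run time.
inline bool aligned_for_strict(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(StrictComplex) == 0;
}
```

Memory that starts on a 16-byte boundary passes the check, while the same buffer offset by one `double` does not, which is the case the PR description warns about.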
Fixed in #3691
Specifically when accessing an array of `amrex::GpuComplex` on the GPU, it would be better if the alignment of `amrex::GpuComplex` was `2*alignof(T)` instead of `alignof(T)`. This is because one GPU thread can write both the real and imaginary parts in a single memory transaction if the memory region is aligned to its size. If this is not the case, the memory access from the GPU warp is non-coalesced (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses).

Using `AlignedComplex` was more than 10 times faster than `amrex::GpuComplex<amrex::Real>` (Real = double) when writing to a large array in pinned memory from an A100 (pinned memory is especially sensitive to non-coalesced memory access).

Note: increasing the alignment would break `reinterpret_cast` from `std::complex` to `amrex::GpuComplex`.
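The definition of `AlignedComplex` used for the measurement is not shown here. A minimal sketch of what such a type could look like, assuming it is simply a pair of values with `alignas(2*alignof(T))`, is given below. The name `AlignedComplexSketch` and the host-side `fill` helper are hypothetical, and the loop only stands in for a GPU kernel in which each thread writes one element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of a complex type whose alignment equals its size.
template <typename T>
struct alignas(2 * alignof(T)) AlignedComplexSketch {
    T re;
    T im;
};

// The stricter alignment adds no padding, so the array layout is still
// re, im, re, im, ... exactly as for a default-aligned pair.
static_assert(sizeof(AlignedComplexSketch<double>) == 2 * sizeof(double),
              "no padding added");
static_assert(alignof(AlignedComplexSketch<double>) == 2 * alignof(double),
              "alignment is twice that of the component type");

// Host-side stand-in for a kernel: each element is written as one unit.
template <typename T>
void fill(AlignedComplexSketch<T>* a, std::size_t n, T re, T im) {
    assert(reinterpret_cast<std::uintptr_t>(a) % alignof(AlignedComplexSketch<T>) == 0);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = AlignedComplexSketch<T>{re, im};
    }
}
```

Because the size is unchanged, every element of an array of this type starts on a boundary that is a multiple of its own size, which is the property the issue asks for.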