`amrex::GpuComplex` not ideal for GPU #3677

AlexanderSinn · 2023-12-18T19:52:27Z

Specifically, when accessing an array of amrex::GpuComplex on the GPU, it would be better if the alignment of amrex::GpuComplex<T> were 2*alignof(T) instead of alignof(T). This is because one GPU thread can write both the real and imaginary parts in a single memory transaction only if the memory region is aligned to its size. If this is not the case, the memory access from the GPU warp is non-coalesced (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses).

struct alignas(2*sizeof(amrex::Real)) AlignedComplex {
    amrex::Real real;
    amrex::Real imag;
};

Using AlignedComplex was more than 10 times faster than amrex::GpuComplex<amrex::Real> (Real = double) when writing to a large array in pinned memory from an A100 (pinned memory is especially sensitive to non-coalesced memory access).
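To make the difference between the two layouts concrete, here is a minimal sketch in plain C++. `PlainComplex` and `AlignedComplex16` are hypothetical stand-ins (with `double` substituted for `amrex::Real`), not the AMReX definitions. Both types have the same size; only the alignment requirement known to the compiler differs.

```cpp
#include <cstddef>

// Hypothetical stand-in for a complex type with default alignment.
// Its alignment is whatever alignof(double) is on the platform.
struct PlainComplex {
    double real;
    double imag;
};

// Hypothetical stand-in for the aligned variant from the issue,
// with double in place of amrex::Real.
struct alignas(2 * sizeof(double)) AlignedComplex16 {
    double real;
    double imag;
};

// Same size, different alignment requirement.
static_assert(sizeof(PlainComplex) == 2 * sizeof(double), "no padding expected");
static_assert(sizeof(AlignedComplex16) == 2 * sizeof(double), "no padding expected");
static_assert(alignof(PlainComplex) == alignof(double), "default alignment");
static_assert(alignof(AlignedComplex16) == 2 * sizeof(double), "aligned to its size");
```

Because the stricter alignment is part of the type, a compiler is allowed to assume it for every element of an array of `AlignedComplex16`.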

Note: increasing the alignment would break reinterpret_cast from std::complex to amrex::GpuComplex.
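The concern above can be illustrated with a small runtime check. This is only a sketch: `is_aligned` is a hypothetical helper, not part of AMReX, and it assumes that converting a pointer to `std::uintptr_t` yields the numeric address, which holds on common platforms. A buffer that satisfies the 8-byte alignment of `double` does not necessarily satisfy a 16-byte requirement.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true if p is a multiple of Alignment bytes.
// Assumes the pointer-to-integer conversion yields the numeric address.
template <std::size_t Alignment>
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % Alignment == 0;
}
```

For example, given `alignas(16) double buf[4];`, the address `buf` passes `is_aligned<16>`, while `buf + 1` is only 8-byte aligned and fails it.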


BenWibking · 2023-12-18T20:04:49Z

reinterpret_cast between different types (except when casting to char, or unsigned vs. signed variants of the same type) is always undefined behavior. That should never be done, and abuse of reinterpret_cast has caused incorrect results with our AMReX code on AMD GPUs.

WeiqunZhang · 2023-12-18T20:08:23Z

https://en.cppreference.com/w/cpp/numeric/complex Casting from std::complex to T[2] or T is allowed.
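The array-oriented access described on that cppreference page can be sketched as follows. The helper names `real_via_array` and `imag_via_array` are hypothetical and exist only for illustration. This covers `std::complex<T>` specifically and says nothing about casts between unrelated user-defined types.

```cpp
#include <complex>

// Hypothetical helpers using the array-oriented access that the standard
// specifies for std::complex<T>: element 0 is the real part and
// element 1 is the imaginary part.
inline double real_via_array(std::complex<double>& z) {
    return reinterpret_cast<double(&)[2]>(z)[0];
}

inline double imag_via_array(std::complex<double>& z) {
    return reinterpret_cast<double(&)[2]>(z)[1];
}
```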

WeiqunZhang · 2023-12-18T20:10:58Z

NVIDIA/libcudacxx#151 Looks like cuda::std::complex aligns to 2*sizeof(T) now.

WeiqunZhang · 2023-12-18T21:29:14Z

@AlexanderSinn How did you create the array of GpuComplex? I am asking because the alignment of amrex::Arena allocations is 16 bytes.

AlexanderSinn · 2023-12-18T21:32:32Z

amrex::BaseFab<AlignedComplex> or amrex::BaseFab<amrex::GpuComplex<amrex::Real>>
fab.resize(domain, 1, amrex::The_Pinned_Arena());
fab.setVal<amrex::RunOn::Host>({0,0});

WeiqunZhang · 2023-12-18T21:39:14Z

That probably means cudaMalloc and cudaMallocHost allocate memory with a smaller alignment, I guess, because in the Arena we make sure the blocks we give out are multiples of 16 bytes.

WeiqunZhang · 2023-12-18T21:40:13Z

So we might be able to fix the alignment in Arena without changing GpuComplex.

AlexanderSinn · 2023-12-18T21:44:13Z

The alignment needs to be known at compile time so that nvcc can emit the 16-byte read/write. cudaMalloc allocations are actually 256-byte aligned.

AlexanderSinn · 2023-12-18T21:54:34Z

It generates different types of load and store instructions for aligned and non-aligned types.
https://godbolt.org/z/hMv9TWxzz

WeiqunZhang · 2023-12-18T22:32:26Z

Yes, you are right. Since the basic types like unsigned long long need 16 bytes alignment, malloc is at least 16 bytes aligned.
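The host-side allocation alignment can be checked with a short sketch. `malloc_is_max_aligned` is a hypothetical helper. It relies on the guarantee that `std::malloc` returns storage suitably aligned for any type with fundamental alignment, which is expressed here through `alignof(std::max_align_t)`. That value is 16 on common 64-bit platforms, but this is an assumption about the platform rather than a guarantee.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocates `bytes` with std::malloc and reports whether
// the returned address is a multiple of alignof(std::max_align_t).
// Returns false if the allocation itself fails.
inline bool malloc_is_max_aligned(std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p == nullptr) {
        return false;
    }
    const bool ok =
        reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0;
    std::free(p);
    return ok;
}
```

This only describes where the allocation starts at run time. As noted earlier in the thread, the compiler still needs the alignment in the type to pick the wider load and store instructions.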

WeiqunZhang · 2024-01-09T17:35:37Z

I think we should add the required alignment to amrex::GpuComplex. @AlexanderSinn Could you submit a PR?

AlexanderSinn · 2024-01-09T17:44:58Z

Yes

## Summary

As discussed in #3677, this PR makes the alignment of `amrex::GpuComplex` stricter to allow for coalesced memory accesses of arrays of GpuComplex by nvidia GPUs such as A100. Note that this may break `reinterpret_cast` from an array allocated as `std::complex` to `amrex::GpuComplex`, but not the other way around.

## Additional background

Typical allocators (malloc, amrex CArena) give memory aligned to 16 bytes and CUDA allocators aligned to 256 bytes, which is sufficient for `amrex::GpuComplex<double>`.

## Checklist

The proposed changes:

- [x] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX users
- [ ] include documentation in the code and/or rst files, if appropriate
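The one-directional nature of the cast concern can be seen in a templated sketch. `SketchComplex` is a hypothetical type written for illustration and is not the actual `amrex::GpuComplex` definition. The sketch type is at least as strictly aligned as `std::complex<double>`, so storage that satisfies the sketch type's alignment also satisfies that of `std::complex<double>`, while the reverse does not necessarily hold.

```cpp
#include <complex>

// Hypothetical sketch of a complex type aligned to its own size.
// Not the actual amrex::GpuComplex definition.
template <typename T>
struct alignas(2 * sizeof(T)) SketchComplex {
    T m_real;
    T m_imag;
};

static_assert(alignof(SketchComplex<float>) == 2 * sizeof(float),
              "float variant is aligned to its size");
static_assert(alignof(SketchComplex<double>) == 2 * sizeof(double),
              "double variant is aligned to its size");
// The sketch type is never less strictly aligned than std::complex<double>.
static_assert(alignof(SketchComplex<double>) >= alignof(std::complex<double>),
              "sketch alignment is at least that of std::complex");
```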

AlexanderSinn · 2024-01-09T20:56:50Z

Fixed in #3691

AlexanderSinn added the GPU label Dec 18, 2023

AlexanderSinn mentioned this issue Jan 9, 2024

Align GpuComplex to its size #3691

Merged

5 tasks

AlexanderSinn closed this as completed Jan 9, 2024
