Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

amrex::GpuComplex not ideal for GPU #3677

Closed
AlexanderSinn opened this issue Dec 18, 2023 · 13 comments
Closed

amrex::GpuComplex not ideal for GPU #3677

AlexanderSinn opened this issue Dec 18, 2023 · 13 comments
Labels

Comments

@AlexanderSinn
Copy link
Member

Specifically when accessing an array of amrex::GpuComplex on the GPU, it would be better if the alignment of amrex::GpuComplex was 2*alignof(T) instead of alignof(T). This is because one GPU thread can write both real and imaginary part in a single memory transaction if the memory region is aligned to its size. If this is not the case the memory access from the GPU warp is non-coalesced (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses).

struct alignas(2*sizeof(amrex::Real)) AlignedComplex {
    amrex::Real real;
    amrex::Real imag;
};

Using AlignedComplex was more than 10 times faster than amrex::GpuComplex<amrex::Real> (Real = double) when writing to a large array in pinned memory form an A100 (pinned memory is especially sensitive to non-coalesced memory access).

Note: increasing the alignment would break reinterpret_cast from std::complex to amrex::GpuComplex.

@BenWibking
Copy link
Contributor

BenWibking commented Dec 18, 2023

reinterpret_cast between different types (except when casting to char, or unsigned vs. signed variants of the same type) is always undefined behavior. That should never be done, and abuse of reintepret_cast has caused incorrect results with our AMReX code on AMD GPUs.

@WeiqunZhang
Copy link
Member

https://en.cppreference.com/w/cpp/numeric/complex Casting from std::complex to T[2] or T is allowed.

@WeiqunZhang
Copy link
Member

NVIDIA/libcudacxx#151 Looks like cuda::std::complex aligns to 2*sizeof(T) now.

@WeiqunZhang
Copy link
Member

@AlexanderSinn How did you create the array of GpuComplex? I am asking because the alignmen of amrex::Arena allocation is 16.

@AlexanderSinn
Copy link
Member Author

AlexanderSinn commented Dec 18, 2023

amrex::BaseFab<AlignedComplex> or amrex::BaseFab<amrex::GpuComplex<amrex::Real>>
fab.resize(domain, 1, amrex::The_Pinned_Arena());
fab.setVal<amrex::RunOn::Host>({0,0});

@WeiqunZhang
Copy link
Member

That probably means cudaMalloc and cudaHostMalloc allocate memory with smaller alignment size, I guess, because in the Arean we make sure the blocks we give out are multiples of 16 bytes.

@WeiqunZhang
Copy link
Member

So we might be able to fix the alignment in Arena without changing GpuComplex.

@AlexanderSinn
Copy link
Member Author

The alignment needs to be known at compile time so nvcc can do the 16-byte read/write. cudaMalloc is actually 256 bytes aligned.

@AlexanderSinn
Copy link
Member Author

AlexanderSinn commented Dec 18, 2023

It generates different types of load and store instructions for aligned and non-aligned types.
https://godbolt.org/z/hMv9TWxzz

@WeiqunZhang
Copy link
Member

Yes, you are right. Since the basic types like unsigned long long need 16 bytes alignment, malloc is at least 16 bytes aligned.

@WeiqunZhang
Copy link
Member

I think we should add the required alignment to amrex::GpuComplex. @AlexanderSinn Could you submit a PR?

@AlexanderSinn
Copy link
Member Author

Yes

WeiqunZhang pushed a commit that referenced this issue Jan 9, 2024
## Summary

As discussed in #3677, this PR makes the alignment of
`amrex::GpuComplex` stricter to allow for coalesced memory accesses of
arrays of GpuComplex by nvidia GPUs such as A100.

Note that this may break `reinterpret_cast` from an array allocated as
`std::complex` to `amrex::GpuComplex`, but not the other way around.

## Additional background

Typical allocators (malloc, amrex CArena) give memory aligned to 16
bytes and CUDA allocators aligned to 256 bytes, which is sufficient for
`amrex::GpuComplex<double>`.

## Checklist

The proposed changes:
- [x] fix a bug or incorrect behavior in AMReX
- [ ] add new capabilities to AMReX
- [ ] changes answers in the test suite to more than roundoff level
- [ ] are likely to significantly affect the results of downstream AMReX
users
- [ ] include documentation in the code and/or rst files, if appropriate
@AlexanderSinn
Copy link
Member Author

Fixed in #3691

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants