
Add support for AMD GPUs #157

Merged 2 commits into develop on Oct 11, 2024

Conversation

samhatfield (Collaborator)

Finally, we are able to run the develop branch on AMD GPUs.

Many, many thanks to @PaulMullowney for getting this done.

Note that #155 should be merged first.

@samhatfield added the bug (Something isn't working) and gpu labels on Oct 2, 2024
samhatfield (Collaborator, Author)

Note that graph support has not been tested on AMD GPUs yet.

@samhatfield force-pushed the samhatfield/amd_gpu_workarounds branch 2 times, most recently from a8592fe to f1301b3, on October 9, 2024 10:24
wdeconinck (Collaborator) commented on Oct 11, 2024

For future reference, here is my understanding of this bizarre MPI-related change.
It is needed because the MPI_F08 version of MPI_ALLTOALLV in the Cray MPI library appears to have a bug related to GPU-aware MPI communication.

Removing MPI_ALLTOALLV from the USE statement means that the MPI_F08-module-mangled symbol (e.g. _mpi_f08_mod_mpi_alltoallv) is no longer used; instead, the call resolves to the interface-less MPI_F77 _mpi_alltoallv symbol, which by luck(?) happens to be linked in as well, though that is not necessarily guaranteed.

Also, because the interface is missing, there is no argument checking: we can pass anything, in any number of arguments. It so happens (again by luck?) that the signature matches at the byte level, so the TYPE(MPI_COMM) argument is compatible with the INTEGER argument expected by the MPI_F77 API.
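
For concreteness, here is a minimal sketch of what the change amounts to. The subroutine and variable names are hypothetical, not the actual PR diff:

```fortran
! Illustrative sketch of the workaround; names are hypothetical, not the PR diff.
subroutine gpu_alltoallv(sendbuf, scounts, sdispls, recvbuf, rcounts, rdispls, comm)
  ! Before: MPI_ALLTOALLV was imported from the MPI_F08 module, so the call
  ! bound to the module-mangled, interface-checked symbol:
  !   use mpi_f08, only : mpi_comm, mpi_double_precision, mpi_alltoallv
  ! After: it is dropped from the ONLY list, so the call below has no explicit
  ! interface and resolves to the external F77-style symbol instead:
  use mpi_f08, only : mpi_comm, mpi_double_precision
  implicit none
  real(kind=8), intent(in)   :: sendbuf(*)
  real(kind=8), intent(out)  :: recvbuf(*)
  integer, intent(in)        :: scounts(*), sdispls(*), rcounts(*), rdispls(*)
  type(mpi_comm), intent(in) :: comm
  integer :: ierr

  ! With no interface in scope, TYPE(MPI_COMM) and the F08 datatype handles
  ! are passed where the F77 API expects INTEGER handles; they happen to
  ! match at the byte level, as noted above.
  call mpi_alltoallv(sendbuf, scounts, sdispls, mpi_double_precision, &
     &               recvbuf, rcounts, rdispls, mpi_double_precision, &
     &               comm, ierr)
end subroutine gpu_alltoallv
```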

I don't consider this a final solution, but it is good enough for now, seeing that it works experimentally on some of the platforms/compilers/MPI libraries we have tested, and that the GPU adaptations are very much in flux.

samhatfield (Collaborator, Author)

Exactly. The implication is also that there is something wrong with _mpi_f08_mod_mpi_alltoallv on the only Cray platform we have access to and care about: LUMI-G. When you USE MPI_ALLTOALLV, presumably the _mpi_f08_mod_mpi_alltoallv symbol is used, as you say, and this version somehow doesn't work correctly when device pointers are passed for the buffer arguments. Hence, GPU-aware communication isn't possible: the data that comes out is all NaNs.

It would be great to confirm @wdeconinck's hypothesis by inspecting the object code generated when you do and don't import MPI_ALLTOALLV.
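
One way to do that (a sketch assuming a Cray environment; the ftn wrapper, file name, and mangled symbol names are illustrative and vary by compiler) is to compile a minimal caller both ways and compare the referenced symbols with nm:

```fortran
! check_alltoallv.f90 -- hypothetical minimal caller for symbol inspection.
! Compile and inspect twice, once with and once without the F08 import:
!   ftn -c check_alltoallv.f90
!   nm check_alltoallv.o | grep -i alltoall
! With MPI_ALLTOALLV in the ONLY list one expects a module-mangled F08 symbol;
! without it, the plain external symbol (e.g. mpi_alltoallv_).
subroutine check_alltoallv(sendbuf, recvbuf, counts, displs, comm)
  use mpi_f08, only : mpi_comm, mpi_double_precision  ! add mpi_alltoallv to test the F08 path
  implicit none
  real(kind=8), intent(in)   :: sendbuf(*)
  real(kind=8), intent(out)  :: recvbuf(*)
  integer, intent(in)        :: counts(*), displs(*)
  type(mpi_comm), intent(in) :: comm
  integer :: ierr
  call mpi_alltoallv(sendbuf, counts, displs, mpi_double_precision, &
     &               recvbuf, counts, displs, mpi_double_precision, comm, ierr)
end subroutine check_alltoallv
```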

Co-authored-by: Paul Mullowney <[email protected]>
Co-authored-by: Willem Deconinck <[email protected]>
@wdeconinck force-pushed the samhatfield/amd_gpu_workarounds branch from 0a5b6df to 6840754 on October 11, 2024 11:48
@wdeconinck merged commit b4a4cc1 into develop on Oct 11, 2024
21 checks passed
@samhatfield deleted the samhatfield/amd_gpu_workarounds branch on October 11, 2024 13:05