
Add support for AMD GPUs #157

Merged 2 commits into develop on Oct 11, 2024

Conversation

samhatfield (Collaborator)

Finally, we are able to run the develop branch on AMD GPUs.

Many, many thanks to @PaulMullowney for getting this done.

Note that #155 should be merged first.

@samhatfield added the bug (Something isn't working) and gpu labels on Oct 2, 2024
samhatfield (Collaborator, Author)

Note that graph support has not been tested on AMD GPUs yet.

@samhatfield force-pushed the samhatfield/amd_gpu_workarounds branch 2 times, most recently from a8592fe to f1301b3, on October 9, 2024 10:24
wdeconinck (Collaborator) commented on Oct 11, 2024

For future reference, here is my understanding of this bizarre MPI-related change.
It is needed because the MPI_F08 version of MPI_ALLTOALLV in the Cray MPI library appears to have a bug related to GPU-aware MPI communication.

Removing MPI_ALLTOALLV from the USE statement means that the MPI_F08-module-mangled symbol (e.g. _mpi_f08_mod_mpi_alltoallv) is no longer used; instead, the call resolves to the interface-less MPI_F77 _mpi_alltoallv symbol, which by luck(?) happens to be linked in as well, though that is not necessarily guaranteed.

Also, because the interface is missing, there is no argument checking: we can pass anything, in any number of arguments. It so happens (again by luck?) that the signature matches at the byte level, so the TYPE(MPI_COMM) argument is compatible with the INTEGER argument expected by the MPI_F77 API.
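
For concreteness, here is a minimal sketch of what the change amounts to. The subroutine and variable names are hypothetical, not the actual PR diff:

```fortran
! Illustrative sketch of the workaround; names are hypothetical, not the PR diff.
subroutine gpu_alltoallv(sendbuf, scounts, sdispls, recvbuf, rcounts, rdispls, comm)
  ! Before: MPI_ALLTOALLV was imported from the MPI_F08 module, so the call
  ! bound to the module-mangled, interface-checked symbol:
  !   use mpi_f08, only : mpi_comm, mpi_double_precision, mpi_alltoallv
  ! After: it is dropped from the ONLY list, so the call below has no explicit
  ! interface and resolves to the external F77-style symbol instead:
  use mpi_f08, only : mpi_comm, mpi_double_precision
  implicit none
  real(kind=8), intent(in)   :: sendbuf(*)
  real(kind=8), intent(out)  :: recvbuf(*)
  integer, intent(in)        :: scounts(*), sdispls(*), rcounts(*), rdispls(*)
  type(mpi_comm), intent(in) :: comm
  integer :: ierr

  ! With no interface in scope, TYPE(MPI_COMM) and the F08 datatype handles
  ! are passed where the F77 API expects INTEGER handles; they happen to
  ! match at the byte level, as noted above.
  call mpi_alltoallv(sendbuf, scounts, sdispls, mpi_double_precision, &
     &               recvbuf, rcounts, rdispls, mpi_double_precision, &
     &               comm, ierr)
end subroutine gpu_alltoallv
```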

I don't consider this a final solution, but it is good enough for now, seeing that it works experimentally on some of the platforms/compilers/MPI libraries we have tested, and that the GPU adaptations are very much in flux.

samhatfield (Collaborator, Author)

Exactly. The implication is also that there is something wrong with _mpi_f08_mod_mpi_alltoallv on the only Cray platform we have access to and care about: LUMI-G. When you USE MPI_ALLTOALLV, presumably the _mpi_f08_mod_mpi_alltoallv symbol is used, as you say, and this version somehow doesn't work correctly when device pointers are passed for the buffer arguments. Hence, GPU-aware communication isn't possible: the data that comes out is all NaNs.

It would be great to confirm @wdeconinck's hypothesis by inspecting the object code generated when you do and don't import MPI_ALLTOALLV.
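
One way to do that (a sketch assuming a Cray environment; the ftn wrapper, file name, and mangled symbol names are illustrative and vary by compiler) is to compile a minimal caller both ways and compare the referenced symbols with nm:

```fortran
! check_alltoallv.f90 -- hypothetical minimal caller for symbol inspection.
! Compile and inspect twice, once with and once without the F08 import:
!   ftn -c check_alltoallv.f90
!   nm check_alltoallv.o | grep -i alltoall
! With MPI_ALLTOALLV in the ONLY list one expects a module-mangled F08 symbol;
! without it, the plain external symbol (e.g. mpi_alltoallv_).
subroutine check_alltoallv(sendbuf, recvbuf, counts, displs, comm)
  use mpi_f08, only : mpi_comm, mpi_double_precision  ! add mpi_alltoallv to test the F08 path
  implicit none
  real(kind=8), intent(in)   :: sendbuf(*)
  real(kind=8), intent(out)  :: recvbuf(*)
  integer, intent(in)        :: counts(*), displs(*)
  type(mpi_comm), intent(in) :: comm
  integer :: ierr
  call mpi_alltoallv(sendbuf, counts, displs, mpi_double_precision, &
     &               recvbuf, counts, displs, mpi_double_precision, comm, ierr)
end subroutine check_alltoallv
```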

Co-authored-by: Paul Mullowney <[email protected]>
Co-authored-by: Willem Deconinck <[email protected]>
@wdeconinck force-pushed the samhatfield/amd_gpu_workarounds branch from 0a5b6df to 6840754 on October 11, 2024 11:48
@wdeconinck merged commit b4a4cc1 into develop on Oct 11, 2024
21 checks passed
@samhatfield deleted the samhatfield/amd_gpu_workarounds branch on October 11, 2024 13:05